標題: 算術運算模組設計及其應用於傅立葉轉換
Arithmetic Module Design and its Application to FFT
作者: 葉文昌
Wen-Chang Yeh
任建葳
Chein-Wei jen
電子研究所
關鍵字: 加法器;乘法器;傅立葉轉換;分頻正交多工器;adder;multiplier;Fourier transform;Orthogonal Frequency Division Multiplexing
公開日期: 2000
摘要: 算術運算模組設計及其應用於傅立葉轉換 研究生:葉文昌 指導教授:任建葳博士 國 立 交 通 大 學 電 子 工 程 學 系 摘要 加法和乘法在數位訊號處理之中是最基本的運算,而且被廣泛的運用在現今先進的計算器和各式不同的應用之中。雖然它們在過去的文獻之中一直被廣泛研究著,大部份的研究僅考慮演算法和架構的探討而未將運算的時程規劃一併考慮。在此論文之中,我們將研究運算模組的設計並同時以系統化的方法考量位元層次和字元層次運算的排程。 對於二補數加法,我們定義一組運算子和標記方式來探討傳統以進位前估進位法及狀況和進位法的加法演算法。由得到的方程式,我們可以分別描述進位産生的路徑以及和産生的路徑。此外,我們可藉由使用這些方程式從演算法的層次上來證明這些加法演算法可以有幾乎完全一致的空間幾何分佈。因此,我們可由此辨識出各個加法演算法在産生進位以及和時不同的特性。基於這些特性,我們提出兩個以時間最佳化爲考量的演算法。一個是雙位元預測演算法,而另一個則是一般化最早到達優先演算法。和傳統演算不同的地方在於我們提出的一般化最早到達優先演算法可使用進位前估進位法或狀況和進位法並可以完全利用輸入資料在時間上的剖面特性。 對於位元平行的乘法,它包含了三個步驟:部份乘積的産生、部份乘積的縮減以及最後的加法。對於部份乘積的産生,我們研究直接産生法和用修改式的Booth編碼方法間的差異。我們提出一個創新的修改式Booth編碼方法來讓所有部份乘積能夠在兩個互斥邏輯閘之內産生,以及加強由改進式Booth編碼方法産生的部份乘積陣列來改善在最低位元部份轉換成二補數時的效能。我們也檢驗由三維最小化演算法與改進式Booth編碼方法配合下所産生的部份乘積縮減樹。對於三維最小化演算法我們也評估使用不同的加法器所能達成的效能改善以及提出一個和以及進位分離的方法來進一步最佳化部份乘積縮減樹。而對最後加法而言,我們使用了先前所提出的一般化加法。平均而言,乘法可因此而改進約百分之十左右的效能。 而爲了要研究字元層次的運算排程,我們探討由一連串的加法和乘法所組成的快速傅立葉轉換的設計空間。在演算法層次,我們檢驗訊息流程圖來檢查以Cooley-Tukey分解法爲基礎的演算法。對於任何基於Cooley-Tukey分解法的演算法,我們更進一步的提供了一套系統化的設計方法來得到其規則化的架構。第一個架構是一維管線化的架構。由於這套設計方法的幫助,我們克服了分裂式基底演算法的不規則特性並為其設計出規則的一維管線化的架構。根據我們的管線化架構設計方法,我們進一步推導出如何設計單處理器的架構。而對於這兩種架構我們都推導出了有關其效能、輸出率和硬體需求等特性。 而許多有關於如何為分頻正交多工系統設計傅立葉轉換的問題我們在論文的最後一部份討論。一維管線化的設計被用來達成系統中有關高效能和低功率的要求。對於高速的應用而言,我們設計了一個延遲平衡式的架構將乘法中的最後加法由乘法器移到蝴蝶式單元來移除不必要的進位傳遞的加法。對於低速的應用而言,我們提出三個乘法的方法來完成複數乘法以降低硬體成本和功率的消耗。根據合成後以及佈局後的模擬,分裂式基底的管線架構可適用於高速和低速的不同應用之中。和傳統的22為基底的架構相較,我們所提出的設計可降低百分之十五的功率和百分之十四點五的周期當運作在100Mhz,3.3伏和25度C。
Arithmetic Module Design and its Application to FFT Student: Wen-Chang Yeh Advisor:Prof. Chein-Wei Jen Department of Electronics Engineering, National Chiao-Tung University Abstract Addition and multiplication are the most fundamental operations for digital signal processing and are widely employed in modern computers and various applications. Although they have been studied extensively in the literature, most of the research focuses on algorithm and architecture exploration without considering operation scheduling at the same time. In this dissertation, we shall study the design of arithmetic modules and consider bit-level and word-level scheduling problems at the same time with systematic methodology. For two's complement addition, a set of operators and notations have been developed to explore the relationships among the conventional carry-lookahead based and conditional-sum based algorithms. From the obtained formulae, we can describe the path that generates carry and the path that generates sum separately. Moreover, by using these formulae we can prove that these algorithms can have almost identical topology from an algorithm perspective. Hence, we can clarify and identify distinctive features of the carry-generation and the sum-generation for each algorithm. By exploring these features, two timing-driven generalized addition algorithms, dual-bit forward prediction (DFP) and generalized earliest-first (GEF), have been proposed to achieve high performance. Unlike traditional algorithms, the proposed GEF algorithm can use conditional-sum and carry-lookahead rules to generate optimized adder that fully exploits the features of input delay profile. For bit-parallel multiplication, it consists of three steps: partial product generation, partial product reduction, and final addition. For partial product generation, direct generation and several popular modified-Booth encoding (MBE) schemes are studied. A novel MBE algorithm has been proposed to generate the partial products within two exclusive-or (XOR) gates while the power consumption are minimized by removing spurious signal glitching in the partial product array. A new partial product array for MBE has been derived to improve the performance of the LSB part. We also examine the performance of partial product reduction tree (PPRT) optimized via three-dimensional minimization (TDM) algorithm with the proposed MBE scheme. For TDM algorithm, we evaluate the performance improvement achievable by using different full adders and present a powerful sum-carry separation technique to improve the output profile. When the generalized addition algorithms developed herein is applied to the final addition of multiplication, more than 10% speed improvement can be achieved while the hardware cost and the power consumption of the adder can also be reduced. To study word-level operation scheduling, we explore the design space of Fast Fourier Transform (FFT) algorithm, which consists of a series of addition and multiplication. At algorithm level, we inspect the algorithms based on Cooley-Tukey decomposition by examining their signal flow graphs. Furthermore, for any algorithm based on Cooley-Tukey decomposition, we provide a systematic design methodology to obtain regular hardware architectures. The first one is one-dimensional (1D) pipeline architecture. With the aid of the proposed design methodology, we can overcome the irregularity of split-radix FFT algorithm and obtain a regular 1D pipeline architecture for the split-radix algorithm. Through similar design procedure, other higher radix algorithms can also be obtained. Based on the pipeline architecture, we further derive the design methodology for single processing element architecture. For both pipeline based and the single processing element based designs, we have derived the properties regarding the performance, throughput and the hardware requirements. Various issues related to the design of FFT for OFDM system are discussed at the last part of this dissertation. One-dimensional pipeline has been adopted to meet the requirements of both high performance and low power. For high-speed application, delay-balanced hardware architecture has been proposed to remove unnecessary carry-propagation additions where the final addition of multiplication is merged into the butterfly operation to replace CPA with CSA. For low speed application, we also present a three-multiplication to further reduce the hardware cost and the power consumption. Based on the post-synthesis and post-layout simulations, split-radix pipeline architecture is recognized as a good candidate for both high-speed and low-speed applications. Compared with a conventional 64-point radix-22 pipeline, the proposed split-radix design reduces 15% power and 14.5% latency at 100Mhz, 3.3v, and 25℃.
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT890428158
http://hdl.handle.net/11536/67240
顯示於類別:畢業論文