Title: Design of High-Throughput FFT Processor with High Parallelism and Normal Input/Output Ordering
Authors: Huang, Shen-Jui (黃紳睿)
Chen, Sau-Gee (陳紹基)
Institute of Electronics (電子研究所)
Keywords: Fast Fourier Transform; OFDM; Pipelined architecture; Memory-based architecture; WPAN
Issue Date: 2012
Abstract:
In recent years, Orthogonal Frequency Division Multiplexing (OFDM) has been widely adopted in new and emerging broadband communication standards, such as Wireless Local Area Network (WLAN), Wireless Metropolitan Area Network (WMAN), Digital Video Broadcasting (DVB), and 4G mobile communication systems. The Fast Fourier Transform (FFT) is one of the key operations in OFDM-based systems. Because the specifications and data rates of current and future broadband communication systems are far more demanding than before, FFT processor design faces challenging issues on several fronts. This dissertation addresses these challenges and proposes the following solutions to improve FFT processor performance. First, to enhance the computing capability of the processing element (PE), two optimized PE designs based on a high-radix FFT algorithm are proposed. Unlike the conventional, widely used radix-8 design, they use the radix-4² FFT algorithm with suitable folding schemes to obtain two 16-parallel kernel processing engines with high throughput and low area complexity. Second, to simplify twiddle factor multiplication, a new pipelined multiplier architecture is proposed that performs low-complexity twiddle factor multiplication and needs no memory for storing twiddle factors. The architecture can be configured to support variable FFT lengths of up to 32768 points. Compared with conventional schemes that use complex multipliers with ROM tables, or with CORDIC-based architectures, it has better area efficiency and is more suitable for high-speed applications.
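As a rough illustration of what a radix-4² decomposition means, the sketch below computes a 16-point DFT as two stages of 4-point DFTs with W16 twiddle factors in between, and can be checked against a direct DFT. It is a textbook Cooley-Tukey factorization given only as an assumption about the underlying algorithm, not the dissertation's folded processing element; the names `dft_direct`, `dft4`, and `dft16_radix4x4` are hypothetical.

```python
import cmath


def dft_direct(x):
    """Reference O(N^2) DFT used only for checking."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


def dft4(a):
    """4-point DFT: only additions and a multiplication by -j."""
    s0, s1 = a[0] + a[2], a[0] - a[2]
    s2, s3 = a[1] + a[3], (a[1] - a[3]) * -1j
    return [s0 + s2, s1 + s3, s0 - s2, s1 - s3]


def dft16_radix4x4(x):
    """16-point DFT as two radix-4 stages (input n = 4*n1 + n2, output k = k1 + 4*k2)."""
    assert len(x) == 16
    # Stage 1: four 4-point DFTs over samples spaced 4 apart.
    stage1 = [dft4([x[n2 + 4 * n1] for n1 in range(4)]) for n2 in range(4)]
    out = [0j] * 16
    for k1 in range(4):
        # Inter-stage twiddle factors W16^(n2*k1), then the second 4-point DFT.
        col = [stage1[n2][k1] * cmath.exp(-2j * cmath.pi * n2 * k1 / 16)
               for n2 in range(4)]
        y = dft4(col)
        for k2 in range(4):
            out[k1 + 4 * k2] = y[k2]
    return out
```

In this factorization the only non-trivial multiplications are the W16 twiddles between the two stages; the 4-point DFTs themselves need just adders and a swap-and-negate for the factor -j.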
Third, to resolve memory access conflicts during FFT operations and to provide normal-order input/output, which is especially challenging for PE architectures with high parallelism, two conflict-free memory addressing schemes are devised. The proposed addressing schemes are in-place and deliver output data in normal order once the FFT operation is completed, so no reordering buffer is required. This property eases integration of the FFT processor with other functional blocks, such as a channel estimator or an equalizer. Finally, to achieve a high signal-to-quantization-noise ratio (SQNR) in fixed-point realization, an improved block floating-point scheme is developed, which dynamically performs data scale detection, scaling, and scale alignment during FFT operations. In addition, a scaling-dependent butterfly execution scheme is proposed, which facilitates scale alignment and efficient access to the exponent memory. Three FFT processors for OFDM communication systems are proposed in this dissertation, each building on the techniques above. First, a 512-point FFT processor with a throughput of 2.59 GS/s is designed for the IEEE 802.15.3c WPAN system. Based on the radix-4² FFT algorithm and a folded-by-4 scheme, a 16-parallel high-performance kernel processing engine is realized. Implemented in a 90 nm CMOS process, the design has a core area of 0.93 mm² and consumes 42 mW at a 324 MHz clock rate. The second design also applies the radix-4² FFT algorithm. With a folded-by-2 scheme and proper scheduling, a kernel with 16-way parallelism and high area efficiency is proposed for ultra-long FFT lengths. FFT lengths from 1024 to 32768 points are supported through internal switch configurations.
The proposed design is implemented in a 90 nm CMOS process; its core area is 2.98 mm² and its power consumption is 29 mW at a 160 MHz clock rate. With a 16-bit wordlength, the SQNR exceeds 70 dB for FFT lengths from 1024 to 32768 points. In the third design, a memory-efficient FFT processor for the IEEE 802.16e (WiMAX) system is proposed. By exploiting the features of the UL-PUSC (Uplink Partial Usage of Subchannels) mode, separate memory banks store the FFT input data and the intermediate data produced during FFT operations. Moreover, with the proposed circular memory scheduling scheme, the required memory is reduced by more than 50% in continuous-mode operation.
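To make the notion of conflict-free, in-place memory addressing more concrete, the following sketch uses a well-known digit-sum bank assignment: a data index is written in base r, and its bank is the sum of its digits modulo r. This is not one of the two schemes proposed in the dissertation (which additionally provide normal-order output); the function names and the choice of `n // radix` as the in-bank address are assumptions made for illustration.

```python
def digits(n, radix, count):
    """Return `count` base-`radix` digits of n, least significant first."""
    out = []
    for _ in range(count):
        out.append(n % radix)
        n //= radix
    return out


def bank_and_address(n, radix, count):
    """Map data index n to (bank, address within bank)."""
    bank = sum(digits(n, radix, count)) % radix
    return bank, n // radix


def conflict_free(radix, count):
    """Check every butterfly of an in-place FFT with radix**count points.

    At stage s, the operands of one butterfly differ only in digit s,
    so they should fall into `radix` different banks.
    """
    size = radix ** count
    for stage in range(count):
        stride = radix ** stage
        for base in range(size):
            if (base // stride) % radix != 0:
                continue  # not the first operand of a butterfly
            banks = {bank_and_address(base + j * stride, radix, count)[0]
                     for j in range(radix)}
            if len(banks) != radix:
                return False
    return True
```

For example, `conflict_free(16, 2)` checks a 256-point transform with 16 banks, so that all 16 operands of a butterfly can be read in the same cycle. A plain `n % radix` bank assignment would not pass the same check, because operands separated by a stride that is a multiple of the radix all land in one bank.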
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT079511831
http://hdl.handle.net/11536/41063
Appears in Collections: Thesis