標題: | 新式位元層次設計方法及其應用於離散弦轉換 A Novel Bit-level Design Approach and its Application to Discrete Sinusoidal Transforms |
作者: | 陳漢臣 任建葳 張添烜 Chein-Wei Jen Tian-Sheuan Chang 電子研究所 |
關鍵字: | 迴旋疊積;離散弦轉換演;分散式算數;cyclic convolution;Discrete Sinusoidal transforms;distributed arithmetic |
公開日期: | 2005 |
摘要: | 離散弦轉換已被廣泛的應用於數位信號處理,諸如: 影像處理、數位濾波器及數位通訊等...。於低成本架構設計的研究,文獻中雖已有許多的設計,但都因只考慮到係數的常數特性而未真正有效的著眼於不同演算法中這些係數的數值特性,因此高效能低成本離散弦轉換之架構設計仍有極大的著墨空間。對此,本論文提出了一低記憶體成本之位元層次設計方法並應用於高效能低成本之離散弦轉換架構設計上。
本論文以迴旋疊積離散弦轉換演算法為基礎,同時利用分散式算數將輸入資料分解至位元層次進而去除潛在的冗餘而提出名為群組式分散式算數之低記憶體成本之位元層次設計方法;對於一個N 點的迴旋疊積運算,所提出之新的分散式算數設計方法僅僅使用了一組遠小於傳統式分散式算數設計方法的記憶體、一組N位元之移位暫存器、及N個累加器。跟據輸入資料之迴旋特性,我們重新安排了分散式算數架構中記憶體的內容進而消除了原先儲存於記憶體中重複出現之係數和而達到降低硬體成本之目的。與傳統之分散式算數設計比較,所提出之群組式分散式算數設計可使記憶體成本由 降至 ;若考慮額外付出的硬體代價,硬體成本則由 改善至 。此外,為了使所提出之群組式分散式算數設計方法可應用於長點數之設計,群組式分散式算數之分割問題是在提出一種新的分散式算數方法時必須面對的。對質數點數及非質數點數我們分別結合了Agarwal-Cooley及Pseudocirculant matrix factorization等分割演算法進行迴旋疊積之分割,這樣的結合使得群組式分散式算數設計方法在低成本長點數的迴旋疊積設計上一併得到了解決方案,也進而提升了此二分割演算法在實際應用上之價值。
在離散弦轉換的實現上,為使能夠更進一步降低硬體成本,在離散傅利葉轉換的設計中我們進一步利用了其中係數之對稱性,使得在群組式分散式算數之離散傅利葉轉換設計上可再降低一半的記憶體成本。而在離散餘絃轉換的設計中,由於迴旋疊積的不完美,為使群組式分散式算數方法能順利的應用於離散餘絃轉換的設計之中,我們亦利用了離散餘絃轉換中係數之對稱性將原來的迴旋疊積演算法轉換成一完美的迴旋疊積演算法,進而使得一個低成本的群組式分散式算數離散餘絃轉換架構得以實現,這樣的一個處理也使得群組式分散式算數在離散餘絃轉換的實現上亦減少了一半的記憶體成本。與現存的心脈式陣列架構及其他分散式算數架構之離散絃轉換設計比較,所提出之群組式分散式算數架構可節省超過29% 的延遲時間-硬體成本乘積值。
考慮在通訊系統上的應用,本研究最後嘗試使用所提出之低硬體成本群組式分散式算數設計方法來實現長點數且為可變點數之二的次方長度之離散傅利葉轉換。我們使用Cooley-Tukey 演算法先對離散傅利葉轉換進行分解,再使用pseudocirculant matrix factorization 演算法對分解後的離散傅利葉轉迴旋疊積式進行進一步的分割,使得一長點數的問題仍可利用低硬體成本之群組式分散式算數加以實現。所提出之以群組式分散式算數設計為基礎的可變點數離散傅利葉轉換架構可適用於64/128/256/512/1024/2048/4096 等長度之離散傅利葉轉換。此外,所提出之架構亦適用於任意長度之離散傅利葉轉換實現。與現存的長點數及可變點數FFT架構比較,除了潛在延遲較短及高硬體使用率的優點外,在單位產出率下,當長度小於256時,本架構可節省超過 9.6% 的硬體成本;因此,所提出的是一個具相當競爭力的硬體架構實現。除了上述有關離散弦轉換的應用外,本論文所提出之設計方法亦適用於任何有關迴旋運算的數位信號處理方面的應用上。 The Discrete Sinusoidal transforms (DSST’s) have been widely used in many digital signal processing applications such as image processing, digital filtering, digital communication, and etc. Although many designs of the DSST’s have been proposed in the literatures, their designs are still not efficient enough since they exploit only the constant property of the transform coefficients without considering the numerical property of these coefficients in the reformulated algorithms to further optimize the hardware cost. This dissertation proposes a novel bit-level hardware-efficient group distributed arithmetic (GDA) design and its applications for DSST’s designs. In the proposed GDA design approach, first we formulate the algorithm of DSST’s into cyclic convolution form in algorithm level. Then we use the distributed arithmetic to decompose the input data into bit-level in architecture level. Thus, the data redundancy due to the cyclic convolution can be efficiently removed within the bit-level input context to facilitate a hardware efficient DA realization. The proposed GDA approach rearranges the contents of DA memory according to its cyclic property such that redundancy of the contents can be eliminated and only a few groups of data are needed. Thus, compared with the conventional DA design, the memory cost of the proposed GDA design can be reduced from to , and accounting with the necessary overhead, the overall complexity is improved from to . To further extend its applications to long length designs, we further combine the Agarwal-Cooley algorithm and Pseudocirculant matrix factorization algorithm. This can partition the long length cyclic convolution into short ones while can still maintain its cyclic property, which avoids the non-cyclic problem of direct partitioning. Thus the proposed GDA design can efficiently be applied to realize each of the shortened cyclic convolution blocks to achieve low hardware cost. The proposed GDA design approach has been applied successfully to the DFT, DHT and DCT designs. For DFT design, we further combine the symmetrical property of the DFT coefficients with the proposed GDA design approach such that this design requires only half the contents to be stored. This further reduces the memory size by a factor of two. For the DCT design, in addition to the symmetry property of DCT coefficients, we further reformulate the non-cyclic DCT kernel into two perfect cyclic forms such that the DCT can be implemented by the GDA design approach with less hardware of (N-1)/2 adders or substractors, one much small memory module, a (N-1)/2-bit barrel shifter, and (N-1)/2+1 accumulators. Compared with the existing systolic array designs and DA-based designs, the realizations of 1-D DFT, DHT, and DCT with the proposed GDA design approach reduce the delay-area product more than 29% according to a 0.35 um CMOS cell library. In addition to the prime length design, we also apply the GDA approach to the long length power-of-two DFT design commonly used in the communication system. We combine the proposed hardware efficient GDA approach with the Cooley-Tukey algorithm on DFT decomposition, and pseudocirculant matrix factorization algorithm on cyclic convolution partitioning to facilitate the long- and variable-length DFT design with low hardware cost. The proposed design can be flexibly used to compute the 1-D 64/128/256/512/1024/2048/4096-point DFT by cascading two 1-D short length DFTs and summing up the partitioned short length cyclic convolutions for each stage of the cascaded DFT. Besides, the proposed hardware efficient design approach can also be adopted in the design with the length beyond power of two. Compared with the existing long-length and variable-length FFT design, in addition to the advantages of short latency and high hardware utilization efficiency, under the same throughput rate, the proposed variable-length DFT can be a competitive design, and save the hardware cost more than 9.6% while the transform length is smaller than 256. In summary, the presented GDA-based design approach provides a solution to efficiently implement not only the DSST’s but also the DSP applications involving convolution and correlation. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT008811837 http://hdl.handle.net/11536/55112 |
顯示於類別: | 畢業論文 |