標題: 位元層次內積運算之超大型積體電路架構設計
VLSI ARCHITECTURE DESIGN FOR BIT-LEVEL INNER PRODUCT
作者: 張添烜
Tian-Sheuan Chang
任建葳
Prof. Chein-Wei Jen
電子研究所
關鍵字: 共享子表示式;數位傅立葉轉換;數位餘弦轉換;迴旋表示式;低功率設計;可程式化濾波器;common subexpression sharing;DFT;DCT;cyclic formulation;low power;programmable filters
公開日期: 1999
摘要: 內積運算為數位訊號處理之應用,例如多媒體處理和通訊系統中的一個重要建構單元。由於應用範圍的廣泛,如何有效率的實現來滿足不同的應用需求成為一重要研究課題。在本論文中,我們對此課題以探討內積運算的位元層次設計空間加以研究,在探討中將包含可程式化和不可程式化的運算元。 對不可程式化的內積運算,我們考慮固定運算元的常數及數值特性,以使所得的乘法器為一硬體連線並具有共享子表示式之特性。因此,我們提出一個新的分散式算術技巧,此技巧將固定的輸入展開為位元層次,且我們利用共享部分乘積之和與固定輸入的稀疏非零位元特性,來減少運算數目。此所提的分散式算術技巧,已成功地運用在一個二維反數位餘弦函數晶片設計、一個處理器核心設計和一個FPGA實現上。此所提的處理器核心設計、可應用於即時H.263壓縮和數位相機之應用。在設計中,我們已將分散式算術技巧運用到極至:只用一個字元加法器和移位器。再者,此設計更進一步結合快速直接二維數位餘弦函數演算法可減少運算週期。本設計方法具簡單、規則、且容易擴展至其他更高輸出率的應用特性。對FPGA的實現,由於FPGA本身具位元運算特性,因此所提的分散式算術技巧非常適合於此結構,所得之架構設計和傳統的的分散式算術相比,可減少超過三分之二的硬體代價。 除了上述利用共享子表示式作架構最佳化外,我們也考慮演算法的重新列式。演算法的重新列式是將轉換的數學式重組為迴旋表示式,以促成更好的共享子表示式架構。我們提出兩個新的有效率的數位傅立葉轉換架構,此所提的架構結合數位傅立葉轉換係數的對稱性以增加所得的輸出率。對質數長度數位傅立葉轉換設計,可節省80%的閘面積且具有兩倍的輸出率。對二的乘冪長度的數位傅立葉轉換,和現有最佳設計相比,我們的設計具有相當競爭性的面積-時間複雜度。 針對可攜式應用環境,我們也提出了利用差分係數和差分輸入的低功率設計技巧。使用差分技巧比起使用直接輸入和係數,所需要的位元數要少,因此可以減少運算單元的大小而降低功率消耗。我們提出一個改進的演算法以有效的產生差分係數,並使差分係數方法可應用於濾波器的全頻寬而非如先前方法只用於窄頻濾波器。使用固定係數濾波器的模擬結果顯示變化動作減少幅度在全頻寬上可從1%到53%,面積可減少至50%,所得的設計比起先前方法所得的結果在應用廣度、功率消耗和面積上都較為優越。 對可程式化的濾波器設計,我們提出數字串列的架構,此設計在演算法層面用分散式算術技巧加以重組以免除累加的運算,並使用(p, q)壓縮器取代Booth編碼以得到高速運算。所得之架構和先前的設計相比,可節省達17%的硬體面積。
Inner product is an important building block to many DSP applications such as multimedia, wireless and communication systems. Due to the wide range of applications, the study on efficient implementations to meet different application requirements becomes an important research topic. In this dissertation, we study this topic by exploring the bit-level design space of inner product, including both programmable and non-programmable operands. For non-programmable inner product, we explore its design space by considering the constant and the numerical property of the fixed operands such that the resulting multiplication is a hardwired one with common subexpression sharing. Thus, we propose a new distributed arithmetic (DA) technique that expands the fixed input into bit level so that we can take advantage of shared partial sum-of-products and sparse nonzero bits in the fixed input to reduce the number of computations. The proposed DA has been applied to a 2-D IDCT chip design, a processor core design, and FPGA implementations. The processor core design, which can be used in digital still camera and real time H.263 encoding, explores the sharing properties of the proposed DA to the extreme case: only one word adder and shifter. Furthermore, it may combine the fast direct 2-D DCT algorithm to reduce the computation cycles. The resulting architecture is quite simple, regular and easily scalable to other higher throughput applications. For FPGA implementations, due to its bit level grain size, the design with well-suited proposed DA can offer savings in excess of two-thirds of hardware cost, when comparing with the design by using conventional DA. Besides architecture optimization with common subexpression sharing, we also consider the algorithm reformulation. The algorithm reformulation formulates transform equations into cyclic convolution form to enable better sharing with common subexpression. We have proposed two efficient DFT designs that also combine the symmetry property of DFT coefficients to increase the resulting throughput. The prime-length DFT design can save 80% of gate area with two-times fast of throughput for length N=61. The power-of-two length DFT design achieves competitive area-time complexity comparing with previous designs. For portable applications, we also consider low power filter realization by using differential coefficients and inputs instead of using them directly such that fewer bits are required thereby reducing the size of arithmetic units and power dissipation. We present an improved algorithm to effectively generate differential coefficients so that the differential coefficients methods can be applied to full bandwidth of filters instead of only narrow band filters in previous approaches. Simulations with fixed coefficient filters indicate reduction in transition activity ranging from 1% to 53% over the full range of filter bandwidths. Reduction in area can be up to 50% due to less coefficient precision. The resulting design is superior to the one with previous approaches in applicability, power consumption, and area. For programmable filters, we present a digit-serial architecture that uses DA form in the algorithm level for accumulation-free operations, and (p, q) compressor instead of Booth encoding for high-speed operations. The resulting design can save up to 17% hardware cost comparing with the previous approach. 1.1 BIT-LEVEL INNER PRODUCT 1 1.2 CLASSIFICATION RULE 3 1.3 WORD-LEVEL DESIGN 7 1.4 DESIGN WITH BIT-LEVEL INPUTS 8 1.5 DESIGNS WITH BIT-LEVEL COEFFICIENT 9 1.6 SUMMARY OF THE DESIGN TECHNIQUES 11 1.7 DISSERTATION ORGANIZATION AND CONTRIBUTION 12 CHAPTER 2 DESIGN AND APPLICATIONS OF COMMON SUBEXPRESSION SHARING 15 2.1 REVIEW OF COMMON SUBEXPRESSION SHARING 15 2.2 ADDER-BASED DA 17 2.2.1 Algorithms and architectures of DA 18 2.2.2 Architecture design 19 2.2.3 Precision analysis 20 2.2.4 Comparisons with relevant approaches 21 2.2.5 Design issues of the adder-based DA design 22 2.3 2-D IDCT PROCESSOR WITH ADDER-BASED DA 24 2.4 A PROCESSOR-CORE DESIGN FOR DCT/IDCT 26 2.4.1 Design techniques 27 2.4.2 Designs of processor core 29 2.4.3 Hardware cost and performance 37 2.4.4 Applications and comparisons 39 2.5 HARDWARE-EFFICIENT IMPLEMENTATIONS FOR TRANSFORMS IN FPGAS 42 2.5.1 FPGA-based transform design 44 2.5.2 Interface circuit design 49 2.6 SUMMARY 51 CHAPTER 3 TRANSFORM DESIGNS WITH CYCLIC FORMULATION AND COMMON SUBEXPRESSION SHARING 53 3.1 WHY CYCLIC FORMULATIONS? 53 3.2 PRIME-LENGTH DFT DESIGNS 55 3.2.1 Algorithm formulation of prime-length DFT 55 3.2.2 Architecture design 58 3.2.3 Performance analysis and comparisons 61 3.3 POWER-OF-TWO LENGTH DFT 63 3.3.1 Algorithm formulation 64 3.3.2 Architecture design 68 3.3.3 Performance analysis and comparisons 73 3.4 SUMMARY 78 CHAPTER 4 LOW POWER DIGITAL FILTER REALIZATION WITH DIFFERENTIAL COEFFICIENTS AND INPUTS 79 4.1 ALGORITHM FORMULATION 81 4.1.1 Review of DCM algorithm 81 4.1.2 Proposed algorithm 82 4.1.3 Determination of alpha and beta 84 4.2 RESULTS 86 4.2.1 Analytical model and results 86 4.2.2 Simulation model 87 4.2.3 Simulation results 88 4.3 SUMMARY 93 CHAPTER 5 HARDWARE EFFICIENT PROGRAMMABLE FILTER DESIGN 95 5.1 ALGORITHM REFORMULATION 97 5.2 ARCHITECTURE DESIGN 98 5.2.1 Architecture design of multi-bits DA 98 5.2.2 Extension to higher radix designs 102 5.2.3 Comparison to other designs 105 5.3 LOGIC LEVEL DESIGN WITH THE DOUBLE EDGE TRIGGERED DFFS 106 5.4 SUMMARY 108 CHAPTER 6 CONCLUSIONS AND FUTURE DIRECTIONS 109 6.1 CONCLUDING REMARKS 109 6.2 FUTURE DIRECTIONS 110 APPENDIX A THE GENERAL FORMULATION OF RADIX-2C DFT ALGORITHM 113 REFERENCES 115
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT880428002
http://hdl.handle.net/11536/65633
Appears in Collections:Thesis