Title: A systolic array based GTD processor with a parallel algorithm
Authors: 周俊瑋
Chou, Chun-Wei
楊家驤
Yang, Chia-Hsiang
Department of Electronics Engineering and Institute of Electronics
Keywords: multiple-input multiple-output (MIMO); generalized triangular decomposition (GTD); geometric mean decomposition (GMD); reconfigurable architecture
Issue Date: 2014
Abstract: Generalized triangular decomposition (GTD) has proven highly useful in signal processing, but no efficient hardware implementation has been reported in the literature. This thesis presents the first hardware architecture for GTD, together with the first parallel GTD algorithm. For an 8x8 matrix, the parallel algorithm is up to 1.66 times faster than the conventional sequential algorithm. The proposed reconfigurable architecture computes three matrix decompositions, namely singular value decomposition (SVD), geometric mean decomposition (GMD), and GTD, for matrix sizes from 1x1 to 8x8. The processor is built as a heterogeneous systolic array of 16 processing cores, each of which uses an area-efficient coordinate rotation digital computer (CORDIC) as its arithmetic unit; distributing the computation across these cores raises the throughput. The latency constraint of 16 µs specified in the IEEE 802.11ac standard is adopted as the design target for the decomposition time. The processor was designed and implemented in a standard 90 nm CMOS technology to validate the concept. At an operating frequency of 112.4 MHz, it achieves a maximum throughput of 83.3k 8x8 GTDs per second. The estimated power and core area are 172.7 mW and 1.96 mm^2, respectively.
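The abstract names CORDIC as the arithmetic unit of each processing core. As a rough illustration of what such a unit computes, the following is a minimal sketch of CORDIC in vectoring mode, the operation that rotates a two-element vector onto one axis and thereby zeroes the other element, which is the basic step in triangular decompositions. It is not taken from the thesis: the function name `cordic_vectoring`, the iteration count, and the use of floating-point multiplications by powers of two in place of fixed-point shifts are all assumptions made for readability.

```python
import math

def cordic_vectoring(x, y, iterations=16):
    """Hypothetical helper: rotate (x, y) onto the positive x-axis.

    Returns (magnitude, angle), approximating (hypot(x, y), atan2(y, x)).
    Assumes x > 0 so the input lies inside the CORDIC convergence range.
    """
    angle = 0.0
    for i in range(iterations):
        step = 2.0 ** -i                 # a right shift by i bits in hardware
        d = -1.0 if y > 0 else 1.0       # pick the direction that drives y toward 0
        x, y = x - d * y * step, y + d * x * step
        angle -= d * math.atan(step)     # hardware would read this from a small table
    # Each micro-rotation stretches the vector; divide out the accumulated gain.
    gain = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(iterations))
    return x / gain, angle

if __name__ == "__main__":
    magnitude, angle = cordic_vectoring(3.0, 4.0)
    print(magnitude, angle)
```

With 16 iterations, `cordic_vectoring(3.0, 4.0)` returns values close to 5.0 and `math.atan2(4.0, 3.0)`. A hardware core would use fixed-point adders and shifters only, which is why CORDIC is commonly described as area-efficient.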
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT070150249
http://hdl.handle.net/11536/76452
Appears in Collections: Thesis