Title: | Microarchitecture Designs for Advanced Digital Signal Processors (前瞻數位訊號處理器之微架構設計) |
Authors: | Tay-Jyi Lin (林泰吉); Chein-Wei Jen (任建葳); Chih-Wei Liu (劉志尉); Institute of Electronics (電子研究所) |
Keywords: | Digital Signal Processor; Register File; Very Long Instruction Word |
Publication Date: | 2005 |
Abstract: | Today's wireless and multimedia applications demand billions of operations per second. With current IC technology, it is not difficult to fabricate a processor that runs at hundreds of MHz to a few GHz and contains tens to hundreds of parallel arithmetic units. The real challenge is moving and exchanging data among these units efficiently while supplying the operands each unit needs in every cycle; coordinating and synchronizing their operations must also be planned carefully to suit most embedded systems. This dissertation first studies microarchitectural techniques for programmable processors that reduce the communication complexity among parallel arithmetic units. For digital signal processors, we propose a simple inter-cluster communication (ICC) mechanism based on load/store instruction pairs and a novel distributed ping-pong register organization. In our experiments in the UMC 0.13μm 1P8M Copper Logic Process, area and timing are reduced by 76.8% and 46.9%, respectively. Second, we study very long instruction word (VLIW) execution. Coordinating and synchronizing the arithmetic units at the instruction level reduces hardware complexity and makes execution time easy to estimate, but it yields very poor code density. We propose a unified VLIW encoding scheme that combines flexible variable-length instruction encoding, NOP removal, and automatic instruction replication to improve code density. In simulations with both hand-optimized and compiled code, the proposed approach reduces code size by 74.0% to 75.9%, to roughly one quarter of the original.
Finally, a complete VLIW DSP incorporating the proposed improvements is implemented and verified, from instruction set simulation in C++ and microarchitecture exploration in SystemC through FPGA prototyping to chip tapeout. The silicon implementation in the UMC 0.13μm 1P8M Copper Logic Process operates at 333MHz. Its core size is 3.2mm×3.15mm, including 128KB of data memory and 32KB of instruction memory. The average power consumption is 189mW. |
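The abstract names NOP removal as one of the code-density techniques but does not describe the encoding. As a rough illustration of the general idea only, and not the dissertation's actual scheme, the sketch below drops empty slots from a VLIW bundle and records the occupied slots in a bit mask. The 4-slot issue width, the one-header-word cost model, and the names `pack_bundle`, `unpack_bundle`, and `packed_words` are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

NUM_SLOTS = 4  # hypothetical issue width, not taken from the thesis

Bundle = List[Optional[str]]  # None stands for a NOP in that slot


def pack_bundle(slots: Bundle) -> Tuple[int, List[str]]:
    """Drop NOP slots and return (occupancy bit mask, remaining operations)."""
    if len(slots) != NUM_SLOTS:
        raise ValueError("bundle must have exactly NUM_SLOTS entries")
    mask = 0
    ops: List[str] = []
    for i, op in enumerate(slots):
        if op is not None:
            mask |= 1 << i  # bit i set means slot i holds a real operation
            ops.append(op)
    return mask, ops


def unpack_bundle(mask: int, ops: List[str]) -> Bundle:
    """Rebuild the fixed-width bundle, reinserting NOPs where the mask bit is 0."""
    it = iter(ops)
    return [next(it) if mask & (1 << i) else None for i in range(NUM_SLOTS)]


def packed_words(bundles: List[Bundle]) -> int:
    """Assumed cost model: one header word per bundle plus one word per real op."""
    return sum(1 + len(pack_bundle(b)[1]) for b in bundles)


if __name__ == "__main__":
    program = [
        ["add", None, None, "ld"],
        [None, None, None, None],
        ["mul", None, None, None],
    ]
    unpacked = NUM_SLOTS * len(program)
    print(f"unpacked: {unpacked} words, packed: {packed_words(program)} words")
```

Under this toy cost model, sparse bundles shrink and fully occupied bundles grow by the one-word header, which is why practical schemes combine NOP removal with other encodings such as the variable-length formats mentioned in the abstract.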
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT008711591 http://hdl.handle.net/11536/41001 |
Appears in Collections: | Thesis |