Title: High-Speed and Energy-/Area-Efficient Stochastic Multipliers for Multimedia Applications
Authors: Chen, Shin-Kai (陳信凱)
Liu, Chih-Wei (劉志尉)
Department of Electronics Engineering and Institute of Electronics
Keywords: Stochastic design; Speculating design; Adaptive carry estimation
Issue Date: 2013
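Abstract (translated from the Chinese): Next-generation multimedia applications achieve better compression by increasing computational complexity, which places ever greater demands on circuit speed. Conventional worst-case design methods, when used for high-speed circuits, usually bring a large increase in chip area and energy consumption. The emerging stochastic computation design method exploits the statistical characteristics of the application system, keeps correctness within a confidence interval, and reduces chip area and power consumption. In stochastic computation, the correctness of a circuit or its critical path can be relaxed and redefined. On the other hand, for an application system equipped with stochastic computation units, the correctness of its computational behavior depends not only on the architecture of those units but also on tight and effective integration with the redundancy property present in the application system, so that the stochastic errors produced by the units are eliminated and the system meets its performance requirements. Stochastic computation thus brings new opportunities and challenges to the design of systems-on-chip and architectures.
This thesis proposes a two-stage pipelined variable-latency speculating Booth multiplier that uses carry estimation. The multiplier outputs a speculative result in the first pipeline stage. A conventional pipelined Booth multiplier stalls the pipeline whenever a data hazard occurs, which costs system performance; the variable-latency speculating Booth multiplier stalls only when a data hazard occurs and the speculation is wrong. As speculation accuracy rises, the multiplier therefore greatly reduces the performance penalty of pipeline stalls and ensures that pipelining pays off. Compared with the highest-performance conventional two-stage pipelined Booth multiplier, the proposed multiplier saves about 7% of circuit area and 25.4% of energy consumption, and its system performance is also better. We evaluate it with JPEG compression, object detection, and H.264 decoding. The experimental data show that the proposed multiplier maintains a speculation accuracy of about 89% to 95% and reduces the number of instruction execution cycles by a factor of about 1.3 to 1.8. In terms of the multiplier's actual operating time (cycle count multiplied by cycle time), the proposed multiplier achieves an overall speedup of 1 to 1.4 times over the highest-performance conventional pipelined Booth multiplier.
Approximate design allows the arithmetic circuit to produce stochastic errors, which can further simplify the error detection and compensation circuits of the stochastic computation unit. The overall benefit of an approximate design is closely tied to the target application. This thesis proposes a hardware/software design flow for approximate design, together with a SystemC-based simulation environment that can run cycle-accurate, gate-level simulation of high-level applications, so that the benefit of an approximate design can be estimated early in the design process and the design space pruned. Based on the proposed variable-latency speculating Booth multiplier, this thesis proposes two approximate variants, fast forwarding and single-stage pipeline design, and uses the error-tolerant object detection application as a case study. With the proposed SystemC simulation environment, design space exploration of the approximate Booth multiplier can be carried out early in the design process to find the best carry estimation scheme. Experimental results show that the fast-forwarding variant runs 17.8% faster while keeping circuit area and energy consumption comparable to the reliable design. The single-stage variant is held back by too many stochastic errors and is only 13.2% faster, but its circuit area and power consumption are greatly reduced. Compared with a conventional single-stage Booth multiplier, its area and energy consumption are comparable and its speed is 20.5% higher.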
Today's multimedia applications improve compression ratio and quality through more complex computation, which requires high-speed circuits. The traditional technique of designing synchronous VLSI circuits around the critical-path delay incurs large area and energy overheads when building high-speed circuits. Stochastic computation, on the other hand, exploits the statistical nature of both the arithmetic unit and the application-level performance metrics to reduce these overheads. Correctness is redefined and relaxed in a stochastic computation circuit, which relies on architecture and/or application redundancies to maintain the desired system behavior. Stochastic computation therefore brings new opportunities and challenges to system and architecture design.
This thesis proposes a 2-stage variable-latency speculating modified Booth multiplier (VLSBM) for multimedia applications. The proposed VLSBM produces a speculative result at the end of the first stage, so a stall cycle is incurred only when a wrong speculation and a data hazard occur at the same time. When applied to a DSP algorithm with a data hazard (or dependence) probability PD, 0≦PD≦1, the experimental results show that the proposed VLSBM outperforms both the original Booth multiplier and the fastest conventional well-pipelined modified Booth multiplier when PD>0.32. For high PD (PD close to 1), the proposed VLSBM achieves a speedup of approximately 1.47 times over the fastest conventional pipelined Booth multiplier (UMC 90 nm CMOS) and, furthermore, saves approximately 25.4% of the energy per multiplication and 7% of the area.
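The stall condition described above can be illustrated with a small cycle-count model. This is only a sketch under stated assumptions, not the thesis's evaluation method: it assumes one stall cycle per stall event, treats data hazards and wrong speculations as independent, and uses hypothetical function names. The accuracy values passed in are placeholders chosen for illustration.

```python
def expected_cycles(p_hazard: float, p_stall_given_hazard: float) -> float:
    """Average cycles per multiplication: one issue cycle plus the expected stall.

    Assumption: each stall event costs exactly one extra cycle.
    """
    if not (0.0 <= p_hazard <= 1.0 and 0.0 <= p_stall_given_hazard <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    return 1.0 + p_hazard * p_stall_given_hazard


def conventional_cycles(p_hazard: float) -> float:
    # A conventional 2-stage pipelined multiplier stalls on every data hazard.
    return expected_cycles(p_hazard, 1.0)


def speculating_cycles(p_hazard: float, accuracy: float) -> float:
    # The speculating multiplier stalls only when a data hazard coincides
    # with a wrong speculation (assumed independent of the hazard).
    return expected_cycles(p_hazard, 1.0 - accuracy)


def cycle_count_ratio(p_hazard: float, accuracy: float) -> float:
    """Conventional cycle count divided by speculating cycle count."""
    return conventional_cycles(p_hazard) / speculating_cycles(p_hazard, accuracy)


if __name__ == "__main__":
    # Placeholder accuracy; sweep the hazard probability PD.
    for p_d in (0.0, 0.32, 0.5, 1.0):
        print(f"PD={p_d:.2f}  cycle-count ratio={cycle_count_ratio(p_d, 0.9):.3f}")
```

The ratio counts cycles only. Wall-clock speedup also depends on each design's cycle time, which this sketch does not model, so it says nothing about where the break-even PD lies.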
Across the multiplications in three multimedia applications (JPEG compression, object detection, and H.264/AVC decoding), the proposed VLSBM achieves a speedup of approximately 1.0 to 1.4 times and reduces the cycle count by a factor of approximately 1.3 to 1.8 compared with the fastest conventional two-stage pipelined Booth multiplier.
Approximate designs that stochastically bypass the inaccurate results of the speculating multiplier within a confidence interval can further enhance system performance. Characterizing the acceptable inaccuracy (the stochastic computation error within a confidence interval) requires high-level, cycle-accurate system synthesis from the start of the speculating multiplier's design. An efficient approximate design flow with a SystemC-based simulation framework is proposed to evaluate application characteristics and cycle-accurate gate-level behavior simultaneously. Consequently, the design of an approximate arithmetic unit or datapath can start at an early design stage. To verify the effectiveness of the proposed flow, two approximate variations of the variable-latency speculating Booth multiplier, fast-forward and single-stage, are investigated, with error-tolerant object detection as the case application. The proposed simulation framework is used to select a suitable carry estimation scheme for the approximate design. Experimental results show that the fast-forward approximate multiplier is approximately 17.8% faster while maintaining area and energy consumption comparable to the reliable prototype. The single-stage approximate multiplier suffers from poor accuracy and runs only 13.2% faster than the reliable multiplier, although its area and energy penalties are greatly reduced. Compared with a direct-implementation multiplier, its effective cycle time is 20.5% faster while the area and energy penalties are negligible.
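The abstract does not describe the carry estimator itself, so the following is only a generic illustration of carry speculation on a plain two-operand addition, not the thesis's adaptive scheme. It assumes a hypothetical estimator that guesses the carry out of the low half from the top `k` bits of each low-half operand, and it flags a wrong guess so that a reliable design could spend an extra cycle on correction. All names and parameters here are assumptions.

```python
import random


def estimated_carry(a_low: int, b_low: int, half: int, k: int) -> int:
    """Guess the carry out of the low half from only the top k bits of each operand."""
    shift = half - k
    return ((a_low >> shift) + (b_low >> shift)) >> k


def speculative_add(a: int, b: int, width: int = 16, k: int = 4):
    """Return (speculative_sum, guess_was_correct) for two width-bit operands.

    The high half is added with the *estimated* carry, so it does not have
    to wait for the low half's full carry chain.
    """
    half = width // 2
    if not 1 <= k <= half:
        raise ValueError("k must be between 1 and width // 2")
    mask = (1 << half) - 1
    a_low, b_low = a & mask, b & mask
    a_high, b_high = a >> half, b >> half

    guess = estimated_carry(a_low, b_low, half, k)
    actual = (a_low + b_low) >> half  # exact carry, known only after the low add

    speculative = ((a_high + b_high + guess) << half) | ((a_low + b_low) & mask)
    return speculative, guess == actual


def hit_rate(samples: int = 10_000, width: int = 16, k: int = 4, seed: int = 0) -> float:
    """Fraction of uniformly random operand pairs whose carry guess is correct."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        a, b = rng.getrandbits(width), rng.getrandbits(width)
        hits += speculative_add(a, b, width, k)[1]
    return hits / samples


if __name__ == "__main__":
    print(speculative_add(0x12F0, 0x0120))  # carry visible in the top bits
    print(speculative_add(0x00FF, 0x0001))  # carry ripples up from the low bits
    print(f"hit rate on uniform random inputs: {hit_rate():.3f}")
```

An approximate variant would pass the speculative sum on even when the flag is false. Whether that is tolerable depends on the application's operand statistics, which uniform random inputs do not represent.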
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT079311629
http://hdl.handle.net/11536/73356
Appears in Collections: Thesis