Title: | A High-accuracy and Cost-effective SFP SUMMA Array Processor for CNN Inference Application |
Authors: | Li, Chi-Jiun; Liu, Chih-Wei | Institute of Electronics |
Keywords: | neural network; accelerator; static floating-point arithmetic; array processor; convolutional neural network; CNN inference |
Issue Date: | 2017 |
Abstract: | We propose a high-accuracy, cost-effective array processor for deep convolutional neural network (DCNN) inference. The proposed static floating-point (SFP) arithmetic restricts MAC operations to the effective, non-zero bits of the data, which improves energy efficiency while achieving higher accuracy within a limited 8-bit data width. Moreover, by adopting the data scheduling of the Scalable Universal Matrix Multiplication Algorithm (SUMMA), the design avoids storing duplicated data in local storage, and operands are broadcast to the processing elements (PEs) that need them. An embedded lightweight stream interface unit (SIU) greatly reduces how often operands (data and weights) are read from or written to the central register file (CRF), minimizing power consumption. Simulation results show that, for the five convolutional layers of AlexNet and at a comparable throughput, the proposed SFP SUMMA deep-inference processor (DIP) consumes about 40% less power than the MIT Eyeriss accelerator (167 mW vs. 278 mW). On the ImageNet dataset, the proposed design achieves about 56.47% top-1 accuracy (note: a GPU baseline achieves about 56.90%), whereas MIT Eyeriss achieves only about 50.18%. Synthesized in TSMC 90 nm CMOS technology, the proposed SFP SUMMA DIP delivers 0.45 TOPs/W; in contrast, running the same five AlexNet convolutional layers, MIT Eyeriss delivers only about 0.3 TOPs/W (at 65 nm CMOS). |
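The 8-bit static floating-point idea in the abstract can be illustrated with a minimal quantizer sketch. This is not the thesis's actual SFP hardware: the function name `sfp_quantize`, the per-tensor fractional-bit parameter, and the saturation behavior are assumptions chosen only to show how a scale fixed offline ("static") maps values onto a narrow signed significand.

```python
def sfp_quantize(x, frac_bits, total_bits=8):
    """Hypothetical static floating-point quantization sketch.

    A per-tensor scale of 2**-frac_bits is assumed to be fixed
    offline (the "static" part); each value then keeps only
    `total_bits` signed bits of significand, with saturation.
    """
    lo = -(1 << (total_bits - 1))        # e.g. -128 for 8 bits
    hi = (1 << (total_bits - 1)) - 1     # e.g. +127 for 8 bits
    q = round(x * (1 << frac_bits))      # scale and round to integer grid
    q = max(lo, min(hi, q))              # saturate to the 8-bit range
    return q / (1 << frac_bits)          # dequantize for comparison
```

Because the exponent (scale) is chosen per tensor rather than per value, the MAC datapath can operate on plain 8-bit integers, which is consistent with the abstract's claim of working only on effective bits.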
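The SUMMA scheduling described in the abstract can be sketched as a sequence of rank-1 (outer-product) updates: at each step, one column of A and one row of B are broadcast, and every position (i, j) accumulates their product, so no operand is stored more than once per step. The sequential function below (name `summa_matmul` is ours, not the thesis's) only illustrates this dataflow, not the parallel PE array itself.

```python
def summa_matmul(A, B):
    """Sequential sketch of SUMMA-style outer-product accumulation.

    A is m-by-k, B is k-by-n (lists of lists). Each iteration over t
    mimics one broadcast step: column t of A goes to all "rows" of
    the PE array, row t of B goes to all "columns", and every C[i][j]
    performs one MAC on the broadcast pair.
    """
    m, k = len(A), len(A[0])
    n = len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for t in range(k):
        col = [A[i][t] for i in range(m)]  # broadcast along PE rows
        row = B[t]                         # broadcast along PE columns
        for i in range(m):
            for j in range(n):
                C[i][j] += col[i] * row[j]  # rank-1 update (one MAC per PE)
    return C
```

In hardware, the inner two loops would run concurrently across the PE array, which is why broadcasting each operand once replaces the duplicated local copies that a purely local-reuse schedule would need.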
URI: | http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070450242 http://hdl.handle.net/11536/142855 |
Appears in Collections: | Thesis |