Title: | Data and Hardware Efficient Design for Convolutional Neural Network
Author(s): | Lin, Yue-Jin; Chang, Tian-Sheuan (Institute of Electronics)
Keywords: | Convolutional neural network; VLSI; ASIC; Accelerator
Issue Date: | 2016
Abstract: | Deep convolutional neural networks (CNNs) have achieved state-of-the-art accuracy in recognition, detection, and many other computer vision tasks. However, their hardware implementation is challenging because of the high computational complexity, the large amount of data to be processed, and the wide structural variation across network layers. In general, the throughput of the convolutional layers is bounded by the available computing resources, while the throughput of the fully connected layers is bounded by the available data bandwidth, so a highly flexible design is needed to accelerate a complete CNN (a rough roofline-style sketch of this trade-off follows this record).
This thesis presents an end-to-end CNN accelerator. It maximizes hardware utilization to 100% with run-time configurable kernel sizes, and it minimizes data bandwidth with an output-first strategy that improves data reuse in the convolutional layers by 300X to 600X over the non-reused case. For a target network, the implementation is generated to optimize both hardware and data efficiency under the design resource constraints, and it is reconfigured at run time with layer-optimized parameters to achieve real-time, end-to-end CNN acceleration. An example implementation for AlexNet, synthesized with a TSMC 40 nm process, costs about 1.783M gates for 216 multiply-accumulate units (MACs) and a 142.64 KB internal buffer. At a 454 MHz clock frequency, it achieves 99.7 fps for the convolutional layers of AlexNet and 61.6 fps for the full network.
URI: | http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070350222 http://hdl.handle.net/11536/139467 |
Appears in Collections: | Thesis
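The abstract's claim that convolutional layers are compute-bound while fully connected layers are bandwidth-bound can be illustrated with a rough roofline-style estimate. The sketch below uses the 216 MACs and 454 MHz clock stated in the abstract together with the commonly cited AlexNet layer sizes; the external bandwidth and the 16-bit weight width are assumptions introduced here for illustration only, not figures from the thesis.

```python
# Illustrative estimate: per-layer time if limited by compute vs. by
# streaming the layer's weights from external memory once per frame.
# MAC count and clock are from the abstract; DRAM_BW and BYTES_PER_WEIGHT
# are hypothetical values chosen only to show the trend.

NUM_MACS = 216            # parallel multiply-accumulate units (from the abstract)
CLOCK_HZ = 454e6          # clock frequency (from the abstract)
DRAM_BW = 6.4e9           # assumed external bandwidth in bytes/s (hypothetical)
BYTES_PER_WEIGHT = 2      # assumed 16-bit weights (hypothetical)

# (layer, MAC operations per frame, weights per layer) -- standard AlexNet sizes
layers = [
    ("conv1", 105e6,   35e3),
    ("conv2", 224e6,  307e3),
    ("conv3", 150e6,  885e3),
    ("conv4", 112e6,  664e3),
    ("conv5",  75e6,  442e3),
    ("fc6",    38e6, 37.7e6),
    ("fc7",    17e6, 16.8e6),
    ("fc8",     4e6,  4.1e6),
]

peak_macs_per_s = NUM_MACS * CLOCK_HZ

for name, macs, weights in layers:
    t_compute = macs / peak_macs_per_s               # time if fully compute-bound
    t_memory = weights * BYTES_PER_WEIGHT / DRAM_BW  # time to stream weights once
    bound = "compute" if t_compute > t_memory else "bandwidth"
    print(f"{name:6s} compute {t_compute*1e3:6.2f} ms | "
          f"weights {t_memory*1e3:6.2f} ms -> {bound}-bound")
```

Under these assumptions the conv1 to conv5 rows come out compute-bound and the fc6 to fc8 rows bandwidth-bound, matching the abstract's reasoning. The convolutional layers sum to roughly 666M MACs per frame, so a 216-MAC array at 454 MHz has a theoretical ceiling of about 147 fps; the reported 99.7 fps sits within that ceiling, while the drop to 61.6 fps for the full network reflects the weight traffic of the fully connected layers rather than a shortage of arithmetic units.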