Sparse Ternary Convolutional Neural Network Model and its Hardware Design

Title:	稀疏三元卷積類神經網路模型及其硬體設計 Sparse Ternary Convolutional Neural Network Model and its Hardware Design
Authors:	邱冠霖 (Chiu, Kuan-Lin); 張添烜; Institute of Electronics (電子研究所)
Keywords:	convolutional neural network; sparse calculation; ternary calculation; quantization calculation; neural network hardware
Issue Date:	2017
Abstract:	Convolutional neural networks (CNNs) have become very popular in machine learning in recent years, and their results are especially strong in computer vision. However, state-of-the-art models have high computational complexity, so computing them relies on powerful GPUs. Many works try to reduce the computation by quantizing weights or activations, but quantizing a model directly can hurt accuracy. This thesis therefore proposes a systematic flow, named progressive quantization, that simplifies the model step by step during training: weights are reduced from floating point to ternary values, activations are quantized from floating point to fixed point, and batch normalization is simplified at an appropriate point in training. With this flow, the accuracy drops for ResNet-56 and DenseNet-40 are 1.61% and 3.9%, respectively. The thesis also proposes matching hardware. Using sparse matrix loading and a grouped-sort and merge method, only non-zero values are fed into the accelerator, and the ternary weights allow the multipliers in convolution to be replaced with multiplexers and shift operators. For data reuse, the thesis proposes input view convolution, which reduces the dependency between convolution inputs, and dynamic PE cooperation, which achieves the highest parallelism with the fewest inputs across different numbers of output feature maps. Finally, a design synthesized in a TSMC 40nm process has about 3.28M gates. At a 500MHz clock frequency, it reaches about 1684 FPS for ResNet-56 on CIFAR10 and about 80 FPS for ResNet-34 on ImageNet.
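As a rough illustration of why ternary weights remove the need for multipliers, the sketch below ternarizes a weight vector, keeps only its non-zero entries, and accumulates a dot product by selecting +x or -x for each one. It is not the thesis's progressive quantization flow or its accelerator datapath: the fixed magnitude threshold, the (index, sign) sparse format, and the function names ternarize, to_sparse, and sparse_ternary_dot are all assumptions made for this example, and the shift operators mentioned in the abstract are left out.

```python
def ternarize(weights, threshold):
    """Map each float weight to -1, 0, or +1.

    Assumed rule for illustration: magnitudes at or below
    `threshold` become 0; the rest keep only their sign.
    """
    return [0 if abs(w) <= threshold else (1 if w > 0 else -1)
            for w in weights]


def to_sparse(ternary):
    """Keep only (index, sign) pairs for the non-zero weights."""
    return [(i, t) for i, t in enumerate(ternary) if t != 0]


def sparse_ternary_dot(sparse_weights, activations):
    """Dot product with no multiplication.

    Each non-zero weight acts like a 2-to-1 multiplexer that
    selects +x or -x; zero weights are never visited.
    """
    acc = 0
    for i, sign in sparse_weights:
        x = activations[i]
        acc += x if sign > 0 else -x
    return acc


# Hypothetical data: float weights and integer (fixed-point) activations.
weights = [0.8, -0.05, -0.6, 0.02, 0.3]
activations = [3, 7, 2, 5, 4]

ternary = ternarize(weights, threshold=0.1)
sparse = to_sparse(ternary)
result = sparse_ternary_dot(sparse, activations)
```

In this example only three of the five weights survive, so the accumulation touches three activations instead of five, which is the effect that loading only non-zero values is meant to exploit.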
URI:	http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070450230 http://hdl.handle.net/11536/142749
Appears in Collections:	Thesis