Low-Power and High-Performance Multi-core Video Decoder

Title:	Low-Power and High-Performance Multi-core Video Decoder
Authors:	Weng, Tsung-Hsi (翁綜禧); Chung, Chung-Ping (鍾崇斌); Institute of Computer Science and Engineering
Keywords:	Low-Power;Parallel;Multi-core;Video decoding;Deblocking filter;Decode Picture Buffer
Issue Date:	2015
Abstract:	With many-core architectures becoming the trend in system design, video decoding time can be reduced substantially if the decoder distributes its operations appropriately across multiple processing elements. However, access to off-chip memory may become another performance bottleneck, because the performance gap between DRAM and processor cores continues to widen. In an H.264 video decoder, deblocking accounts for about one-third of all computation, while the Decode Picture Buffer (DPB) accounts for most of the decoder's memory requirement and dominates its hardware cost. This thesis therefore studies these two functions in depth. For deblocking, both 16-pixel-long and 4-pixel-long boundaries are used as the basis for analyzing and exploiting parallelism. Compared with the conventional two-dimensional (2D) wave-front order, and given an unlimited number of processing elements, the proposed order for 16-pixel-long boundaries achieves speedups of 1.57 and 2.15 on 1920×1080- and 1080×1920-pixel frames, respectively, and the proposed order for 4-pixel-long boundaries achieves speedups of 1.92 and 2.44, respectively. Comparing the two proposed orders, the order for 4-pixel-long boundaries gains speedups of 1.25 and 1.13 over the order for 16-pixel-long boundaries, respectively. For the DPB, the proposed size-reduction scheme does not affect decoding performance and reduces storage by 60% to 70% overall, with a PSNR decrease of 0.32 dB to 1.98 dB compared with the standard H.264 decoder. In addition, as the frame size grows, the proposed orders require extra time proportional only to the square root of the frame-size increase (at a fixed width-to-height ratio), and the DPB size-reduction scheme makes it easier to support larger frames at the same chip cost. Together, these results make real-time decoding of increasingly large video frames more practical.
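The square-root scaling mentioned in the abstract can be illustrated with a small sketch of a generic 2D wave-front schedule. This is not the order proposed in the thesis. It assumes a simplified dependency pattern in which each 16×16 macroblock waits only for its left and upper neighbours, one time step per macroblock, and an unlimited number of processing elements. The function names (`mb_grid`, `wavefront_steps`, `simulate_wavefront`) are hypothetical and exist only for this illustration.

```python
import math

MB = 16  # H.264 macroblock edge length in pixels


def mb_grid(width_px, height_px):
    """Return (columns, rows) of 16x16 macroblocks covering the frame."""
    return math.ceil(width_px / MB), math.ceil(height_px / MB)


def wavefront_steps(cols, rows):
    """Closed form for the assumed schedule: one anti-diagonal per step."""
    return cols + rows - 1


def simulate_wavefront(cols, rows):
    """Compute the same step count by explicit dependency tracking.

    Assumption: a macroblock can start one step after both its left and
    upper neighbours have finished; there is no limit on parallel units.
    """
    finish = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            left = finish[r][c - 1] if c > 0 else 0
            up = finish[r - 1][c] if r > 0 else 0
            finish[r][c] = 1 + max(left, up)
    return finish[rows - 1][cols - 1]


if __name__ == "__main__":
    for w, h in [(1920, 1080), (3840, 2160)]:
        cols, rows = mb_grid(w, h)
        print(w, h, cols * rows, wavefront_steps(cols, rows))
```

Under these assumptions, going from 1920×1080 to 3840×2160 quadruples the macroblock work but only doubles the number of wave-front steps (187 to 374), which matches the square root of the frame-size increase at a fixed aspect ratio. The speedups reported in the abstract come from the thesis's own orders and cannot be reproduced from this simplified model.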
URI:	http://140.113.39.130/cdrfb3/record/nctu/#GT079455832 http://hdl.handle.net/11536/127094
Appears in Collections:	Thesis