Title: 應用於HEVC即時具交錯細微排程超高解析度移動估測處理器
Real Time Ultra HD Motion Estimation Processor with Fine Grained Interleaved Scheduling for HEVC
Authors: 方志仲
Fang, Chih-Chung
張添烜
Chang, Tian-Sheuan
Department of Electronics Engineering and Institute of Electronics
Keywords: Motion Estimation; Interleaved Fine Grained Scheduling; Video Compression; Processor; HEVC; Inter Prediction
Publication Date: 2015
Abstract: The latest video coding standard, HEVC, improves inter prediction by supporting larger coding units and a recursive, large-to-small coding structure whose strong dependencies between coding units raise coding efficiency. These gains come with significant data dependency and computational complexity. In hardware, the heavy computation of high-resolution video and the data dependencies between prediction units (PUs) become a bottleneck for pipelined designs: encoding time grows long and leaves no time budget for further algorithmic optimization. To address these problems, this thesis proposes fast inter prediction algorithms and a hardware architecture that differs from previous designs.
For the algorithm part, the goal is to reduce computation and lessen data dependency. Integer motion estimation (IME) adopts a modified PEPZS algorithm to reduce computation, and the time saved is used to add fractional-pixel-accuracy AMVP, which makes early termination more effective. Fractional motion estimation (FME) uses a hybrid, PU-size-dependent fast algorithm: interpolation-free estimation is applied to all PU sizes, which for PU 64x64 and PU 32x32 lowers both computation and memory bandwidth, and PU 16x16 and PU 8x8 additionally use a refinement search to improve coding efficiency. This arrangement balances complexity against performance.
For the hardware design, the adopted algorithm is combined with an interleaved scheduling structure that raises the low hardware utilization caused by data dependency. Interleaving lets PU blocks that do not depend on each other be processed at the same time, which shortens overall execution time. Because dependency in HEVC is very high, interleaving alone does not shorten execution time significantly, so the design also applies hardware-resource-oriented fine-grained scheduling: IME and FME are decomposed into smaller SAD, SATD, and interpolation filter units, which further reduces idle time in the low-level hardware. Even with the extra fractional AMVP computation, this architecture and scheduling cut overall execution time by about 10% under the computational complexity of previous work.
Compared with the HEVC reference software HM 13.0, simulation shows a BD-rate performance drop of 4.0%, 4.3%, and 4.3% for the Y, U, and V components, respectively. Synthesized in a TSMC 90 nm CMOS process, the proposed design costs 422.9K logic gates and 21.1 Kbytes of on-chip memory. At an operating frequency of 270 MHz, it supports 4Kx2K video at 30 fps with bi-prediction.
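The cost metrics and throughput figures named in the abstract can be illustrated with a minimal sketch. It is not the hardware or the algorithm from the thesis. It assumes a 3840x2160 frame for "4Kx2K", a 64x64 block as the unit of the cycle budget, and an unnormalized 4x4 Hadamard transform for SATD. The function names `sad`, `satd4x4`, and `cycles_per_ctu` are hypothetical.

```python
import math

# 4x4 Hadamard matrix (symmetric, entries +1/-1); an illustrative choice of size.
H4 = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]


def sad(cur, ref):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(c - r)
               for cur_row, ref_row in zip(cur, ref)
               for c, r in zip(cur_row, ref_row))


def _matmul4(a, b):
    """Plain 4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd4x4(cur, ref):
    """Sum of absolute Hadamard-transformed differences of a 4x4 block.

    Unnormalized: real encoders usually scale the result, which is omitted here.
    """
    diff = [[c - r for c, r in zip(cur_row, ref_row)]
            for cur_row, ref_row in zip(cur, ref)]
    # H4 is symmetric, so H4 * diff * H4 equals H4 * diff * H4^T.
    transformed = _matmul4(_matmul4(H4, diff), H4)
    return sum(abs(v) for row in transformed for v in row)


def cycles_per_ctu(clock_hz, fps, width, height, ctu=64):
    """Average clock cycles available per ctu x ctu block at a given frame rate."""
    ctus_per_frame = math.ceil(width / ctu) * math.ceil(height / ctu)
    return clock_hz / (fps * ctus_per_frame)


if __name__ == "__main__":
    ones = [[1] * 4 for _ in range(4)]
    zeros = [[0] * 4 for _ in range(4)]
    print(sad(ones, zeros), satd4x4(ones, zeros))
    # Assumed frame size: 3840x2160 (the abstract only says "4Kx2K").
    print(cycles_per_ctu(270_000_000, 30, 3840, 2160))
```

Under these assumptions, the 270 MHz clock and 30 fps target leave roughly 4,400 cycles for each 64x64 block. A budget of that size is one way to see why the abstract puts so much weight on keeping the SAD, SATD, and interpolation units busy.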
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT070250225
http://hdl.handle.net/11536/127043
Appears in Collections: Theses