Title: 視訊嵌入轉碼器之演算法與其硬體架構設計空間探討
An Algorithm and Its Architecture Design Space Exploration of a Video Embedding Transcoder
Authors: 李志鴻
Chih-Hung Li
蔣迪豪
Tihao Chiang
Institute of Electronics
Keywords: 視訊嵌入轉碼器; Video Embedding Transcoding
Issue Date: 2007
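Abstract: Video embedding services are increasingly common in today's diverse multimedia applications. Because multimedia data volumes are large, most video is now stored and transmitted in compressed form. Among the many compression standards, H.264/AVC has become the mainstream, so a video embedding transcoder targeting H.264/AVC is increasingly important for efficient storage and network transmission. This thesis addresses algorithm development and hardware architecture design exploration for a video embedding transcoder. First, our fast H.264/AVC video embedding transcoding algorithm is the first video embedding transcoding technique for the H.264/AVC standard reported in the literature. Second, in hardware architecture design, we combine H.264/AVC transcoding and encoding functions in a single hardware architecture at minimal cost, the first design in the literature to do so. Third, in design space exploration, this is the first H.264/AVC-related hardware design in the literature to use transaction level modeling (TLM) for system performance simulation and analysis.
The first part of this thesis focuses on a low-complexity algorithm for a multiple-window video embedding transcoder. To overcome the high complexity of the traditional cascaded pixel domain transcoder, we adopt partial re-encoding to reduce the computation needed for transcoding and to limit the quality degradation caused by re-quantization. For blocks with prediction mismatch, we use information from the original compressed bitstream to assist prediction refinement, namely intra mode switching and motion vector re-mapping, which lets us completely remove the two most complex encoder modules: mode decision and motion estimation. For blocks with residue mismatch, we derive from theory and experimental data an efficient way to identify the blocks most in need of error correction. Experiments show that correcting only a very small fraction of blocks raises the transcoded video quality by about 2 dB. Compared with the cascaded pixel domain transcoder, our approach increases transcoding speed by a factor of 25 and, more surprisingly, improves PSNR by up to nearly 1.5 dB.
The second part focuses on design space exploration. We implement the proposed low-complexity algorithm in a platform-based system design and combine the H.264/AVC transcoder and decoder on a single platform at the lowest cost. Within the proposed high-performance system architecture, we explore several key system parameters: hardware parallelism, data exchange granularity, and design balancing. Unlike the traditional bottom-up design methodology, we adopt a top-down, refinement-based design methodology to achieve better exploration efficiency. We mainly use electronic system level (ESL) simulation and exploration. The simulation platform operates mostly at the transaction level and runs about three orders of magnitude faster than traditional register transfer level (RTL) simulation. Compared with traditional system design, our design therefore offers considerable freedom to optimize for different design constraints: when optimizing for hardware cost, our design alternative reduces hardware cost to 25% of the original; when optimizing for speed, our design alternative doubles the speed. At a 135 MHz operating clock, we provide real-time transcoding or decoding output for 1920x1088 high-definition video at 60 frames per second.
The third part focuses on low-cost, high-performance hardware module design, concentrating on two core blocks: pixel prediction and the deblocking filter. First, we unify H.264/AVC intra and inter prediction in a single hardware architecture, which both raises hardware utilization and substantially reduces data bus traffic. Second, we propose a deblocking filter with fine-grained synchronization capability, which improves video pipe performance across different levels of data exchange granularity.
In summary, this thesis presents a low-complexity, high-efficiency video embedding transcoder. On the algorithm side, we focus on fast and efficient refinement and correction. In design space exploration, we analyze the effect of each system parameter efficiently and quantitatively, so that an optimized design can be reached under different design constraints. In hardware design, we focus on raising hardware utilization and overall system performance.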
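The quality figures in the abstract are stated in dB of PSNR. As a reading aid, the following minimal sketch uses the standard PSNR definition for 8-bit samples and shows what a gain in dB means in terms of mean squared error. The function names `psnr` and `mse_ratio_for_gain` are illustrative and are not taken from the thesis, which does not describe its measurement code.

```python
import math

def psnr(reference, test, peak=255):
    """Standard PSNR in dB between two equal-length sample sequences.

    Assumes 8-bit samples (peak = 255); returns infinity for identical inputs.
    """
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10 ** (gain_db / 10)

# A 1.5 dB PSNR gain corresponds to dividing the MSE by this factor.
ratio = mse_ratio_for_gain(1.5)
print(f"1.5 dB gain -> MSE reduced by a factor of {ratio:.2f}")
```

Under this definition, a gain of 1.5 dB corresponds to roughly a 1.41-fold reduction in mean squared error.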
As the H.264/AVC standard gains worldwide adoption, video embedding has become an important service. This thesis presents an H.264/AVC multiple-window video embedding transcoder in three parts: an algorithm, a system architecture design space exploration, and two novel micro-architectures. The first part describes an algorithm with lower complexity than the traditional cascaded pixel domain transcoder. Partial re-encoding is adopted to reduce both complexity and the quality degradation caused by re-quantization. Specifically, intra mode switching and motion vector re-mapping eliminate the need for the mode decision and motion estimation modules. Moreover, guided by theoretical analysis, only 5% of all blocks need error correction. Compared with the traditional cascaded transcoder, the proposed approach improves quality by up to 1.5 dB and increases throughput by a factor of 25. The second part describes the architecture of a platform-based video embedding transcoder and explores its design space, using transaction level modeling, along several system dimensions: hardware parallelism, data exchange granularity, and design load balancing. The top-down, refinement-based design methodology enables effective exploration with a high degree of freedom to optimize for various design constraints. It also identifies the critical modules and shows how each can be optimized in speed, memory, and bandwidth to improve overall performance. Our best design alternatives either reduce cost by a factor of four or double the speed, so that the design achieves 1920x1088 @ 60 Hz video transcoding at 135 MHz. The third part describes two novel micro-architectures, one for the prediction module and one for the deblocking filter module. Prediction is unified in a systolic architecture to improve hardware utilization and transmission bandwidth usage. The deblocking filter is implemented with multi-level, fine-grained synchronization to improve system performance at finer levels of granularity.
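To put the real-time figure in perspective, the following back-of-the-envelope sketch computes the average clock-cycle budget per macroblock implied by 1920x1088 @ 60 Hz at 135 MHz. It assumes the standard 16x16 H.264/AVC macroblock and that every clock cycle is available for macroblock processing. The function name `cycles_per_macroblock` is illustrative, and the thesis may account for the budget differently.

```python
MB_SIZE = 16  # an H.264/AVC macroblock covers 16x16 luma samples

def cycles_per_macroblock(width, height, fps, clock_hz):
    """Average clock cycles available per macroblock for real-time operation.

    Assumes frame dimensions are multiples of 16 and that no cycles are
    lost to stalls or other overhead (an idealization).
    """
    mbs_per_frame = (width // MB_SIZE) * (height // MB_SIZE)
    mbs_per_second = mbs_per_frame * fps
    return clock_hz / mbs_per_second

# Figures from the abstract: 1920x1088 at 60 frames per second, 135 MHz clock.
budget = cycles_per_macroblock(1920, 1088, 60, 135_000_000)
print(f"{budget:.1f} cycles per macroblock")
```

Under these assumptions, each frame holds 8160 macroblocks, so the system has on the order of 276 cycles per macroblock on average.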
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009011614
http://hdl.handle.net/11536/80435
Appears in Collections: Thesis


Files in This Item:

  1. 161401.pdf
