Title: 高階視訊處理、擷取、特徵粹取及視訊結構化計算之研究
Towards High-Level Content-Based Video Retrieval and Video Structuring
Authors: 陳敦裕
Duan-Yu Chen
李素瑛
Suh-Yin Lee
Institute of Computer Science and Engineering
Keywords: video processing; content-based video retrieval; video feature extraction; video structuring
Issue Date: 2004
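Abstract: With the growth of digital video in education, entertainment, and other multimedia applications, the volume of digital video data is increasing rapidly. Users therefore need effective tools to obtain the video data they want quickly and efficiently. Among the ways to search video data, content-based methods carry the most high-level semantic meaning for users and are the most natural and friendly. Content-based video searching, browsing, and retrieval have consequently attracted researchers from many fields to develop methods for extracting high-level features from video data so that the data can be searched and retrieved efficiently. At the same time, as video compression techniques have matured, more and more video data is stored in compressed form, particularly in MPEG format, and this has drawn a growing number of researchers to the extraction of high-level features from compressed video. This thesis aims to develop compact and effective video features and to achieve semantically meaningful, high-level video structuring. First, we detect moving objects in compressed video and propose a moving-object tracking algorithm that tracks objects and generates their trajectories. From the object trajectories we infer the corresponding events and generate event labels, and finally build an event-based structured video browsing system. In building a high-level video structure, textual data, in addition to visual data, is a feature that carries even more semantic meaning. We therefore also propose a method for detecting text captions in compressed video. It uses the long-term persistence of captions as the basis for noise filtering, together with the property that captions have high gradient energy, to obtain meaningful text captions that support semantic video structuring. To provide effective similarity matching of video data for video retrieval, we also propose two high-level features based on moving objects: the T2D-Histogram Descriptor and the Temporal MIMB Moments Descriptor. Unlike traditional methods, which consider only spatial characteristics when extracting video features, the two proposed descriptors exploit both the spatial and the temporal characteristics of video data. We use the energy concentration property of the Discrete Cosine Transform to link the spatial characteristics of individual frames and to greatly reduce the amount of feature data, yielding compact high-level video features while keeping similarity matching efficient. We conducted large-scale, comprehensive experiments to evaluate the performance of the proposed methods. Within the scope of our experiments, the results show that, across a large number of test videos, our video similarity matching methods outperform many well-known methods.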
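The caption filtering idea in the abstract (keep regions that have high gradient energy and persist over many frames) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the block coordinates, the per-block energy values, the threshold, and the function names `high_energy_blocks` and `persistent_blocks` are all hypothetical, and how gradient energy is derived from the MPEG data is not specified here.

```python
def high_energy_blocks(energy_by_block, threshold):
    """Return the blocks whose (hypothetical) gradient energy reaches the threshold."""
    return {block for block, energy in energy_by_block.items() if energy >= threshold}


def persistent_blocks(candidates_per_frame, min_run):
    """Return blocks that are candidates in at least `min_run` consecutive frames."""
    run_length = {}  # block -> length of its current consecutive run
    kept = set()
    for candidates in candidates_per_frame:
        # Extend the run of every block seen in this frame; blocks that are
        # absent lose their run because they are not copied over.
        run_length = {block: run_length.get(block, 0) + 1 for block in candidates}
        kept.update(block for block, n in run_length.items() if n >= min_run)
    return kept


# Assumed toy input: per-frame energy for a few (column, row) blocks.
frames = [
    {(0, 0): 50.0, (5, 3): 90.0},
    {(5, 3): 85.0, (2, 2): 70.0},
    {(5, 3): 88.0},
]
candidates = [high_energy_blocks(frame, threshold=60.0) for frame in frames]
caption_blocks = persistent_blocks(candidates, min_run=3)
print(caption_blocks)
```

In this toy input only the block at `(5, 3)` stays above the threshold in all three frames, so short-lived high-energy blocks such as `(2, 2)` are treated as noise.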
With the growing use of digital video in education, entertainment, and other multimedia applications, there is an urgent demand for tools that let users acquire the video data they want efficiently. Content-based searching, browsing, and retrieval are more natural, friendly, and semantically meaningful to users. As video compression techniques mature, many videos are stored in compressed form, and accordingly more and more research focuses on feature extraction from compressed video, especially in MPEG format. This thesis investigates high-level semantic video features in the compressed domain for efficient video retrieval and video browsing. We propose an approach to video abstraction that generates semantically meaningful video clips and associated metadata. Based on the long-term consistency of the spatial-temporal relationships between objects in consecutive P-frames, a multi-object tracking algorithm is designed to locate the objects and to generate the trajectory of each object without size constraints. Using the object trajectories together with domain knowledge, the event inference module detects and identifies events in tennis videos. Consequently, the event information and the metadata of the associated video clips are extracted, and the abstraction of video streams is accomplished. A novel mechanism is proposed to automatically parse sports videos in the compressed domain and then to construct a concise table of video content from the superimposed closed captions and the semantic classes of video shots. An efficient approach to closed caption localization is proposed that first detects caption frames in meaningful shots. Caption frames, rather than every frame, are then selected as targets for detecting closed captions based on long-term consistency without size constraints.
In addition, to support automatic discrimination of captions of interest, a novel tool, a font size detector, is proposed to recognize the font size of closed captions using compressed data in MPEG videos. For effective video retrieval, we propose a high-level motion activity descriptor, the object-based transformed 2D-histogram (T2D-Histogram), which exploits both spatial and temporal features to characterize video sequences in a semantics-based manner. The Discrete Cosine Transform (DCT) is applied to convert the object-based 2D-histogram sequences from the time domain to the frequency domain. With this transform, the original high-dimensional time-domain features that represent successive frames are reduced to a small set of low-dimensional features in the frequency domain. The energy concentration property of the DCT allows us to use only a few DCT coefficients to capture the variations of moving objects effectively. With this efficient scheme for video representation, video retrieval can be performed accurately and efficiently. Furthermore, we propose a high-level compact motion-pattern descriptor, temporal motion intensity of moving blobs (MIMB) moments, which exploits both spatial invariants and temporal features to characterize video sequences. Again, the energy concentration property of the DCT allows us to use only a few DCT coefficients to capture the variations of moving blobs precisely. Compared with the MPEG-7 motion activity descriptors RLD and SAH, the proposed descriptor yields average performance gains of 40% and 21% over RLD and SAH, respectively. Comprehensive experiments have been conducted to assess the performance of the proposed methods. The empirical results show that these methods outperform state-of-the-art methods on various datasets with different characteristics.
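To make the DCT step concrete, the following is a minimal sketch of temporal feature compaction, not the thesis's actual descriptor. It assumes each histogram bin is tracked as a sequence of values over successive frames, applies an orthonormal DCT-II along time, and keeps only the first few coefficients. The function names, the per-bin layout, and the number of kept coefficients are illustrative assumptions.

```python
import math


def dct2(signal):
    """Orthonormal DCT-II of a 1-D sequence of numbers."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


def temporal_descriptor(bin_sequences, keep):
    """bin_sequences[b][t] is the value of bin b at frame t (assumed layout).

    Returns the first `keep` temporal DCT coefficients of every bin.
    """
    return [dct2(sequence)[:keep] for sequence in bin_sequences]


def energy_fraction(signal, keep):
    """Fraction of the signal energy held by the first `keep` DCT coefficients."""
    coeffs = dct2(signal)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:keep]) / total if total else 1.0


def l2_distance(desc_a, desc_b):
    """Euclidean distance between two descriptors of the same shape."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(desc_a, desc_b)
                         for a, b in zip(row_a, row_b)))


# Assumed toy data: two bins observed over eight frames.
clip = [[10, 11, 12, 13, 14, 15, 16, 17],
        [4, 4, 4, 4, 4, 4, 4, 4]]
descriptor = temporal_descriptor(clip, keep=2)
print(descriptor)
print(energy_fraction(clip[0], keep=2))
```

For slowly varying sequences such as these, two coefficients per bin retain almost all of the energy, so two clips can be compared with `l2_distance` on the short descriptors instead of on the full per-frame features.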
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT008717814
http://hdl.handle.net/11536/45556
Appears in Collections: Thesis


Files in This Item:

  1. 781401.pdf
