Title: | Pattern-Based Near-Duplicate Video Retrieval, Localization, and Annotation |
Authors: | Chou, Chien-Li (周建利); Lee, Suh-Yin (李素瑛); Tsai, Wen-Jiin (蔡文錦); Chen, Hua-Tsung (陳華總), Institute of Computer Science and Engineering |
Keywords: | Near-duplicate video retrieval; Near-duplicate video localization; Video copy detection; Web-scale video analysis; Automatic video annotation; Near-scene detection |
Issue Date: | 2015 |
Abstract: | With the exponential growth of web multimedia content, the Internet is rife with near-duplicate videos: copies of existing videos altered by visual/temporal transformations and/or post-production, uploaded either unknowingly or deliberately. These numerous near-duplicates lead to copyright infringement, redundant search results, wasted storage, and related problems. To address these issues, we propose a spatiotemporal pattern-based approach under a hierarchical filter-and-refine framework for efficient and effective near-duplicate video retrieval and localization. First, non-near-duplicate videos are quickly filtered out by a computationally efficient Pattern-based Index Tree (PI-tree). Then, m-Pattern-based Dynamic Programming (mPDP) localizes near-duplicate segments and re-ranks the retrieved videos according to their localization scores. For more effective retrieval and localization, we further propose a multi-feature framework based on pattern indexing: a Modified Pattern-based prefix tree (MP-tree) indexes the patterns of reference videos for fast pattern matching, and a novel data structure, the Multi-Feature Temporal Relation forest (MFTR-forest), discovers the temporal relations among matched patterns to evaluate the near-duplicate degree between the query video and each reference video. Comprehensive experiments on public datasets demonstrate that both proposed frameworks outperform state-of-the-art approaches on several evaluation criteria.
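The abstract describes the PI-tree/MP-tree indexing and pattern matching only at a high level. The following minimal Python sketch illustrates the general idea of a pattern prefix-tree index used for coarse filtering of candidate reference videos; the class and function names (PatternPrefixTree, add_pattern, match, filter_candidates), the exact-match lookup, and the min_matches threshold are illustrative assumptions, not the dissertation's actual data structures.

```python
from collections import defaultdict

class PatternPrefixTree:
    """Illustrative prefix tree over quantized frame-feature 'patterns'.

    Each pattern is a short sequence of quantized frame features; the tree
    maps it to the (video_id, start_frame) positions where it occurs.
    """
    def __init__(self):
        self.children = {}   # symbol -> child PatternPrefixTree node
        self.postings = []   # (video_id, start_frame) pairs whose pattern ends here

    def add_pattern(self, pattern, video_id, start_frame):
        node = self
        for symbol in pattern:
            node = node.children.setdefault(symbol, PatternPrefixTree())
        node.postings.append((video_id, start_frame))

    def match(self, pattern):
        """Exact pattern lookup; a real system would tolerate small variations."""
        node = self
        for symbol in pattern:
            node = node.children.get(symbol)
            if node is None:
                return []
        return node.postings


def filter_candidates(tree, query_patterns, min_matches=3):
    """Coarse filter: keep reference videos sharing enough patterns with the query."""
    hits = defaultdict(int)
    for pattern in query_patterns:
        for video_id, _ in tree.match(pattern):
            hits[video_id] += 1
    return {vid for vid, count in hits.items() if count >= min_matches}
```

In a filter-and-refine pipeline of this kind, only the surviving candidates would go through the more expensive refinement step (segment localization and near-duplicate scoring, handled by mPDP and the MFTR-forest in the dissertation's frameworks).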
Beyond mitigating the drawbacks of near-duplicate videos, their characteristics can also be exploited: since near-duplicate videos depict the same content, they provide a basis for automatic video annotation. Traditional video annotation approaches attach semantic keywords to keyframes/shots or to whole videos; however, keyframe/shot extraction bears little relation to the semantic content, and a few keywords can hardly describe a long video covering multiple topics. In this dissertation, we therefore define near-scenes, which contain similar concepts, topics, or semantic meanings, for better video content understanding and annotation, and we propose a novel framework of hierarchical video-to-near-scene (HV2NS) annotation that not only preserves but also purifies the semantic meanings of near-scenes. To detect near-scenes, a pattern-based prefix tree is first constructed to quickly retrieve near-duplicate videos; the videos containing similar near-duplicate segments and similar keywords are then clustered using multi-modal features, including visual and textual features. To enhance the precision of near-scene detection, a pattern-to-intensity-mark (PIM) method is proposed to perform precise frame-level near-duplicate segment alignment.
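The PIM alignment method itself is not detailed in the abstract. As a rough illustration of frame-level near-duplicate segment alignment, the sketch below uses a standard local-alignment dynamic program over an assumed per-frame similarity matrix; it stands in for, and should not be mistaken for, the proposed PIM algorithm, and all parameter names are assumptions.

```python
def align_near_duplicate_segment(sim, match_bonus=1.0, gap_penalty=0.5, threshold=0.7):
    """Locate the best-aligned near-duplicate segment between two videos.

    sim[i][j] is an assumed per-frame similarity in [0, 1] between query frame i
    and reference frame j. Returns ((q_start, q_end), (r_start, r_end)) frame ranges.
    """
    n, m = len(sim), len(sim[0])
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match_bonus if sim[i - 1][j - 1] >= threshold else -match_bonus
            score[i][j] = max(
                0.0,
                score[i - 1][j - 1] + step,     # align query frame i with reference frame j
                score[i - 1][j] - gap_penalty,  # skip a query frame
                score[i][j - 1] - gap_penalty,  # skip a reference frame
            )
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    # Trace back from the best-scoring cell to recover the aligned segment boundaries.
    i, j = best_cell
    q_end, r_end = i, j
    while i > 0 and j > 0 and score[i][j] > 0.0:
        step = match_bonus if sim[i - 1][j - 1] >= threshold else -match_bonus
        if score[i][j] == score[i - 1][j - 1] + step:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gap_penalty:
            i -= 1
        else:
            j -= 1
    return (i, q_end), (j, r_end)
```

The returned frame ranges correspond to the localized near-duplicate segment in the query video and in the reference video, the kind of frame-level output that the near-scene detection step builds on.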
For each near-scene, a video-to-concept distribution model analyzes the representativeness of keywords and the discrimination of clusters through the proposed potential term frequency and inverse document frequency (potential TFIDF) and entropy. Tags are ranked according to their video-to-concept distribution scores, and the highest-scoring tags are propagated to the detected near-scenes. Extensive experiments demonstrate that the proposed PIM outperforms state-of-the-art approaches in terms of quality segments (QS) and quality frames (QF) for near-scene detection, and that the proposed hierarchical video-to-near-scene annotation framework achieves high-quality near-scene annotation in terms of mean average precision (MAP). |
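The precise definitions of potential TFIDF and the entropy term are given in the dissertation body, not in the abstract. The sketch below only shows the general shape of such a tag-scoring scheme, combining term frequency, cross-cluster inverse document frequency, and an entropy-based discrimination factor; the function name, formulas, and weighting are assumptions for illustration, not the dissertation's model.

```python
import math
from collections import Counter

def tag_scores(cluster_tags, all_clusters_tags):
    """Rank candidate tags for one near-scene cluster.

    cluster_tags: list of tags from videos in this cluster.
    all_clusters_tags: list of tag lists, one per cluster (including this one).
    Scores combine term frequency, inverse document frequency across clusters,
    and an entropy-based penalty for tags spread evenly over many clusters.
    """
    tf = Counter(cluster_tags)
    n_clusters = len(all_clusters_tags)
    scores = {}
    for tag, count in tf.items():
        df = sum(1 for tags in all_clusters_tags if tag in tags)
        idf = math.log(n_clusters / df)
        # Distribution of the tag over clusters, used for an entropy-based penalty.
        counts = [tags.count(tag) for tags in all_clusters_tags]
        total = sum(counts)
        probs = [c / total for c in counts if c > 0]
        entropy = -sum(p * math.log(p) for p in probs)
        max_entropy = math.log(n_clusters) if n_clusters > 1 else 1.0
        discrimination = 1.0 - entropy / max_entropy
        scores[tag] = (count / len(cluster_tags)) * idf * discrimination
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Under such a scheme, the highest-scoring tags for each cluster would be the ones propagated to the detected near-scenes, mirroring the tag-propagation step described above.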
URI: | http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT079955822 http://hdl.handle.net/11536/139873 |
Appears in Collections: | Thesis |