Multimodal and Multi-View Intelligent Multimedia Content Analysis, Understanding and Retrieval

Title:	Multimodal and Multi-View Intelligent Multimedia Content Analysis, Understanding and Retrieval
Author:	Chen Hua-tsung (陳華總), Department of Computer Science, National Chiao Tung University
Keywords:	Multimedia signal processing; video understanding; multi-view video analysis; multi-view video processing; multimedia information retrieval; automatic video annotation; depth map analysis; depth map; multimodal feature integration; multimodal feature; audio content analysis; scene classification; trajectory extraction; action recognition
Issue Date:	2012
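Abstract:	With the development of a wide variety of multimedia applications, digital audio-visual content is growing by the day. Much research is devoted to analyzing and understanding multimedia content in order to develop practical tools and systems. Most current research focuses on single-view video analysis, whose effectiveness is often limited by object occlusion, the lack of depth information, and incomplete 3D information. Therefore, besides using low-level features extracted from video data to infer the actions and trajectories of moving objects for event detection and behavior analysis, this project further investigates how to analyze and integrate information from multiple cameras so as to surpass what single-view video analysis can achieve. In addition, multimodal information such as audio signals and the depth maps obtained from Microsoft Xbox Kinect will be integrated to improve the analysis, understanding, indexing, and querying of multimedia content. The features and changes of the human body contour often indicate different postures and actions, so we propose an effective human contour extraction algorithm and compute a representative human skeleton representation for action analysis and event detection. Moreover, meaningful events in a video mainly arise from interactions among moving objects, which are also the part that interests users. We therefore first segment and track the moving objects and compute their motion trajectories. Object motion often follows specific patterns, and exploiting this property allows trajectories to be computed more efficiently and accurately. Given the trajectories, we design rules for specific events, search for matching trajectories, infer the corresponding events, and automatically generate annotations for them. Other applications can take a different approach: we define rules for the trajectories of normal behavior, and a trajectory that falls outside those rules signals abnormal behavior. On the other hand, the Kinect depth map overcomes the single camera's lack of depth information, so we analyze Kinect depth maps to strengthen video content analysis and action recognition. With the 2D trajectory of an object in the frame available, we further study how to reconstruct the 3D trajectory from single-view video using the physical characteristics of object motion and domain knowledge. Taking sports video as an example, gravity makes the 3D trajectory of a ball in the air well modeled by a parabolic function, and domain knowledge of the court specification can be used to compute the camera parameters. Combined with the previously obtained 2D trajectory, this lets us estimate the parabolic function that models the 3D ball path and thus reconstruct the ball's trajectory in 3D space. However, we do not account for every factor that affects the ball's path, such as air friction and the ball's spin angle and speed, so the 3D path built from a single view deviates from the true path. In addition, object tracking and trajectory computation in single-view video suffer errors caused by occlusion. We therefore investigate how to analyze information from multiple cameras to reconstruct 3D information more accurately. Multi-view video analysis first requires computing corresponding points, that is, identifying which points in different views are the same point in the real scene. Detecting specific line segments is often more accurate than detecting specific points, so we detect feature line segments in each view and use the segment endpoints or the intersections between segments as feature points to compute the correspondences across views. Finally, we jointly analyze the image data from multiple cameras and the correspondences among them to compute object trajectories and reconstruct 3D information more precisely, while integrating multimodal information such as depth maps and audio signals, in order to develop and build a multi-view, multimodal intelligent system for multimedia content analysis, understanding, and retrieval.
The explosive proliferation of multimedia data in education, entertainment, sport and various other applications necessitates the development of practical systems and tools for multimedia content analysis, understanding, indexing and retrieval. However, most existing work focuses on single-view video analysis, which is limited by the lack of depth information, incomplete 3D information, and the problem of object occlusion. Therefore, in addition to feature extraction, object tracking, event detection and action recognition in single-view video, we further investigate how to integrate the information from multiple cameras.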
Moreover, we also include other multimodal features, such as the audio signal and the depth map produced by Microsoft Xbox Kinect, to reinforce multimedia content analysis and understanding. The features and motions of the human body contour imply different postures and actions. Hence, we propose effective algorithms for human body contour extraction, together with a skeleton representation for action recognition and event detection. In addition, significant events are mainly caused by the interaction of moving objects, so we segment and track the moving objects and then model each trajectory according to the physical characteristics of object motion, so that it can be extracted more efficiently and accurately. We then design rules that map trajectories to corresponding events for automatic annotation. Other applications of trajectory-based video analysis are also possible: for example, we can detect an abnormal event when a trajectory occurs that does not match the pre-defined trajectories of normal events. On the other hand, the Kinect depth map overcomes the lack of depth information that limits single-view video analysis. Therefore, we apply depth map analysis to improve the effectiveness of content understanding and action recognition. Furthermore, we propose a scheme to reconstruct 3D information from single-view video by incorporating motion characteristics and domain knowledge. Take sports video as an example. The court specification is used for camera calibration. Motion equations based on the physics of ball motion define the 3D ball trajectory. We then map the equation-based 3D ball positions to 2D ball coordinates using the projection matrix computed in camera calibration. With the 2D ball coordinates known, we can estimate the parameters of the 3D motion equations and reconstruct the 3D trajectory.
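The single-view reconstruction step described above can be illustrated with a minimal sketch. It is not the project's implementation: it assumes a gravity-only parabolic model, a world frame whose Z axis points up, metric units, and an already known 3x4 projection matrix. The function names (`ball_position`, `project`, `fit_trajectory`), the camera matrix `P`, and the sample trajectory are all hypothetical. Each 2D observation gives two equations that are linear in the six unknowns (initial position and velocity), so a plain least-squares solve is enough.

```python
G = 9.81  # m/s^2; assumes world Z points up and units are metres and seconds


def ball_position(params, t):
    """3D position at time t for params = [x0, y0, z0, vx, vy, vz]."""
    x0, y0, z0, vx, vy, vz = params
    return [x0 + vx * t, y0 + vy * t, z0 + vz * t - 0.5 * G * t * t]


def project(P, X):
    """Project a 3D point with a 3x4 projection matrix P; returns (u, v)."""
    Xh = list(X) + [1.0]
    h = [sum(P[r][c] * Xh[c] for c in range(4)) for r in range(3)]
    return h[0] / h[2], h[1] / h[2]


def solve_linear(A, b):
    """Solve a square system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_trajectory(P, observations):
    """Least-squares fit of [x0, y0, z0, vx, vy, vz] from (t, u, v) samples."""
    rows, rhs = [], []
    for t, u, v in observations:
        for k, pix in ((0, u), (1, v)):
            # (P[k] - pix * P[2]) . [X, Y, Z, 1] = 0 is linear in the unknowns.
            a = [P[k][c] - pix * P[2][c] for c in range(4)]
            rows.append([a[0], a[1], a[2], a[0] * t, a[1] * t, a[2] * t])
            rhs.append(0.5 * G * t * t * a[2] - a[3])
    # Normal equations (A^T A) x = A^T b.
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(6)] for i in range(6)]
    Atb = [sum(r[i] * y for r, y in zip(rows, rhs)) for i in range(6)]
    return solve_linear(AtA, Atb)


# Hypothetical camera at (0, -20, 0) looking along +Y, focal length 1000 px,
# principal point (640, 360).
P = [[1000.0, 640.0, 0.0, 12800.0],
     [0.0, 360.0, -1000.0, 7200.0],
     [0.0, 1.0, 0.0, 20.0]]
true_params = [0.0, 0.0, 1.0, 2.0, 5.0, 6.0]
observations = [(0.1 * i,) + project(P, ball_position(true_params, 0.1 * i))
                for i in range(11)]
estimate = fit_trajectory(P, observations)
```

With noise-free synthetic samples such as these, `estimate` should closely match `true_params`. Real 2D detections are noisy, so a practical system would likely need outlier rejection and a more careful solver than the normal equations used here.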
However, the reconstructed 3D trajectory may deviate from the actual trajectory because of physical factors that we do not model, such as air friction, ball spin rate, and spin axis. Moreover, object occlusion may cause errors in object segmentation and tracking. Therefore, it is indispensable to integrate information from multiple cameras. An important task in multi-view video processing is computing point correspondences, that is, finding the same feature points in different views. Since the detection of specific lines is more robust than the detection of specific points, we extract the endpoints or intersection points of the lines as feature points to compute the correspondence. We investigate the challenge of integrating the information from multiple cameras and perform multi-view video analysis. Furthermore, other multimodal features, such as the audio signal and the depth map, are also taken into consideration. Finally, we develop integrated systems for multi-view and multimodal intelligent multimedia content analysis, understanding and retrieval.
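As an illustration of using line intersections as feature points, the sketch below computes the intersection of two detected lines in homogeneous coordinates. It assumes each line has already been detected and is given by two image points. The function names and the parallelism threshold `eps` are hypothetical, and the abstract does not specify how the project detects lines or matches points across views.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def line_through(p, q):
    """Homogeneous line through two image points p = (x, y) and q = (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def intersection(l1, l2, eps=1e-12):
    """Intersection of two homogeneous lines, or None if they are (near-)parallel.

    eps is an assumed threshold; a real system would scale it to the image size.
    """
    x, y, w = cross(l1, l2)
    if abs(w) < eps:
        return None
    return (x / w, y / w)


# Two hypothetical court lines, each given by its two detected endpoints.
corner = intersection(line_through((0, 0), (2, 2)),
                      line_through((0, 2), (2, 0)))
```

The intersections found in each view can then serve as candidate feature points for computing correspondences between views.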
Official Document #:	NSC101-2218-E009-004
URI:	http://hdl.handle.net/11536/98424 https://www.grb.gov.tw/search/planDetail?id=2453346&docId=383728
Appears in Collections:	Research Plans