Title: Image Caption with Object Detection and Weighted Feature Fusion
Authors: Tseng, Wan-Ju; Wu, Bing-Fei
Institute of Electrical and Control Engineering
Keywords: Image Caption; Object Detection; Convolutional Neural Networks; Recurrent Neural Networks; Deep Learning
Issue Date: 2017
Abstract: Automatic image captioning connects computer vision and natural language processing to generate captions that are both grammatical and faithful to the image content. This thesis introduces object detection to obtain better feature representations: bounding box attributes alone drive the algorithm that allocates fusion weights, producing a weighted mixture of global and local features in which objects and background jointly determine the representation and concentrate the image information. In addition, the coordinates of detected objects are used to predict visual relationship labels, enabling recognition of human-object interactions.
The captioning model is built on a novel deep learning combination: a Convolutional Neural Network cascaded with a Recurrent Neural Network, serving respectively as the encoder of image features and as the language model that captures the sequential structure of sentences. A linear transformation layer is embedded between the two networks; it not only reduces the dimensionality of the input vectors to extract better compressed features but also alleviates the mismatch in input distributions (internal covariate shift). Finally, the model is evaluated on the MS COCO dataset, comprising 123,287 images and 616,435 caption sentences; experiments show improvements on all four metrics, BLEU-4, METEOR, ROUGE-L, and CIDEr, while the system sustains real-time performance at 26 FPS.
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070460015
http://hdl.handle.net/11536/142463
Appears in Collections: Thesis