Caption Extraction of A Document Clipping System by Layout Analysis

Title:	於文件剪輯系統中以版面分析為基礎的圖說擷取技術 Caption Extraction of A Document Clipping System by Layout Analysis
Authors:	劉一葦 Liu, Yi-Wei; 李錫堅 Dr. Lee, Hsi-Jian; Institute of Computer Science and Engineering
Keywords:	layout analysis; caption; photograph search; document analysis
Issue Date:	2002
Abstract:	In this thesis, we propose a document processing system that extracts the captions of photographs in document images. The extracted captions serve as keywords for searching a database for the associated photographs. We present two approaches: extracting captions within an existing document analysis system, and extracting them with an independent system. Captions are classified into two categories: normal captions and inside captions (captions located inside a photograph block). Normal captions are subdivided into three types, which are distinguished by six features: position, orientation of text lines, size of the text block, size of the characters, distance between blocks, and complexity of the layout. Within the existing document analysis system, we use the results of layout analysis directly. The first step is to select the photograph blocks, using block size as the selection criterion. The second step is to find caption candidates: the text blocks neighboring a photograph block are taken as its candidates. Normal caption extraction tests the features of each candidate to find the normal captions, and inside caption extraction extracts the captions located inside the photograph blocks.
In the independent system, we first binarize the document image. The image is then reduced and connected components are generated. Photograph blocks are extracted using the size of the connected components and the estimated character size. To generate text lines, we define seed component regions and find seed components within them. A dynamic threshold determines which connected components are grouped into a text line, and the rules we propose are used to merge text lines. Normal caption extraction and inside caption extraction then operate on the generated text lines, and text blocks are generated when needed.
In our experiments, we tested 238 document images containing 273 photographs. Of the 273 photographs, 14 do not fit our classification of captions. Caption extraction within the existing document analysis system achieves a success rate of 81% over all photographs, and 85.3% when the unclassifiable photographs are excluded. Caption extraction with the independent system achieves a success rate of 87.9% over all photographs, and 92.6% when the unclassifiable photographs are excluded.
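The two steps of the layout-based approach described in the abstract (select photograph blocks by size, then take neighboring text blocks as caption candidates) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the `Block` type, the `axis_gaps` helper, and the thresholds `MIN_PHOTO_AREA` and `MAX_GAP` are hypothetical names and values chosen for illustration, since the abstract gives no concrete criteria. The feature tests for the three normal caption types and the handling of inside captions are not modeled.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the abstract does not state the actual values.
MIN_PHOTO_AREA = 10_000  # minimum area (pixels^2) for a photograph block
MAX_GAP = 20             # maximum gap (pixels) for a text block to count as a neighbor


@dataclass(frozen=True)
class Block:
    """A layout-analysis block with an axis-aligned bounding box (assumed format)."""
    kind: str  # "photo" or "text"
    x0: int
    y0: int
    x1: int
    y1: int

    @property
    def area(self) -> int:
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def axis_gaps(a: Block, b: Block) -> tuple[int, int]:
    """Return the horizontal and vertical gaps between two boxes.

    A gap of 0 means the boxes touch or overlap along that axis.
    """
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0)
    return dx, dy


def select_photo_blocks(blocks, min_area=MIN_PHOTO_AREA):
    """Step 1: keep photograph blocks that are large enough."""
    return [b for b in blocks if b.kind == "photo" and b.area >= min_area]


def caption_candidates(photo, blocks, max_gap=MAX_GAP):
    """Step 2: text blocks directly above, below, or beside the photograph."""
    candidates = []
    for b in blocks:
        if b.kind != "text":
            continue
        dx, dy = axis_gaps(photo, b)
        # Adjacent means: aligned on one axis and close on the other.
        if (dx == 0 and dy <= max_gap) or (dy == 0 and dx <= max_gap):
            candidates.append(b)
    return candidates
```

In a fuller pipeline, each candidate returned by `caption_candidates` would then be tested against the six features listed in the abstract; those tests are omitted here because the abstract does not specify them.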
URI:	http://140.113.39.130/cdrfb3/record/nctu/#NT910392035 http://hdl.handle.net/11536/70106
Appears in Collections:	Thesis