Document Mosaic via Connected-Component Analysis

Title:	Document Mosaic via Connected-Component Analysis (利用連接元件分析來做文件拼圖)
Authors:	吳杭芫 Hung-Yuan Wu; 李錫堅 Hsi-Jian Lee; Institute of Computer Science and Engineering
Keywords:	connected-component (連接元件); document mosaic (文件拼圖)
Issue Date:	2001
Abstract:	In this thesis, we build a document matching and merging system that addresses several problems encountered when merging scanned images of Chinese newspapers: matching and merging plain text, titles, tables, pictures, and mixed-type images. Input images may be skewed, and the skew degrades character segmentation and character recognition, so we propose a skew-angle detection method and a fast skew-correction method that rotates the document image. Before merging, we perform document layout analysis to identify text blocks, title blocks, table blocks, and picture blocks. For plain text, we propose two methods. The first works on unrecognized characters and uses the number of connected components in each character as a feature. In most cases different characters have different numbers of connected components, so these counts can be used to locate the region shared by two overlapping text images, which are then merged into a larger text image. The second works on recognized characters: each recognized Chinese character is a distinct basic unit that is simple to compare, so we match characters directly to find the region shared by two overlapping texts and merge them into a larger text. For titles, we apply the same connected-component count feature to the unrecognized title characters, find the match between overlapping title images, and merge them into a larger title image. For tables, we first erase the horizontal and vertical lines, which reduces the table to a plain text image, and then apply the connected-component count feature to find the match between overlapping tables and merge them into a larger table. For pictures, we extract the larger connected components using preset thresholds and compare their heights and widths. Where both the heights and the widths are similar, we treat the components as matching, which gives the overlap between the pictures, and merge them into a larger picture. In our experiments, the matching and merging system performs well.
URI:	http://140.113.39.130/cdrfb3/record/nctu/#NT900392112 http://hdl.handle.net/11536/68520
Appears in Collections:	Theses
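
The first text-matching method in the abstract (matching unrecognized characters by their connected-component counts) can be illustrated with a minimal sketch. This is not the thesis implementation. It assumes 8-connectivity, characters already segmented into binary bitmaps, and an overlap that is a run of characters at the end of one fragment and the start of the other. The function names `count_components`, `find_overlap`, and `merge_sequences`, and the `min_len` parameter, are hypothetical.

```python
from collections import deque


def count_components(bitmap):
    """Count 8-connected components of foreground (1) pixels in a 2-D list.

    Assumption: 8-connectivity; the abstract does not state which is used.
    """
    rows = len(bitmap)
    cols = len(bitmap[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != 1 or seen[r][c]:
                continue
            # New component found: flood-fill it with a breadth-first search.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and bitmap[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def find_overlap(left_counts, right_counts, min_len=3):
    """Return the longest k >= min_len such that the last k counts of the
    left fragment equal the first k counts of the right fragment, else 0.

    Assumption: a hypothetical exact-match rule; a real system would likely
    tolerate some segmentation noise.
    """
    longest = min(len(left_counts), len(right_counts))
    for k in range(longest, max(min_len, 1) - 1, -1):
        if left_counts[-k:] == right_counts[:k]:
            return k
    return 0


def merge_sequences(left, right, overlap):
    """Concatenate two fragments, keeping the shared run only once."""
    return left + right[overlap:]


# Per-character component counts for two overlapping text fragments.
left = [1, 3, 2, 4, 2, 5]
right = [4, 2, 5, 1, 1]
k = find_overlap(left, right)
merged = merge_sequences(left, right, k)
```

Requiring a minimum run length matters because a single count is a weak feature: many different characters share the same number of connected components, so only a sequence of several counts is distinctive enough to locate the overlap.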