Title: 中文雜誌內對中英文字與圖混合之切字
Character Segmentation in Chinese Magazines with Mixed Alphabets, Numerals and Figures
Authors: 鄭紹余
Shau-Yu Cheng
李錫堅
Hsi-Jian Lee
Institute of Computer Science and Engineering
Keywords: 切字; character segmentation
Issue Date: 1998
Abstract: A general document processing system usually includes two major modules: character segmentation and character recognition. In this thesis, we present an automatic system that segments characters efficiently. The system contains two modules: document layout analysis and character segmentation. In the document layout analysis module, we first perform image reduction and connected-component extraction, then classify each connected component as an image component or a text component. Next, we merge the text components into text blocks and check whether any image component contains text components; if so, these are extracted and merged into the text blocks. Finally, we segment every text block into individual text lines. After the text lines have been segmented, we check each text block for an initial cap (an enlarged first character) and extract it if present. In the character segmentation module, we then segment each text line into Chinese characters, English letters, and numerals. In our experiments, the character segmentation accuracy is about 98.9%, and the processing time is about 5 seconds for a page containing 1158 characters. These results demonstrate the effectiveness of the proposed system.
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT870392024
http://hdl.handle.net/11536/64044
Appears in Collections: Thesis
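
The pipeline summarized in the abstract (connected-component extraction, component classification, text-line segmentation, character segmentation) can be illustrated with a minimal sketch. This is not the thesis's implementation: the record does not say how lines or characters are cut, so projection profiles are used here as a stand-in technique, text is assumed to run horizontally, and the function names and the max_text_size threshold are hypothetical.

```python
from collections import deque


def connected_components(img):
    """Bounding boxes (top, left, bottom, right) of 8-connected foreground
    regions in a binary image given as a list of rows of 0/1 values."""
    h = len(img)
    w = len(img[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:  # breadth-first flood fill of one component
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def classify(box, max_text_size):
    """Assumed rule: a component larger than max_text_size is an image."""
    top, left, bottom, right = box
    size = max(bottom - top + 1, right - left + 1)
    return "text" if size <= max_text_size else "image"


def split_runs(profile):
    """(start, end) index pairs of maximal runs of non-zero values."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs


def text_lines(img):
    """Cut horizontal text lines at empty rows (row projection profile)."""
    return split_runs([sum(row) for row in img])


def characters_in_line(img, top, bottom):
    """Cut one text line at empty columns (column projection profile)."""
    width = len(img[0])
    cols = [sum(img[y][x] for y in range(top, bottom + 1))
            for x in range(width)]
    return split_runs(cols)
```

A real page would need more than this sketch provides: touching characters, Chinese characters made of several disconnected strokes, and mixed-width English letters and numerals all defeat a plain empty-column cut, which is presumably where the thesis's dedicated character segmentation module does its work.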