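Multi-Class Character Recognition by Decision-Tree Approaches and Direct Use of System Font Data for Automatic Digital Book Construction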

Title:	Multi-Class Character Recognition by Decision-Tree Approaches and Direct Use of System Font Data for Automatic Digital Book Construction (利用決策樹方法及直接使用系統字型資料作多種類文字辨識及電子書自動建構)
Authors:	Chia-Heng Chen (陳嘉亨); Wen-Hsiang Tsai (蔡文祥); Institute of Computer Science and Engineering
Keywords:	multi-type character recognition; character type classification; optical character recognition; system font data; decision tree; template matching; digital book
Issue Date:	2001
Abstract:	Based on image analysis and character recognition techniques, a system that automatically converts a printed book into a digital book is proposed. The main task of the character recognition work is to recognize multiple types of characters. No character image data are needed for learning; the system fonts are used directly as the reference characters. The proposed system has four phases: character type classification, character recognition, page layout analysis, and digital book construction and display. In the character type classification phase, four types of characters are handled: Chinese characters in titles, and Chinese, alphanumeric, and punctuation characters in body text. A decision-tree method for classifying these character types is proposed. In the character recognition phase, a method that uses system font data directly, without learning from reference data, is proposed first. Next, for printed Chinese characters in body text, a recognition method based on decision trees and template matching is proposed. For the remaining types, namely Chinese characters in titles and alphanumeric and punctuation characters in body text, a recognition method based mainly on template matching is also proposed. In these methods, pairs of image components are used extensively to aid recognition.
In the page layout analysis phase, all connected components are segmented from the image content using moment-preserving thresholding and region-growing techniques. Different compression techniques are then applied to the different components of the page images to improve the overall compression ratio of the resulting digital book. Good experimental results demonstrate the feasibility of the proposed methods.
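The connected-component step mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the page has already been binarized (1 = ink, 0 = background; the moment-preserving thresholding itself is not shown), it assumes 4-connectivity, and the function names `connected_components` and `bounding_box` are hypothetical.

```python
from collections import deque


def connected_components(image):
    """Group foreground pixels into 4-connected regions by region growing.

    `image` is a list of rows holding 1 (ink) or 0 (background).
    Returns a list of components, each a list of (row, col) pixels,
    ordered by the raster position of each component's seed pixel.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Grow a new region outward from this unvisited seed pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (top, left, bottom, right) of a component, inclusive."""
    rows = [p[0] for p in pixels]
    cols = [p[1] for p in pixels]
    return min(rows), min(cols), max(rows), max(cols)


# Toy "page" with two separate blobs of ink (an assumption for illustration).
page = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
]
components = connected_components(page)
boxes = [bounding_box(pixels) for pixels in components]
```

Each bounding box in `boxes` marks a candidate block that later stages could classify by character type or compress separately.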
URI:	http://140.113.39.130/cdrfb3/record/nctu/#NT900394070 http://hdl.handle.net/11536/68597
Appears in Collections:	Thesis