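Multi-Class Character Recognition by Decision-Tree Approaches and Direct Use of System Font Data for Automatic Digital Book Construction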

Full metadata record

DC Field	Value	Language
dc.contributor.author	陳嘉亨	en_US
dc.contributor.author	Chia-Heng Chen	en_US
dc.contributor.author	蔡文祥	en_US
dc.contributor.author	Wen-Hsiang Tsai	en_US
dc.date.accessioned	2014-12-12T02:27:51Z	-
dc.date.available	2014-12-12T02:27:51Z	-
dc.date.issued	2001	en_US
dc.identifier.uri	http://140.113.39.130/cdrfb3/record/nctu/#NT900394070	en_US
dc.identifier.uri	http://hdl.handle.net/11536/68597	-
dc.description.abstract	利用影像分析及文字辨識的技巧，我們提出一個可以自動建構電子書的方法。文字辨識的主要工作，是希望能辨識多種類的文字。此方法不需要使用文字影像資料來學習，而是直接使用系統字型來當做參考文字。在我們的方法中，有四個階段：文字型別的分類、文字辨識、書頁版面的分析，以及電子書的建構展示。在文字型別的分類階段，我們處理四種文字型別，第一種型別是標題中的中文字，而其餘三種型別則為文章中的中文字、英數文字和標點符號。我們利用決策樹提出一個對文字型別作分類的方法。在文字辨識的階段中，首先我們提出一個不需學習參考資料而直接使用系統字型資料的方法。接著，針對文章中的中文字，我們提出一個利用決策樹及樣板相配來辨識印刷中文字的方法。而針對標題中的中文字、文章中的英數字和標點符號，我們也提出一個主要是利用樣板相配的辨識方法。在這些方法中，成對的影像組成成分廣泛地被利用來協助文字辨識的工作。在書頁版面的分析階段中，我們利用矩量保持二值化及區塊生長技術，從影像內容中取得所有的連接小塊。針對書頁影像中不同的組成成分，我們使用不同的壓縮技術來壓縮它們，以改善整體的壓縮率。良好的實驗結果，顯示了我們所提出方法的可行性。	zh_TW
dc.description.abstract	Based on image analysis and character recognition techniques, a system for automatically converting a printed book into a digital version is proposed. The character recognition component, the major part of the work, recognizes multiple types of characters. No character image data are needed for learning; the system fonts are used directly as the reference characters. The proposed system has four phases: character type classification, character recognition, page layout analysis, and digital book construction and display. In the character type classification phase, four types of characters are handled: Chinese characters in titles, and Chinese characters, alphanumeric characters, and punctuation marks in body text. A decision-tree method for classifying these character types is proposed. In the character recognition phase, a method that uses system font data directly, without learning from reference data, is proposed first. Next, for printed Chinese characters in body text, a recognition method based on decision trees and template matching is proposed. For the remaining character types (Chinese characters in titles, and alphanumeric characters and punctuation marks in body text), a recognition method based mainly on template matching is proposed. In these methods, pairs of image components are used extensively to aid recognition. In the page layout analysis phase, all connected components are segmented from the image content using moment-preserving thresholding and region-growing techniques. Different compression techniques are then applied to the different components of the page images to improve the overall compression ratio of the resulting digital book. Good experimental results show the feasibility of the proposed methods.	en_US
dc.language.iso	en_US	en_US
dc.subject	多種類文字辨識	zh_TW
dc.subject	文字型別分類	zh_TW
dc.subject	文字辨識	zh_TW
dc.subject	系統字型資料	zh_TW
dc.subject	決策樹	zh_TW
dc.subject	樣板相配	zh_TW
dc.subject	電子書	zh_TW
dc.subject	multi-type character recognition	en_US
dc.subject	character type classification	en_US
dc.subject	optical character recognition	en_US
dc.subject	system font data	en_US
dc.subject	decision tree	en_US
dc.subject	template matching	en_US
dc.subject	digital book	en_US
dc.title	利用決策樹方法及直接使用系統字型資料作多種類文字辨識及電子書自動建構	zh_TW
dc.title	Multi-Class Character Recognition by Decision-Tree Approaches and Direct Use of System Font Data for Automatic Digital Book Construction	en_US
dc.type	Thesis	en_US
dc.contributor.department	資訊科學與工程研究所	zh_TW
Appears in Collections:	Theses