Title: A Study of Handwritten Chinese Text Recognition (手寫中文文句辨識之研究)
Author: Cheng-Huang Tung (董呈煌)
Hsi-Jian Lee (李錫堅)
Institute of Computer Science and Engineering
Keywords: handwritten Chinese text recognition; handwritten Chinese character generator; candidate selection; matching module; language model; contextual postprocessing; unknown words
Issue date: 1993
Abstract: In this thesis, we propose a handwritten Chinese text recognition system that covers handwritten Chinese character recognition, contextual postprocessing, and maintenance of the dictionary. We first build a handwritten Chinese character generator that produces handwritten character images. Using the generated images, we measure the performance of different methods for segmenting a character image into meshes and derive the feature extraction for an image mesh. After this performance measurement, we establish a character recognition system consisting of a candidate selection module and a matching module.
Because character recognition still accounts for most of the execution time, we propose a multi-stage candidate pre-selection module to reduce it. Each stage uses a single feature computed from the input character image to eliminate impossible character categories from the database. The features used in candidate pre-selection are ordered by their reduction rates, evaluated on a set of training characters. We also propose a method that organizes the character database as a classification tree for pre-selection. The experimental results show that the proposed model reduces the total execution time significantly without decreasing the precision of character recognition.
To increase recognition accuracy, we present a new approach for detecting and correcting characters that the matching module identifies erroneously. Two matching modules recognize each input character image simultaneously at the recognition stage. If their results for a character image differ, the image is rejected. The second matching module is constructed by maximizing the accuracy on the accepted training characters. Because the recognition stage recognizes most input characters correctly and outputs only a small number of candidates for each rejected character, a character bigram Markov language model can use the contextual information to select the best candidate for each rejected character accurately.
Because a language model can be combined freely with other recognition systems, we further propose a new language model based on word classes to improve language-model performance. The words in the dictionary should be grouped into classes by meaning, but the original dictionary contains no semantic information, so the semantic features of word groups are trained with a second dictionary, the thesaurus 同義詞詞林. According to semantic similarity, all word groups are partitioned into m word classes, and by looking up these m classes the words of the original dictionary are likewise divided into m classes. The parameter space needed to describe the bigram contextual information is therefore only m × m. Experiments show that the new language model outperforms the character bigram language model by 3.2%.
In contextual postprocessing, finding the words contained in the candidate character sets takes the most time. To make this step more efficient, the words in the dictionary are sorted by their first two characters. Because the index array covering all pairs of leading characters is sparse, we compress it with the row displacement method. Experiments show that the compression ratio reaches 224 and that, with a dictionary of this structure, the words contained in the candidate sets can be found quickly, which greatly speeds up contextual postprocessing.
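The row displacement compression mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract gives no details, so the first-fit placement, the owner-check array, and the names compress_rows and lookup are all assumptions made for illustration. The toy table stands in for the sparse index array keyed by a word's first two characters, with small integers as hypothetical character codes and dictionary positions.

```python
def compress_rows(table):
    """Pack the rows of a sparse 2-D table into one shared 1-D array.

    table maps row -> {column: value}, with non-negative integer columns.
    Returns (offsets, values, owners): entry (row, col) is stored at
    values[offsets[row] + col], and owners records which row filled
    each slot so that lookups can reject slots used by another row.
    """
    values, owners, offsets = [], [], {}
    # Assumption: place denser rows first (a common first-fit heuristic).
    for row in sorted(table, key=lambda r: -len(table[r])):
        cols = table[row]
        offset = 0
        # Shift the row right until none of its entries hits a used slot.
        while any(offset + c < len(owners) and owners[offset + c] is not None
                  for c in cols):
            offset += 1
        needed = offset + max(cols, default=-1) + 1
        if needed > len(values):
            grow = needed - len(values)
            values.extend([None] * grow)
            owners.extend([None] * grow)
        for c, v in cols.items():
            values[offset + c] = v
            owners[offset + c] = row
        offsets[row] = offset
    return offsets, values, owners


def lookup(offsets, values, owners, row, col):
    """Return the stored value for (row, col), or None if it is absent."""
    if row not in offsets:
        return None
    pos = offsets[row] + col
    if 0 <= pos < len(owners) and owners[pos] == row:
        return values[pos]
    return None


# Hypothetical toy index: (first character code, second character code)
# -> position of the first dictionary word starting with that pair.
toy_table = {0: {0: 10, 3: 11}, 1: {1: 12}, 2: {0: 13, 2: 14}}
offsets, values, owners = compress_rows(toy_table)
```

A lookup costs one addition and one comparison, which is why this layout suits the repeated word searches in contextual postprocessing. The owner check is what keeps interleaved rows from returning each other's entries.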
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT820392082
http://hdl.handle.net/11536/57893
Appears in collections: Theses