Evaluation and Speeding Up of Statistical Chinese Character Recognition (統計式中文文字辨識的評估與加速)

Title:	統計式中文文字辨識的評估與加速 Evaluation and Speeding Up of Statistical Chinese Character Recognition
Authors:	郭啟璋 Kuo, Chi-Chang; 李錫堅 Lee, Hsi-Jian; 資訊科學與工程研究所 (Institute of Computer Science and Engineering)
Keywords:	筆劃穿越數特徵;邊緣向量特徵;最近平均值再分類演算法;字體確認;分支界定法;Crossing count features;Contour directional features;Nearest-mean reclassification algorithm;Font identification;Branch-and-bound algorithm
Issue Date:	1995
Abstract:	This thesis evaluates statistical features for handwritten Chinese character recognition, studies clustering methods for speeding up recognition, and addresses font identification for machine-printed Chinese characters. Chinese has 5401 commonly used characters. Some of them have very similar shapes, and handwriting varies widely from writer to writer; both make Chinese character recognition difficult. First, we partition an input character image nonuniformly into an 8x8 grid of sub-regions to absorb differences in writing habits between writers. We then extract two sets of features as the basis for recognition: crossing count features and contour directional features. The former are also used for preclustering. Clustering uses the nearest-mean reclassification algorithm. In our experiments, the best results were obtained by dividing the 5401 character categories into 300 clusters and selecting the top n clusters; the union of the characters in these clusters forms the candidate set, so final matching is performed only against the candidates rather than against all 5401 characters. During training, we compute the variance of each feature over 100 samples of each category. Matching takes this variance into account: a large variance indicates that the feature is unstable, so it receives a smaller weight, while a small variance indicates that the feature is stable, so it receives a larger weight. We use the branch-and-bound algorithm to determine the order in which feature values are compared, which filters out unnecessary computation and speeds up the system. Printed Chinese documents often contain several fonts, so to improve the recognition rate we identify the font before recognition. We collect statistics on the features of each font, construct a fuzzy membership function for each font's distribution of each feature independently, and then use a decision tree based on these membership functions to identify the font of each input character. The handwritten recognition system achieves a recognition rate of 91% on a test set of 5401 handwritten Chinese characters and recognizes 2.66 characters per second on average on a workstation. For machine-printed characters, the font identification rate is 82% on a test set of 1500 characters.
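The variance-weighted matching described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the inverse-variance weight formula, the `eps` smoothing term, and the use of simple partial-distance pruning in place of the branch-and-bound ordering are all assumptions made for illustration.

```python
from statistics import pvariance


def inverse_variance_weights(samples, eps=1e-6):
    """Weight each feature by 1 / (variance + eps) over one category's samples.

    Assumption: the abstract only says unstable (high-variance) features get
    smaller weights; the exact formula here is hypothetical.
    """
    n_features = len(samples[0])
    return [
        1.0 / (pvariance([s[j] for s in samples]) + eps)
        for j in range(n_features)
    ]


def weighted_match(x, means, weights, order=None):
    """Return (index, distance) of the candidate closest to x.

    means[c] and weights[c] are the feature means and weights of candidate c.
    The running distance to a candidate is abandoned as soon as it reaches
    the best distance found so far, which skips unnecessary computation.
    """
    if order is None:
        order = list(range(len(x)))  # hypothetical default: natural order
    best_idx, best_dist = -1, float("inf")
    for c, (mean, w) in enumerate(zip(means, weights)):
        dist = 0.0
        for j in order:
            diff = x[j] - mean[j]
            dist += w[j] * diff * diff
            if dist >= best_dist:
                break  # this candidate cannot beat the current best
        else:
            # Loop finished without pruning: c is the new best candidate.
            best_idx, best_dist = c, dist
    return best_idx, best_dist


# Two candidates with unit weights; x is much closer to the second one.
print(weighted_match([9.0, 9.0], [[0.0, 0.0], [10.0, 10.0]],
                     [[1.0, 1.0], [1.0, 1.0]]))  # prints (1, 2.0)
```

Passing an `order` that visits the most discriminating features first makes the pruning test fire earlier; how the thesis derives that order with branch-and-bound is not specified in the abstract.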
URI:	http://140.113.39.130/cdrfb3/record/nctu/#NT840392041 http://hdl.handle.net/11536/60385
Appears in Collections:	Thesis

APA	Kuo, C.-C., & Lee, H.-J. (1995). 統計式中文文字辨識的評估與加速 [Evaluation and speeding up of statistical Chinese character recognition]. http://hdl.handle.net/11536/60385
Bibtex	@misc{Kuo1995, title={統計式中文文字辨識的評估與加速}, author={Kuo, Chi-Chang and Lee, Hsi-Jian}, year={1995}, note={http://hdl.handle.net/11536/60385}, url={https://ir.lib.nycu.edu.tw/handle/11536/60385}}