Title: 統計式中文文字辨識的評估與加速
Evaluation and Speeding Up of Statistical Chinese Character Recognition
Authors: 郭啟璋
Kuo, Chi-Chang
李錫堅
Hsi-Jian Lee
Institute of Computer Science and Engineering
Keywords: Crossing count features; Contour directional features; Nearest-mean reclassification algorithm; Font identification; Branch-and-bound algorithm
Issue Date: 1995
Abstract: This thesis evaluates the statistical features used in handwritten Chinese character recognition, speeds up recognition by clustering, and identifies font types in machine-printed Chinese characters. There are 5401 commonly used Chinese characters. Some have very similar shapes, and writing habits vary from writer to writer; both make Chinese character recognition difficult.

First, we divide an input character image nonuniformly into an 8x8 grid to absorb the differences between writers. We then extract two sets of features, crossing count features and contour directional features, as the basis for recognition. The crossing count features are also used for preliminary clustering, which is done with the nearest-mean reclassification algorithm. In our experiments, the best results came from partitioning the 5401 character categories into 300 clusters. We select the top n clusters and take the union of their characters as candidates, so the final matching stage compares the input only against these candidates rather than against all 5401 characters.

During training, we compute the variance of each feature over 100 samples of each category. A large variance means the feature is unstable, so we give it a smaller weight; a small variance means the feature is stable, so we give it a larger weight. A branch-and-bound algorithm determines the order in which feature values are compared, and filtering out unnecessary computation speeds up matching.

Printed Chinese documents often contain several fonts, so to raise the recognition rate we identify the font before recognition. We collect feature statistics for each font, construct a fuzzy membership function for each font's distribution of each feature independently, and then use a decision tree based on these membership functions to identify the font of each input character.

On a test set of 5401 handwritten Chinese characters, the system achieves a recognition rate of 91% and recognizes 2.66 characters per second on average on a workstation. On a test set of 1500 machine-printed characters, the font identification rate is 82%.
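The variance-weighted matching and the pruning of unnecessary computation described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the inverse-variance weighting formula, the single shared variance vector, the ordering heuristic (largest weight first), and all function names are assumptions made for illustration, and the early cutoff is a simple partial-distance bound rather than the author's actual branch-and-bound procedure.

```python
import math


def inverse_variance_weights(variances, eps=1e-6):
    """Assumed weighting: 1 / variance, so unstable features count less."""
    return [1.0 / (v + eps) for v in variances]


def weighted_distance_with_cutoff(x, mean, weights, order, best_so_far):
    """Accumulate the weighted squared distance one feature at a time.

    Returns None as soon as the partial sum reaches best_so_far, because
    the full distance can then no longer beat the current best candidate.
    """
    total = 0.0
    for i in order:
        diff = x[i] - mean[i]
        total += weights[i] * diff * diff
        if total >= best_so_far:
            return None  # pruned: remaining features are never evaluated
    return total


def classify(x, candidates, variances):
    """Pick the nearest candidate; candidates maps label -> mean vector."""
    weights = inverse_variance_weights(variances)
    # Assumed heuristic: compare high-weight (stable) features first so
    # that poor candidates exceed the bound after only a few terms.
    order = sorted(range(len(weights)), key=lambda i: -weights[i])
    best_label, best_dist = None, math.inf
    for label, mean in candidates.items():
        d = weighted_distance_with_cutoff(x, mean, weights, order, best_dist)
        if d is not None:
            best_label, best_dist = label, d
    return best_label, best_dist
```

A call such as `classify([0.9, 1.1, 4.0], {"A": [0, 0, 0], "B": [1, 1, 1], "C": [5, 5, 5]}, [1.0, 1.0, 100.0])` selects `"B"`: the third feature has a large variance and contributes little, and candidate `"C"` is abandoned after its first feature because its partial distance already exceeds the best distance found so far.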
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT840392041
http://hdl.handle.net/11536/60385
Appears in Collections: Thesis