Title: 統計式中文文字辨識的評估與加速
Evaluation and Speeding Up of Statistical Chinese Character Recognition
Authors: 郭啟璋
Kuo, Chi-Chang
Hsi-Jian Lee
Keywords: 筆劃穿越數特徵;邊緣向量特徵;最近平均值再分類演算法;字體確認;分支界定法;Crossing count features;Contour directional features;Nearest-mean reclassification algorithm;Font identification;Branch-and-bound algorithm
Issue Date: 1995
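Abstract: This thesis studies the choice of features and
clustering-based speed-up methods for handwritten Chinese
character recognition, [...]. First, we partition an input
character nonuniformly into an 8x8 grid [...] crossing count
features) and contour directional features [...] clustering uses
the nearest-mean reclassification algorithm [...] there is no
need to match against all 5401 characters; only the candidate
characters need to be matched. Through training on the database
[...] to achieve a speed-up. Printed Chinese documents often
contain several fonts, so to raise the recognition rate we must
identify the font before recognition (font identification). We
collect statistics on the features of each font, compute fuzzy
membership functions from them, and then use a decision tree to
identify the font. Our system currently recognizes handwritten
characters with an accuracy of up to 91%.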
The goal of this thesis is to evaluate the performance of
statistical features, to reduce the execution time of
handwritten Chinese character recognition, and to identify font
types in machine-printed Chinese characters. There are 5401
commonly used Chinese characters. Some of them are very similar,
and handwriting varies widely from writer to writer. These are
the problems to be solved in Chinese OCR. First, we segment an
input character image into sub-regions nonuniformly to absorb
the differences between writers.
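The nonuniform segmentation step can be illustrated with a small
sketch. The abstract does not say how the cut positions are chosen,
so the rule below (place the cuts so that each band holds a similar
share of the black pixels) is an assumption, and the names
equal_mass_cuts and nonuniform_grid are hypothetical.

```python
def equal_mass_cuts(profile, bands):
    """Return bands + 1 cut positions with a similar amount of ink per band.

    profile[i] is the number of black pixels in row (or column) i.
    Assumed rule, not taken from the thesis: cut at the ink quantiles.
    """
    total = sum(profile)
    n = len(profile)
    if total == 0:
        # Blank image: fall back to a uniform split.
        return [round(k * n / bands) for k in range(bands + 1)]
    cuts = [0]
    running = 0
    for i, count in enumerate(profile):
        running += count
        # Close a band each time the cumulative ink reaches the next quantile.
        # A very dense row can close several bands at once (empty bands).
        while len(cuts) < bands and running * bands >= len(cuts) * total:
            cuts.append(i + 1)
    cuts.append(n)
    return cuts


def nonuniform_grid(image, bands=8):
    """Row and column cut positions for a binary image (list of 0/1 rows)."""
    row_profile = [sum(row) for row in image]
    col_profile = [sum(col) for col in zip(*image)]
    return equal_mass_cuts(row_profile, bands), equal_mass_cuts(col_profile, bands)
```

With bands=8 this yields an 8x8 grid whose cells shrink where the
strokes are dense and grow where they are sparse, which is one way
a partition could adapt to different writing styles.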
We select crossing count features and contour directional
features as the basis for recognition. The crossing count
features are also used for clustering. We partition the 5401
character categories into 300 clusters by means of the
nearest-mean reclassification algorithm and, for an input
character, select the top n clusters. The union of the
characters in these clusters forms the set of candidate
characters, which are then matched further. In the training
period we compute the variance of each feature over 100 samples
of each category. If the variance of a feature is large, the
feature is unstable in that dimension and we give it a smaller
weight; otherwise we give it a larger weight. We use the
branch-and-bound matching algorithm to speed up the system.
Chinese articles generally contain several font types. To
increase the recognition rate, we must identify the font type.
A membership function is constructed for each font's
distribution of each feature independently. We then use a
decision tree based on the membership functions to identify the
font type of each input character. In the handwritten
recognition system, we test 5401 Chinese characters and the
recognition rate is 91%. We can recognize 2.66 characters per
second on average. In the machine-printed recognition system,
we test 1500 characters, and the font identification rate is
82%.
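The variance-based weighting and the pruned matching described in
the abstract can be sketched as follows. The abstract gives neither
the weight formula nor the bounding rule, so both are assumptions
here: weights are taken as 1 / (variance + eps), and the bound is a
simple partial-distance test that abandons a candidate as soon as
its running distance reaches the best distance found so far. The
function names are hypothetical, and the thesis's actual
branch-and-bound algorithm may differ.

```python
def inverse_variance_weights(samples, eps=1e-6):
    """Per-feature weights from the training samples of one category.

    samples is a list of equal-length feature vectors (lists of floats).
    Assumed formula: weight = 1 / (variance + eps).
    """
    count = len(samples)
    dims = len(samples[0])
    weights = []
    for d in range(dims):
        mean = sum(s[d] for s in samples) / count
        var = sum((s[d] - mean) ** 2 for s in samples) / count
        weights.append(1.0 / (var + eps))  # large variance -> small weight
    return weights


def match(features, templates):
    """Return (label, distance) of the nearest candidate template.

    templates maps a label to (mean_vector, weights).
    """
    best_label, best_dist = None, float("inf")
    for label, (mean, weights) in templates.items():
        dist = 0.0
        for x, m, w in zip(features, mean, weights):
            dist += w * (x - m) ** 2
            if dist >= best_dist:
                break  # partial distance already too large: prune
        else:
            # Loop finished without pruning, so this candidate is the new best.
            best_label, best_dist = label, dist
    return best_label, best_dist
```

In use, templates would hold only the candidate characters drawn
from the selected clusters, so both the candidate set and the
per-candidate work shrink.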
Appears in Collections: Thesis