Title: 使用混淆文字來改善文字辨識率
Using Confusion Characters to Improve Character Recognition Rate
Authors: 林裕章
Lin, Yeu-Chang
李錫堅
Lee, Hsi-Jian
Institute of Computer Science and Engineering
Keywords: 混淆文字;文字辨識率;confusion characters;character recognition rate
Issue Date: 1995
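Abstract: This thesis proposes a method for improving the
character recognition rate through the analysis of confusion
sets. The confusion characters similar to a given character are
collected by taking that character's 100 training samples from
the CCL/HCCR database and analyzing the results of recognizing
them. In the training phase, a character is first recognized by
the character recognition system OCR1, which outputs the top 5
candidates; each candidate is given a reversed-rank weight
according to the position at which it appears. Once every
training sample of all 5401 characters has been recognized, the
confusion sets of the 5401 characters are complete. Membership
is decided as follows: when a character is recognized and the
weight of a candidate in its recognition result exceeds a
threshold, that character is added to the candidate's confusion
set.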
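
The training step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the function name `build_confusion_sets`, the input layout (one `(true_char, ranked_candidates)` pair per training sample), and the exact weights 5, 4, 3, 2, 1 for ranks 1 to 5 are hypothetical, and the threshold is left as a parameter because the abstract gives no value for it.

```python
from collections import defaultdict


def build_confusion_sets(results, threshold, top_k=5):
    """Build confusion sets from first-stage recognition results.

    results: iterable of (true_char, ranked_candidates) pairs, one per
    training sample (an assumed layout, not taken from the thesis).
    """
    # weight_sum[candidate][true_char] accumulates reversed-rank weights.
    weight_sum = defaultdict(lambda: defaultdict(int))
    for true_char, candidates in results:
        for rank, cand in enumerate(candidates[:top_k]):
            # Assumed weighting: first place gets top_k, last place gets 1.
            weight_sum[cand][true_char] += top_k - rank

    # An input character joins a candidate's confusion set when its
    # summed weight for that candidate exceeds the threshold.
    return {
        cand: {ch for ch, w in sums.items() if w > threshold}
        for cand, sums in weight_sum.items()
    }
```

Note the direction of the relation: the set is keyed by the output candidate and holds the input characters that tend to be recognized as it, which is what the second stage needs when it only sees the first-stage output.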
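The similar characters in a confusion set differ substantially
in certain specific parts. We select the features of these
differing parts to distinguish the similar characters. However,
when a confusion set contains too many characters, it is hard to
find suitable features. We therefore use crossing count features
to cluster the characters of each confusion set into smaller
groups, so that the characters that are truly most similar fall
into the same subgroup. Finally, for the similar characters in
each subgroup, we compute weights for their features; these
weights measure how important each feature is in distinguishing
the similar characters.
In the recognition phase, each character input to the
recognition system OCR1 yields, after recognition, a single most
likely output candidate. The characters in the confusion set
corresponding to this output candidate are then passed to the
recognition system OCR2, which weighs the importance of each
character's features and makes the final decision. Experiments
show that the correct character falls within the confusion-set
subgroup in 93.76% of cases, at which point the average subgroup
size is 9.01 characters. The overall recognition rate of our
system is 86.97%.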
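
To make the two reported figures concrete, the sketch below shows one plausible way a hit rate and an average subgroup size could be computed. It is an assumption-laden illustration: the name `subgroup_stats` and the input layout are hypothetical, and the thesis may average subgroup sizes over subgroups rather than over test samples as done here.

```python
def subgroup_stats(test_cases):
    """Compute (hit_rate, average_subgroup_size).

    test_cases: list of (true_char, subgroup) pairs, where subgroup is
    the set of characters handed to the second stage for that sample
    (an assumed layout, not taken from the thesis).
    """
    n = len(test_cases)
    # A hit means the correct character is among the second-stage candidates.
    hits = sum(1 for true_char, subgroup in test_cases if true_char in subgroup)
    total_size = sum(len(subgroup) for _, subgroup in test_cases)
    return hits / n, total_size / n
```

If the second stage only ever chooses among the characters in the subgroup, the hit rate is an upper bound on the final recognition rate, which is consistent with the reported 93.76% and 86.97%.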
This thesis proposes a confusion-set analysis approach to
improve the character recognition rate. We collect the characters
confused with a given character by analyzing the recognition
results of its 100 training samples taken from the CCL/HCCR
database. In the training phase, a character is first recognized
by an OCR system, OCR1, which outputs the top five candidates.
Each candidate is assigned a reversed-rank weight according to
its position in the output. After all training samples of all
5401 characters have been recognized, we create a confusion set
for each character: if the reversed-rank weight sum of an output
candidate is greater than a threshold, the input character is
stored in the confusion set of that output candidate.
Similar characters in a confusion set differ in certain parts,
and we select features from these distinct parts to distinguish
them. However, it is hard to select suitable features when a
confusion set is large. We therefore cluster the characters of
each confusion set into subgroups by using crossing count
features. Then, we calculate weights for all features among the
characters in each subgroup to select the most distinguishing
features.
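
For readers unfamiliar with crossing counts, the following sketch computes them in the usual sense: the number of background-to-stroke transitions along each horizontal and vertical scan line of a binarized character image. The function name and the list-of-lists image layout are assumptions for illustration; the abstract does not say which scan lines or what clustering procedure the thesis uses.

```python
def crossing_counts(image):
    """Return (row_counts, col_counts) for a binary character image.

    image: list of rows, each a list of 0 (background) / 1 (stroke)
    pixels (an assumed layout, not taken from the thesis).
    """
    def count(line):
        # A crossing is a background-to-stroke transition along the line.
        prev, crossings = 0, 0
        for pixel in line:
            if pixel and not prev:
                crossings += 1
            prev = pixel
        return crossings

    rows = [count(row) for row in image]
    cols = [count(col) for col in zip(*image)]
    return rows, cols
```

Characters whose crossing-count vectors are close could then be grouped into the same subgroup by any standard clustering method.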
In the recognition phase, we obtain the most likely output
character from the OCR system OCR1 for each input character.
The characters in the confusion set of that output character are
then examined by another OCR system, OCR2, to produce the final
recognition result. Experimental results show that the correct
character falls within the confusion-set subgroup for 93.76% of
inputs, with an average subgroup size of 9.01 characters. The
final recognition rate of our system is 86.97%.
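
The two-stage flow can be sketched as below. Everything specific here is an assumption made for illustration: the abstract does not describe how OCR2 scores characters, so a weighted squared Euclidean distance to a per-character prototype stands in for it, and `ocr1`, `prototypes`, and `weights` are hypothetical names.

```python
def weighted_distance(x, prototype, weights):
    """Squared Euclidean distance with per-feature weights."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, prototype))


def recognize(x, ocr1, confusion_sets, prototypes, weights):
    """Two-stage recognition sketch.

    x: feature vector of the input character.
    ocr1: callable returning the first-stage best candidate for x.
    confusion_sets: dict mapping a first-stage output to confusable characters.
    prototypes: dict mapping a character to a reference feature vector.
    weights: dict mapping a first-stage output to per-feature weights.
    """
    first = ocr1(x)  # stage 1: single most likely candidate
    candidates = confusion_sets.get(first, set()) | {first}
    w = weights.get(first) or [1.0] * len(x)  # uniform weights as a fallback
    # Stage 2 (stand-in for OCR2): re-rank only the confusable characters.
    return min(
        sorted(candidates),
        key=lambda ch: weighted_distance(x, prototypes[ch], w),
    )
```

Restricting the second stage to a small candidate set is what makes feature weights practical: the weights only need to separate a handful of similar characters rather than all 5401.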
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT840392066
http://hdl.handle.net/11536/60412
Appears in Collections: Thesis