Title: 使用混淆文字來改善文字辨識率
Using Confusion Characters to Improve Character Recognition Rate
Author: 林裕章
Lin, Yeu-Chang
李錫堅
Hsi-Jian Lee
Institute of Computer Science and Engineering
Keywords: 混淆文字; 文字辨識率; confusion characters; character recognition rate
Issue Date: 1995
Abstract: This thesis proposes a method that improves the character recognition rate by analyzing confusion sets. We collect the characters that are confused with a given character by taking 100 training samples of that character from the CCL/HCCR database and analyzing how those samples are recognized. In the training phase, each sample is first recognized by an OCR system, OCR1, which outputs its top five candidates. Each candidate is assigned a reversed-rank weight according to its position in the candidate list. After the training samples of all 5401 characters have been recognized, we create a confusion set for each character: if the reversed-rank sum of an output candidate exceeds a threshold, the input character is added to the confusion set of that candidate.
Similar characters in a confusion set differ markedly in certain parts, and we select features from these distinct parts to tell them apart. However, suitable features are hard to find when a confusion set is large. We therefore cluster the characters of each confusion set into smaller subgroups using crossing count features, so that the most similar characters fall into the same subgroup. Finally, we compute weights for the features of the characters in each subgroup; these weights measure how important each feature is for distinguishing the similar characters.
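The training-phase construction of confusion sets described above can be sketched as follows. This is only a minimal illustration, not the thesis implementation: the abstract does not give the exact weight values or the threshold, so the sketch assumes a weight of 5 for the first candidate down to 1 for the fifth, and the function name `build_confusion_sets` and the input layout are hypothetical.

```python
from collections import defaultdict

TOP_K = 5  # OCR1 outputs its top five candidates


def build_confusion_sets(results, threshold):
    """Build confusion sets from OCR1 training results.

    results: iterable of (true_char, candidates), where candidates is the
             ranked OCR1 output for one training sample, best first.
    threshold: assumed cut-off on the accumulated reversed-rank weight.
    Returns a dict mapping each output candidate to the set of input
    characters that are confused with it.
    """
    # rank_sum[true_char][candidate] accumulates reversed-rank weights
    # over all training samples of true_char.
    rank_sum = defaultdict(lambda: defaultdict(int))
    for true_char, candidates in results:
        for rank, cand in enumerate(candidates[:TOP_K]):
            # Assumed weighting: rank 0 -> 5, rank 1 -> 4, ..., rank 4 -> 1.
            rank_sum[true_char][cand] += TOP_K - rank

    confusion = defaultdict(set)
    for true_char, sums in rank_sum.items():
        for cand, total in sums.items():
            if total > threshold:
                # The input character joins the candidate's confusion set.
                confusion[cand].add(true_char)
    return dict(confusion)
```

Indexing the sets by output candidate, rather than by input character, matches how they are used later: at recognition time only the OCR1 output is known, so the lookup key has to be the candidate.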
In the recognition phase, each input character is first recognized by OCR1 to obtain its most likely output candidate. The characters in the confusion set corresponding to that candidate are then passed to a second OCR system, OCR2, which weighs the importance of their features to produce the final recognition result. Experimental results show that the correct character falls within the selected confusion-set subgroup 93.76% of the time, with an average subgroup size of 9.01 characters. The overall recognition rate of our system is 86.97%.
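The two-stage recognition flow can be sketched as below. The abstract does not say how OCR2 combines the feature weights, so this sketch assumes a weighted squared distance to one template vector per character; `ocr1`, `templates`, `weights`, and `recognize` are hypothetical names, and adding the OCR1 output itself to the candidate set is also an assumption.

```python
def recognize(features, ocr1, confusion_sets, templates, weights):
    """Two-stage recognition: OCR1 proposes, OCR2 re-ranks.

    features: feature vector of the input character (list of floats).
    ocr1: callable returning the most likely character for the features.
    confusion_sets: dict mapping an OCR1 output to its confusable characters.
    templates: dict mapping a character to a reference feature vector.
    weights: dict mapping an OCR1 output to per-feature importance weights.
    """
    first = ocr1(features)
    # Candidates are the characters confusable with the OCR1 output,
    # plus the output itself (assumption).
    candidates = set(confusion_sets.get(first, ())) | {first}
    # Fall back to uniform weights when none were trained for this output.
    w = weights.get(first, [1.0] * len(features))

    def weighted_distance(char):
        # Assumed OCR2 score: features with larger weights count more.
        return sum(
            wi * (x - t) ** 2
            for wi, x, t in zip(w, features, templates[char])
        )

    # Sort first so that ties are broken deterministically.
    return min(sorted(candidates), key=weighted_distance)
```

With this structure, OCR2 only has to separate a handful of similar characters (about nine on average, per the figures above) instead of all 5401 classes.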
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT840392066
http://hdl.handle.net/11536/60412
Appears in Collections: Thesis