Using Confusion Characters to Improve Character Recognition Rate

Title:	使用混淆文字來改善文字辨識率 Using Confusion Characters to Improve Character Recognition Rate
Authors:	林裕章 (Lin, Yeu-Chang); 李錫堅 (Lee, Hsi-Jian); Institute of Computer Science and Engineering
Keywords:	混淆文字;文字辨識率;confusion characters;character recognition rate
Issue Date:	1995
Abstract:	This thesis proposes a method that improves character recognition rate by analyzing confusion sets. The characters confusable with a given character are collected by taking its 100 training samples from the CCL/HCCR database and analyzing how those samples are recognized. In the training phase, each sample is first recognized by an OCR system, OCR1, which outputs its top five candidates; each candidate is assigned a reversed-rank weight according to its position in the output. After every training sample of all 5401 characters has been recognized, a confusion set is built for each character: if the reversed-rank sum of an output candidate exceeds a threshold, the input character is added to that candidate's confusion set. Similar characters in a confusion set differ noticeably in certain parts, and we select features from these distinct parts to tell them apart. When a confusion set is large, however, suitable features are hard to find, so we use crossing count features to cluster the characters of each confusion set into smaller subgroups, so that the most similar characters fall into the same subgroup. For each subgroup we then compute feature weights, which measure how important each feature is for distinguishing the similar characters. In the recognition phase, OCR1 produces the most likely output character for each input character, and a second OCR system, OCR2, makes the final decision among the characters in the corresponding confusion set, taking the feature weights into account. Experiments show that the correct character falls in the selected confusion-set subgroup 93.76% of the time, with an average subgroup size of 9.01 characters. The overall recognition rate of the system is 86.97%.
URI:	http://140.113.39.130/cdrfb3/record/nctu/#NT840392066 http://hdl.handle.net/11536/60412
Appears in Collections:	Thesis
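
The confusion-set construction described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the exact weighting scheme (first place scores 5, fifth place scores 1), the accumulation of weights over all samples of a character, the threshold value, and the names `reversed_rank_weight` and `build_confusion_sets` are all assumptions made for illustration. Only the use of the top five candidates, reversed-rank weights, and a threshold comes from the abstract.

```python
from collections import defaultdict

TOP_K = 5  # OCR1 outputs the top five candidates (stated in the abstract)


def reversed_rank_weight(rank, top_k=TOP_K):
    """Assumed scheme: 0-based rank 0 scores top_k, the last rank scores 1."""
    return top_k - rank


def build_confusion_sets(results, threshold):
    """Build confusion sets from OCR1 results on training samples.

    results   -- iterable of (true_char, candidates) pairs, one per sample,
                 where candidates is OCR1's ranked output (best first)
    threshold -- hypothetical cut-off on the accumulated reversed-rank sum

    Returns a dict mapping each output candidate to the set of input
    characters whose reversed-rank sum for that candidate exceeds threshold.
    """
    sums = defaultdict(int)  # (candidate, true_char) -> accumulated weight
    for true_char, candidates in results:
        for rank, cand in enumerate(candidates[:TOP_K]):
            sums[(cand, true_char)] += reversed_rank_weight(rank)

    confusion = defaultdict(set)
    for (cand, true_char), total in sums.items():
        if total > threshold:
            confusion[cand].add(true_char)
    return dict(confusion)


# Toy data (invented): two samples of one character, one sample of another.
samples = [
    ("土", ["土", "士", "工"]),
    ("土", ["士", "土", "上"]),
    ("士", ["士", "土", "干"]),
]
confusion_sets = build_confusion_sets(samples, threshold=4)
```

In the recognition phase, the top candidate from OCR1 would serve as the key into `confusion_sets`, and OCR2 would choose among the characters stored there; the crossing-count clustering into subgroups and the feature weighting are omitted from this sketch.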

APA	Lin, Y.-C., & Lee, H.-J. (1995). 使用混淆文字來改善文字辨識率 [Using confusion characters to improve character recognition rate]. http://hdl.handle.net/11536/60412
Bibtex	@misc{Lin1995, title={使用混淆文字來改善文字辨識率}, author={Lin, Yeu-Chang and Lee, Hsi-Jian}, year={1995}, url={https://ir.lib.nycu.edu.tw/handle/11536/60412}}