Performance Improvement Techniques for Chinese Character Recognition

Title:	Performance Improvement Techniques for Chinese Character Recognition (提升中文文字辨識效能的技術)
Authors:	Huang, Jen Tsan (黃仁贊); Lee, Hsi-Jian (李錫堅); Institute of Computer Science and Engineering (資訊科學與工程研究所)
Keywords:	Chinese; Recognition
Issue Date:	2000
Abstract:	In this thesis, we design a multi-font optical character recognition (OCR) system for Chinese text. The Chinese character set includes 5401 commonly used characters. Characters printed in various typefaces and in italic style significantly degrade recognition performance, and the aim of this thesis is to improve the performance of multi-font Chinese character recognition. The proposed system consists of four main modules: clustering analysis, detection and recognition of italic characters, identification of Ming font characters, and correction of recognition results. We modify the K-means algorithm to cluster our training samples. Identifying Ming font characters reduces the candidate set that must be searched in the recognition phase. In the recognition phase, we first estimate the slant angle of the text line. If the text line is in italic style, we straighten its characters to the upright (roman) orientation. We then determine the typeface of the text line. If the typeface is Ming font, we recognize the text line with the Ming font kernel; otherwise we use the kernel for the other fonts. In the detailed matching procedure, we modify the branch-and-bound method to speed up character recognition. In the post-processing phase, we use the typeface information to correct the recognition results and select the most promising candidate. For italic character detection, we tested 335 text lines and obtained an identification rate of 95.71%. For Ming font identification, we tested 1175 text lines and obtained an identification rate of 99.74%. For the correction of recognition results, we tested 400 text lines containing 2556 characters; the recognition rate improved from 98.27% to 99.02%.
URI:	http://140.113.39.130/cdrfb3/record/nctu/#NT890392058 http://hdl.handle.net/11536/66848
Appears in Collections:	Thesis
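
The abstract describes two speed-ups in the recognition phase: choosing a font-specific kernel so that fewer candidates are searched, and pruning the detailed matching with a modified branch-and-bound method. The record does not say how the thesis implements either step, so the following is only a minimal sketch under stated assumptions: each kernel is assumed to be a mapping from character label to a fixed-length feature vector, the distance is assumed to be squared Euclidean, and the pruning shown is the generic partial-distance bound, not the authors' modification. The names `Kernel`, `nearest_template`, and `recognize` are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical representation: character label -> template feature vector.
Kernel = Dict[str, Sequence[float]]


def nearest_template(
    feature: Sequence[float], kernel: Kernel
) -> Tuple[Optional[str], float]:
    """Return (label, squared distance) of the closest template in the kernel."""
    best_label: Optional[str] = None
    best_dist = float("inf")
    for label, template in kernel.items():
        partial = 0.0
        for x, t in zip(feature, template):
            partial += (x - t) ** 2
            # Bound: squared terms can only increase the sum, so once it reaches
            # the best distance found so far this template cannot win.
            if partial >= best_dist:
                break
        else:
            # The loop finished without pruning, so this template is the new best.
            best_label, best_dist = label, partial
    return best_label, best_dist


def recognize(
    feature: Sequence[float],
    is_ming: bool,
    ming_kernel: Kernel,
    other_kernel: Kernel,
) -> Tuple[Optional[str], float]:
    """Search only the kernel that matches the identified typeface."""
    kernel = ming_kernel if is_ming else other_kernel
    return nearest_template(feature, kernel)


if __name__ == "__main__":
    # Toy three-dimensional features, purely illustrative.
    ming = {"a": [0.0, 0.0, 0.0], "b": [1.0, 1.0, 1.0], "c": [5.0, 5.0, 5.0]}
    other = {"d": [9.0, 9.0, 9.0]}
    print(recognize([0.9, 1.0, 1.1], True, ming, other))
```

In this sketch the typeface decision halves nothing by itself; the saving comes from searching one kernel instead of the union of both, and the early `break` skips the remaining feature dimensions of any template that is already worse than the current best.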