Recognition of Chinese Characters with Overlapped Lines in Known Forms (已知表格中與線重疊之中文字辨識)

Title:	已知表格中與線重疊之中文字辨識 Recognition of Chinese Characters with Overlapped Lines in Known Forms
Authors:	高國忠 Kuo-Chung Kao 李錫堅 Hsi-Jian Lee 資訊科學與工程研究所 (Institute of Computer Science and Engineering)
Keywords:	受干擾的字; 中文辨識; 文字重建; interfered-characters; Chinese recognition; character reconstruction
Issue Date:	1998
Abstract:	This thesis proposes two methods for recognizing characters that overlap with lines in known forms. Such interfered characters cannot be extracted cleanly from the text lines, and they cause recognition errors in conventional optical character recognition (OCR) engines. The first method removes the form lines and then reconstructs the characters. Removing a line breaks the characters that cross it and leaves the strokes split into two groups of stroke-ends. The collinearity and positions of the stroke-ends are used to find the correct connecting correspondences, and the gaps between the broken strokes are filled to restore the original characters as closely as possible. The second method modifies the OCR model so that it can recognize interfered characters directly. Printed characters with form lines are uniformly segmented according to projection profiles. The locations of the form lines within the interfered characters are found, and two kinds of features are computed for the form lines: contour-direction features (CDFs) and crossing-count features (CCFs). The trained features of the OCR engine are then modified by the form-line features so that they match the interfered characters. In the first experiment, 938 handwritten characters with form lines were tested; the recognition rate was 23.7% and rose to 78.3% after applying the first method. In the second experiment, 695 printed characters overlapping with form lines were tested; the recognition rate was 64.3% and rose to 77.3% after applying the second method.
URI:	http://140.113.39.130/cdrfb3/record/nctu/#NT870392086 http://hdl.handle.net/11536/64111
Appears in Collections:	Thesis
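
The abstract's first method pairs stroke-ends across a removed form line using their collinearity and position. The record does not give the thesis's actual procedure, so the following is only a minimal sketch of that idea under stated assumptions: the names `StrokeEnd`, `pair_cost`, and `match_stroke_ends`, the cost function, the greedy matching, and the 40-degree threshold are all hypothetical and are not taken from the thesis. Position enters the cost only through the direction of the segment joining two stroke-ends.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StrokeEnd:
    """Hypothetical stroke-end record; not the thesis's data structure."""
    x: float      # column where the stroke meets the removed line
    y: float      # row of the stroke end (image rows grow downward)
    angle: float  # stroke orientation in radians, measured from the +x axis


def angle_diff(a, b):
    """Smallest difference between two undirected line orientations."""
    d = abs(a - b) % math.pi
    return min(d, math.pi - d)


def pair_cost(top, bottom):
    """Assumed cost: lower means the two ends are closer to collinear."""
    link = math.atan2(bottom.y - top.y, bottom.x - top.x)
    # Both strokes should point along the segment that would join them.
    return angle_diff(top.angle, link) + angle_diff(bottom.angle, link)


def match_stroke_ends(tops, bottoms, max_cost=math.radians(40)):
    """Greedy one-to-one matching of stroke ends across the removed line.

    Returns sorted (top_index, bottom_index) pairs. The threshold
    max_cost is an illustrative value, not one reported in the thesis.
    """
    candidates = sorted(
        (pair_cost(t, b), i, j)
        for i, t in enumerate(tops)
        for j, b in enumerate(bottoms)
    )
    used_top, used_bottom, pairs = set(), set(), []
    for cost, i, j in candidates:
        if cost > max_cost:
            break  # remaining candidates are even less collinear
        if i in used_top or j in used_bottom:
            continue
        used_top.add(i)
        used_bottom.add(j)
        pairs.append((i, j))
    return sorted(pairs)
```

For example, with a vertical stroke and a 45-degree diagonal stroke both cut by the same removed line, `match_stroke_ends` would be called with the stroke-ends above the line as `tops` and those below as `bottoms`; each matched pair marks a gap that a later step could fill by drawing a segment between the two ends.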