Title: A Study on Binarization, Segmentation and Handwritten Character Recognition in Document Images
Authors: Tseng, Yi-Hong; Lee, Hsi-Jian
Institute of Computer Science and Engineering
Keywords: optical character recognition; binarization; candidate-cluster selection module; modified branch-and-bound detail-matching module; probabilistic Viterbi algorithm; Hough transform
Issue Date: 1998
Abstract: In this thesis, we propose methods for solving several problems in an automatic document processing system. Since OCR engines are usually trained on large numbers of character samples, which are isolated binary character images without serious noise, the problems explored in this thesis include document image binarization, speedup of the OCR engine, segmentation of handwritten text-lines, and recognition of interfered characters.

A document generally consists of several blocks with different background intensities. This thesis first proposes a block-based method for binarizing document images. We apply a two-layer block extraction method to divide a document into rectangular blocks, each with its own background intensity. Connected-component detection and biased run-length projection are used in the first and second layers, respectively. For each block, we analyze the pixel intensity distribution to determine the range of background intensities; this range is then used to classify the block's pixels as foreground or background. Pixels outside all extracted blocks are binarized with a global threshold computed from the entire document image. The proposed approach runs faster than local adaptive methods, and an evaluation based on recognition accuracy shows that its binarization results are better than those of global thresholding methods.

This thesis also describes the implementation of our OCR engine for handwritten Chinese characters. Two statistical features, crossing counts and contour-direction counts, are used in the pre-classification and detail-matching modules, respectively. To reduce processing time without losing recognition accuracy, the system applies a candidate-cluster selection module and a modified branch-and-bound detail-matching module to speed up the OCR engine. Efficiency experiments were performed on a Pentium II-233 PC with 5,401×100 Chinese characters selected from the handwritten-character database CCL/HCCR of the Computer and Communications Research Laboratories (CCL), ITRI. The average recognition rate of the engine is 89.95%, and about 8-10 characters are recognized per second.

Since OCR engines only recognize single characters extracted from text-lines, this thesis proposes a recognition-based character segmentation method. A probabilistic Viterbi algorithm first detects all possible nonlinear segmentation routes in a given text-line. Several properties of Chinese characters are then used to remove irrational or redundant routes. The remaining candidate routes are used to construct a segmentation graph: each node represents a candidate route, and an edge connects two nodes whose distance is below a threshold. The cost of each edge is a function of character recognition distance, character squareness, and internal gaps within characters. Finally, the optimal segmentation routes are the nodes on the shortest path found in the segmentation graph.
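To make the block-based binarization idea above concrete, here is a minimal sketch in Python/NumPy. The function names, the histogram-peak background estimate, and the `band` half-width are illustrative assumptions, not the thesis implementation; the two-layer block extraction that produces the rectangles is elided.

```python
import numpy as np

def binarize_block(block, band=30):
    """Binarize one gray-level block: pixels whose intensity falls inside
    the estimated background band become background (255); all others
    become foreground (0). `band` is an assumed half-width around the
    histogram peak, standing in for the background-range analysis."""
    hist = np.bincount(block.ravel(), minlength=256)
    bg = int(hist.argmax())                 # dominant intensity = background
    mask = np.abs(block.astype(int) - bg) <= band
    return np.where(mask, 255, 0).astype(np.uint8)

def binarize_document(gray, blocks, global_thresh=128):
    """`blocks` is a list of (top, left, bottom, right) rectangles assumed
    to come from a separate block-extraction step. Pixels outside every
    block are binarized with a single global threshold, as described."""
    out = np.where(gray > global_thresh, 255, 0).astype(np.uint8)
    for (t, l, b, r) in blocks:
        out[t:b, l:r] = binarize_block(gray[t:b, l:r])
    return out
```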
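The crossing-count feature used for pre-classification can be sketched as follows; the number of scan lines and the transition-counting convention are assumptions for illustration, not the thesis's exact parameters.

```python
import numpy as np

def crossing_counts(binary, n_scan=8):
    """Crossing-count feature: the number of background-to-ink transitions
    along evenly spaced horizontal and vertical scan lines of a binary
    character image (1 = ink, 0 = background)."""
    h, w = binary.shape
    rows = np.linspace(0, h - 1, n_scan).astype(int)
    cols = np.linspace(0, w - 1, n_scan).astype(int)
    feat = []
    for r in rows:                          # horizontal scan lines
        line = binary[r, :]
        feat.append(int(np.sum((line[1:] == 1) & (line[:-1] == 0))))
    for c in cols:                          # vertical scan lines
        line = binary[:, c]
        feat.append(int(np.sum((line[1:] == 1) & (line[:-1] == 0))))
    return np.array(feat, dtype=float)
```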
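The final step of the segmentation method reduces to a shortest-path search over the segmentation graph. A minimal sketch using Dijkstra's algorithm follows; the edge costs are assumed to be precomputed elsewhere from recognition distance, squareness, and internal gaps, and the graph in the usage example is a toy.

```python
import heapq

def shortest_segmentation(edges, source, sink):
    """Dijkstra over a segmentation graph. `edges` maps a node to a list
    of (neighbor, cost) pairs; nodes stand for candidate segmentation
    routes, and each edge cost is assumed to combine recognition distance,
    character squareness, and internal gaps."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == sink:
            break
        if d > dist.get(u, float("inf")):
            continue                        # stale heap entry
        for v, cost in edges.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [], sink                   # recover the optimal routes
    while node != source:
        path.append(node)
        node = prev[node]
    path.append(source)
    return path[::-1]

# Toy usage: node 0 = start of the text-line, node 3 = end.
edges = {0: [(1, 0.4), (2, 0.9)], 1: [(3, 0.5)], 2: [(3, 0.1)]}
print(shortest_segmentation(edges, 0, 3))   # -> [0, 1, 3]
```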
In form documents, filled-in data are often written over form lines. Even when these interfered characters can be extracted, the lines running through them destroy character features and often cause recognition errors. This thesis finally presents a method for recognizing interfered characters by interfering-line removal and feature weight adjustment. We first locate the interfering lines with the Hough transform and projection profiles. Black-run detection and classification are then applied to distinguish interfering pixels from character pixels; removing all interfering pixels yields clean characters. Because some of these clean characters are still recognized unsatisfactorily, the OCR engine is refined by lowering the matching weights of features extracted from the regions the interfering lines pass through. By reducing the influence of interfering lines, the recognition accuracy of interfered characters approaches that of characters without interference.
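A rough sketch of the two ideas in this step, assuming a single horizontal form line. The row-projection detector below replaces the Hough transform for brevity, the column-wise crossing test is only a crude stand-in for the thesis's black-run detection and classification, and the weighting function assumes per-feature weights prepared by the caller.

```python
import numpy as np

def remove_horizontal_line(binary):
    """binary: 2-D array, 1 = ink, 0 = background. Locates the strongest
    horizontal line by row projection, then erases line pixels column by
    column unless a character stroke crosses the line band there."""
    proj = binary.sum(axis=1)
    row = int(proj.argmax())                       # row with the most ink
    top = max(0, row - 1)
    bottom = min(binary.shape[0] - 1, row + 1)     # 3-pixel line band
    cleaned = binary.copy()
    for col in range(binary.shape[1]):
        above = top - 1 >= 0 and binary[top - 1, col]
        below = bottom + 1 < binary.shape[0] and binary[bottom + 1, col]
        if not (above or below):                   # no stroke crossing here
            cleaned[top:bottom + 1, col] = 0       # erase pure interference
    return cleaned, (top, bottom)

def weighted_matching_distance(feature, template, weights):
    """Squared matching distance in which `weights` are assumed to be
    lowered for features extracted from the regions the interfering line
    passed through, reducing their influence on recognition."""
    return float(np.sum(weights * (feature - template) ** 2))
```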
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT870392101
http://hdl.handle.net/11536/64129
Appears in Collections: Thesis