Title: 中文文件處理系統中使用之多核心辨識方法與簡化型語言模式
Multi-kernel Chinese Characters Recognition and A Simplified Language Model Used in General Document Processing Systems
Authors: 趙善隆
Zhao, San-Lung
李錫堅
Lee, Hsi-Jian
Institute of Computer Science and Engineering
Keywords: character recognition; dual kernel; character segmentation; skew correction; image processing; language model; OCR; multi-kernel; segmentation; deskew
Issue Date: 1999
Abstract: This thesis proposes a general Chinese document processing system consisting of three modules: preprocessing, recognition, and postprocessing. Input document images may be slightly skewed, and even a small skew angle can degrade character segmentation and recognition. The preprocessing module therefore detects the skew angle and applies a fast, modified rotation transform to deskew the document image. Because the system needs character-level and sentence-level information, each document image is segmented into text blocks, text lines, and character images; punctuation marks are then detected among the character images and used to group them into sentences.
The recognition module combines two recognition kernels. Kernel 1 uses contour directional features and crossing count features; kernel 2 uses Oka's cellular features and peripheral background area features. The weights of the features and of the kernels depend on the relative stroke width of each character image, which serves as a measure of image quality. The kernels are trained on a database of character images segmented from document images. To make training more robust and raise the recognition rate, unstable features, rather than whole unstable character images, are detected and removed during training.
The postprocessing module uses a simplified language model with four parts: setting the bounds of candidate words, establishing the lexicon matching order, fast lexicon matching, and selecting the most confident word. This model greatly speeds up postprocessing.
The system was tested on more than 40 document images, and the experimental results show that it is effective and fast.
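The sentence construction step mentioned in the abstract (detect punctuation marks among the segmented character images, then group the remaining characters into sentences) might look roughly like the sketch below. It is only an illustration under stated assumptions: the `CharImage` type, its `is_punctuation` flag, and the rule that every detected punctuation mark closes the current sentence are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CharImage:
    """Hypothetical stand-in for one segmented character image."""
    label: str            # placeholder for the image data or a debug label
    is_punctuation: bool  # assumed output of an upstream punctuation detector


def group_into_sentences(chars: List[CharImage]) -> List[List[CharImage]]:
    """Split a reading-order list of character images at punctuation marks.

    Assumption: any detected punctuation mark ends the current sentence,
    and the punctuation image itself is not kept in the sentence.
    """
    sentences: List[List[CharImage]] = []
    current: List[CharImage] = []
    for ch in chars:
        if ch.is_punctuation:
            # Close the sentence collected so far, skipping empty runs
            # such as two punctuation marks in a row.
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(ch)
    # Keep trailing characters that were not followed by punctuation.
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    line = [
        CharImage("中", False), CharImage("文", False), CharImage(",", True),
        CharImage("辨", False), CharImage("識", False), CharImage("。", True),
    ]
    for sentence in group_into_sentences(line):
        print("".join(ch.label for ch in sentence))
```

Working on a boolean flag rather than on recognized text reflects the order given in the abstract, where punctuation is detected on the character images before the sentences are passed on for recognition and postprocessing.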
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT880392070
http://hdl.handle.net/11536/65471
Appears in Collections: Thesis