Recognition of Chinese Business Cards

Title:	中文名片之辨識 Recognition of Chinese Business Cards
Authors:	邱耀輝 Chiou, Yaw-Huei; 李錫堅 Lee, Hsi-Jian; Institute of Computer Science and Engineering
Keywords:	business card; color; foreground separation; character extraction; character recognition; character grouping
Date issued:	1995
Abstract:	Business cards carry many kinds of information, such as names, addresses, and telephone numbers. To use this information effectively, it should be extracted from the cards automatically and stored in a database. The goal of this thesis is to extract and recognize characters from color business cards. Business cards vary greatly in style and layout: some contain marks or lines, and some have colored backgrounds, all of which make recognition difficult. To separate the foreground from the background, we first assign every pixel to one of eight color types: black, white, red, green, blue, yellow, cyan, and magenta. We then use this color information to compute a dynamic threshold, which is used to extract the foreground. Next, we extract the characters with four main modules: connected component extraction; local thresholding; mark, line, and noise deletion; and character grouping. Connected component extraction finds all blocks that are character candidates, and local thresholding improves the image quality. Marks, lines, and noise are then deleted from the extracted components using size information collected from training business cards. Character grouping clusters related characters and uses the groups to correct the character blocks. Finally, the extracted characters are sent to statistics-based Chinese and English character recognition systems. The top ten recognition candidates are shown in a user-friendly Windows-based interface, where the user can select the correct candidate with the mouse to replace a wrong result, and the final results are stored in the database. We tested 30 business cards containing Chinese characters, English characters, numerals, and punctuation marks. The extraction rate and accuracy rate of our system are 96.97% and 95.43%, respectively. The recognition rate for Chinese characters is 88.78%, and the average recognition rate for English characters, numerals, and punctuation marks reaches 97.58%.
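The first step in the abstract, assigning every pixel to one of eight color types, can be illustrated with a minimal sketch. The abstract does not give the assignment rule, so the rule here is an assumption: each RGB channel is binarized at a fixed cut of 128, which picks the nearest corner of the RGB cube. The names `COLOR_TYPES`, `classify_pixel`, `color_type_histogram`, and the `cut` parameter are hypothetical and do not come from the thesis. The dynamic threshold that the thesis derives from the color information is not reproduced.

```python
from collections import Counter

# The eight color types named in the abstract, ordered so that the list index
# equals R*4 + G*2 + B, where each of R, G, B is 1 if that channel is "on".
COLOR_TYPES = ["black", "blue", "green", "cyan",
               "red", "magenta", "yellow", "white"]


def classify_pixel(r, g, b, cut=128):
    """Return the color type of one 8-bit RGB pixel.

    Assumption (not from the thesis): a channel counts as "on" when its
    value is >= cut, i.e. the pixel is mapped to the nearest RGB-cube corner.
    """
    index = (r >= cut) * 4 + (g >= cut) * 2 + (b >= cut)
    return COLOR_TYPES[index]


def color_type_histogram(pixels, cut=128):
    """Count how many pixels fall into each of the eight color types."""
    counts = Counter(classify_pixel(r, g, b, cut) for r, g, b in pixels)
    return {name: counts[name] for name in COLOR_TYPES}


# Toy "card": mostly near-white background, some dark text, one red mark.
card = [(250, 250, 245)] * 6 + [(20, 20, 30)] * 3 + [(200, 30, 40)]
print(color_type_histogram(card))
```

A histogram like this is one plausible input for choosing a per-card threshold, since it shows which color types dominate the card. How the thesis actually turns the color information into a threshold is not stated in the abstract.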
URI:	http://140.113.39.130/cdrfb3/record/nctu/#NT840392039 http://hdl.handle.net/11536/60383
Appears in Collections:	Thesis