Title: | 在中文OCR系統中偵測並且調整斜體文字 Detection and Orientation of Italic Text in Chinese OCR System |
Authors: | 黃士晉 Shih-Jin Huang; 劉振漢 Jenn-Hann Liou; Institute of Computer Science and Engineering |
Keywords: | italic text; Chinese optical character recognition; COCR |
Date Issued: | 1998 |
Abstract: | Chinese OCR (Chinese optical character recognition) is a well-studied subject. Both academic research and industry have produced Chinese OCR software with recognition rates of 95% or better.
Some key issues must still be solved to make Chinese OCR systems more complete. Almost no current Chinese OCR software can handle italic text. Italic text does not appear in traditional Chinese books, but it is often seen in scientific articles, name cards, and similar material. This thesis studies that problem. We do not try to "recognize" the characters, because recognition has been studied extensively and is already handled well. Instead, we try to "find the italic text" and then "reorient" it, that is, transform it back to non-italic text.
The results are good, but when the last character of an italic string touches the following non-italic character, which is not rare, we usually cannot separate the two accurately and the result is poor. |
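The abstract does not say how the italic text is found or reoriented, so the following is only an illustrative sketch of one common approach, not the thesis's method. It assumes a binary image stored as a list of rows (1 = ink), tries a few candidate shear factors, keeps the one that makes the vertical projection profile sharpest, and returns the sheared image. The function names, the candidate range, and the sharpness score are all assumptions made for this example.

```python
import math


def shear(image, slant):
    """Shift each row horizontally by slant * (distance from the bottom row).

    `image` is a list of equal-length rows of 0/1 values (1 = ink).
    The output is padded on both sides so shifted pixels never fall outside.
    """
    height = len(image)
    width = len(image[0])
    pad = int(math.ceil(abs(slant) * (height - 1)))
    out = [[0] * (width + 2 * pad) for _ in range(height)]
    for y, row in enumerate(image):
        shift = int(round(slant * (height - 1 - y)))
        for x, pixel in enumerate(row):
            if pixel:
                out[y][x + pad + shift] = 1
    return out


def column_sharpness(image):
    """Sum of squared column ink counts; larger when strokes line up vertically."""
    return sum(sum(column) ** 2 for column in zip(*image))


def deslant(image, candidates=None):
    """Return (best_shear, sheared_image) for the sharpest vertical projection.

    The default candidate range is an assumption chosen for illustration.
    """
    if candidates is None:
        candidates = [i / 10 for i in range(-5, 6)]
    best = max(candidates, key=lambda s: column_sharpness(shear(image, s)))
    return best, shear(image, best)


# A 5x5 stroke leaning to the right (its top is further right than its bottom).
stroke = [[1 if x == 4 - y else 0 for x in range(5)] for y in range(5)]
best_shear, upright = deslant(stroke, [-1.0, 0.0, 1.0])
```

In a sketch like this, a text region whose best shear factor is clearly non-zero could be flagged as italic, and the sheared image would then be passed to an ordinary recognizer. The touching-character case mentioned in the abstract is not addressed here.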
URI: | http://140.113.39.130/cdrfb3/record/nctu/#NT870392084 http://hdl.handle.net/11536/64109 |
Appears in Collections: | Thesis |