Title: | Detection and Orientation of Italic Text in Chinese OCR System |
Authors: | Shih-Jin Huang; Jenn-Hann Liou; Institute of Computer Science and Engineering |
Keywords: | italic text; Chinese optical character recognition; COCR |
Issue Date: | 1998 |
Abstract: | Chinese OCR (Chinese Optical Character Recognition) is a well-studied subject. Both research institutes and manufacturers have been able to provide software with a recognition rate of 95% or better. Still, some key issues remain to be solved before a Chinese OCR system can be considered complete. Almost no current Chinese OCR software can handle italic text. Italic text does not appear in traditional Chinese books, but it is often seen in scientific articles, name cards, and the like. This thesis studies that problem. We do not try to "recognize" the characters, because recognition has been studied extensively and is already handled well. Instead, we try to "find the italic text" and then "reorient" it (i.e., transform it back to non-italic). We obtain good results in general. However, when the last character of an italic string touches the following non-italic character (which is not rare), we usually cannot separate the two, and the result is poor. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#NT870392084 http://hdl.handle.net/11536/64109 |
Appears in Collections: | Thesis |