Title: Design of a Mathematical Expression Recognition System (一個數學運算式辨識系統的設計)
Authors: 王俊勝
Jiumn-Shine Wang
李錫堅
Hsi-Jian Lee
Institute of Computer Science and Engineering
Keywords: page segmentation; feature extraction; expression extraction
Issue date: 1993
Abstract: A scientific document usually consists of text and mathematical expressions, so extracting and recognizing the mathematical expressions in a document is important. In this thesis, we present a system that segments and recognizes text and mathematical expressions in a document. The system is divided into six stages: 1) page segmentation and labeling, 2) character segmentation, 3) feature extraction, 4) matching, 5) expression formation, and 6) error correction and expression extraction. In the page segmentation and labeling stage, we extract all text lines in a document and label them with two labels, TEXT and W-EXP. In the character segmentation stage, we separate all symbols in a text line. In the feature extraction stage, we extract two features from each symbol: its aspect ratio and a direction feature vector. For special symbols whose aspect ratios are not fixed, we modify the direction feature vector to obtain a modified direction feature vector. In the matching stage, we propose a two-stage matching algorithm to recognize symbols. In the expression formation stage, we build a symbol relation tree for each text line to represent the relationships among its symbols. In the error correction and expression extraction stage, a text line is decomposed into three types of primitive tokens: operands, operators, and separators. These primitive tokens are used in both error correction and expression extraction. We use heuristic rules to correct the recognition errors that occur in a text line. After error correction, we extract all mathematical expressions according to a set of basic expression forms. Our database currently contains 190 symbols. Three document pages were scanned for testing, and the average recognition rate is about 96.16%.
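The token decomposition in the final stage can be illustrated with a minimal sketch. It assumes already-recognized symbols arrive as strings. The `OPERATORS` and `SEPARATORS` sets, the function names, and the rule "a run of non-separator symbols containing at least one operator is an expression" are all hypothetical stand-ins, since the abstract does not list the thesis's actual symbol classes or basic expression forms.

```python
from enum import Enum


class TokenType(Enum):
    OPERAND = "operand"
    OPERATOR = "operator"
    SEPARATOR = "separator"


# Hypothetical symbol sets; the abstract does not specify the real ones.
OPERATORS = {"+", "-", "*", "/", "=", "<", ">"}
SEPARATORS = {",", ".", ";", ":", " "}


def classify(symbol: str) -> TokenType:
    """Map one recognized symbol to a primitive token type."""
    if symbol in OPERATORS:
        return TokenType.OPERATOR
    if symbol in SEPARATORS:
        return TokenType.SEPARATOR
    return TokenType.OPERAND


def extract_expressions(symbols: list[str]) -> list[list[str]]:
    """Split a text line at separators and keep runs that contain an operator.

    This is an assumed, simplified extraction rule, not the thesis's method.
    """
    expressions: list[list[str]] = []
    run: list[str] = []
    # A trailing space acts as a sentinel separator that flushes the last run.
    for symbol in symbols + [" "]:
        if classify(symbol) is TokenType.SEPARATOR:
            if any(classify(s) is TokenType.OPERATOR for s in run):
                expressions.append(run)
            run = []
        else:
            run.append(symbol)
    return expressions


line = ["let", " ", "x", "=", "a", "+", "b", ",", " ", "then"]
print(extract_expressions(line))
```

Under these assumptions, the words `let` and `then` are dropped because their runs contain no operator, while the run between the separators is kept as one expression.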
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT820392019
http://hdl.handle.net/11536/57822
Appears in Collections: Thesis