Title: Named Entity Extraction in Biomedical Domain (生物醫學領域專有名詞萃取)
Authors: Jian-Hsin Chen (陳建行)
Tyne Liang (梁婷)
Institute of Computer Science and Engineering
Keywords: Biomedical Literature; Named Entity Identification; Named Entity Classification; Hidden Markov Models; Ellipsis
Issue Date: 2002
Abstract: This thesis proposes a hybrid system for automatic named entity identification and classification in biomedical literature. The system is intended to serve as a front end for information extraction systems that extract relations between biomedical objects. Its core is based on Hidden Markov Models (HMMs). We extract internal, external, and global features from biomedical literature and use them as the feature values that represent each word. With these features and the HMM-based extractor, we recognize named entities in raw text. Three kinds of HMM classifiers are built for evaluation and comparison. In addition to this statistical approach, we use heuristic rules to uncover named entities hidden in coordinated clauses with ellipsis and to expand them. Experimental results demonstrate the feasibility of the proposed approach. On 165 test sentences containing ellipsis patterns, expansion of coordinated clauses with ellipsis achieves 92% recall and 46% precision. On 1,685 test sentences, the system obtains 72% recall and 66% precision for identifying named entity boundaries, and 63% recall and 57% precision for classifying entities as Protein, DNA/RNA, Source, or Other biomedical entities.
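The abstract names HMMs over word features as the core of the extractor but gives no model details. The following is only a minimal sketch of how HMM decoding for entity boundary tagging can work in general, not the thesis's actual model. The B/I/O tag set, the `word_feature` function, the suffix list, and every probability value are assumptions invented for illustration.

```python
import math

# Hypothetical tag set: B = begins an entity, I = inside one, O = outside.
STATES = ["B", "I", "O"]

# All probabilities below are invented for illustration only.
START = {"B": 0.3, "I": 0.0, "O": 0.7}
TRANS = {
    "B": {"B": 0.1, "I": 0.5, "O": 0.4},
    "I": {"B": 0.1, "I": 0.4, "O": 0.5},
    "O": {"B": 0.3, "I": 0.0, "O": 0.7},
}
# Emissions are over a coarse word feature rather than the word itself.
EMIT = {
    "B": {"caps_digit": 0.6, "suffix": 0.3, "plain": 0.1},
    "I": {"caps_digit": 0.3, "suffix": 0.5, "plain": 0.2},
    "O": {"caps_digit": 0.05, "suffix": 0.05, "plain": 0.9},
}


def word_feature(token):
    """Map a token to a hypothetical orthographic (internal) feature."""
    if any(c.isdigit() for c in token) or any(c.isupper() for c in token[1:]):
        return "caps_digit"
    if token.lower().endswith(("ase", "in", "gene")):
        return "suffix"
    return "plain"


def log(p):
    """Log probability, mapping zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(tokens):
    """Return the most probable tag sequence for the tokens."""
    if not tokens:
        return []
    obs = [word_feature(t) for t in tokens]
    # Each column maps a state to (best log score, previous state).
    columns = [{s: (log(START[s]) + log(EMIT[s][obs[0]]), None) for s in STATES}]
    for o in obs[1:]:
        prev = columns[-1]
        column = {}
        for s in STATES:
            score, back = max((prev[p][0] + log(TRANS[p][s]), p) for p in STATES)
            column[s] = (score + log(EMIT[s][o]), back)
        columns.append(column)
    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))
```

With these invented numbers, `viterbi(["the", "NF-kappaB", "protein", "binds"])` returns `["O", "B", "I", "O"]`, marking "NF-kappaB protein" as one entity span. A real system would estimate the probabilities from annotated text and would also need the external and global features that the abstract mentions.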
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT910394105
http://hdl.handle.net/11536/70272
Appears in Collections: Thesis