Title: | Automatic Protein Entities Recognition from PubMed Corpus (生物語料中蛋白質名稱之自動辨識) |
Authors: | Ping-Ke Shih (施並格); Tyne Liang (梁婷); Institute of Computer Science and Engineering |
Keywords: | Named Entity Extraction; Biomedical; Hidden Markov Model |
Issue Date: | 2003 |
Abstract: | Named entity recognition (NER) is a basic and essential step in automating the construction of domain knowledge bases, and it has recently been widely applied to extracting biomedical entities. NER methods fall into two families, rule-based and statistical, and this thesis examines both for recognizing protein names in biomedical text. The rule-based approach relies on core terms, function terms, predefined terms, and part-of-speech tags to identify protein names, and then applies six rules to boost performance. On the GENIA and SwissProt Reference corpora it achieves F-scores of 52% and 51%, respectively. The statistical approach is built on a concise hidden Markov model, with back-off probability models to overcome data sparseness. It uses internal, external, and global features, together with the output of the rule-based approach, and achieves an F-score of 77% on both corpora. In addition, we use heuristic rules to recover elided terms in coordination variants and expand them into full named entities; this term variant resolution system achieves 89% recall and 69% precision. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT009123535 http://hdl.handle.net/11536/52902 |
Appears in Collections: | Thesis |
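The abstract reports F-scores for the two recognizers but only recall and precision for the term variant resolution system. As a rough aid for comparing the three figures, the sketch below computes an F-score from precision and recall. It assumes the balanced F1 measure (the harmonic mean); the record does not state which F-score variant the thesis uses, and the function name `f_score` is introduced here for illustration only.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1).

    Assumption: the thesis record does not say which F-score variant it
    reports; beta = 1 (balanced F1) is used here as the common default.
    """
    if precision <= 0.0 and recall <= 0.0:
        return 0.0  # avoid division by zero when both inputs are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures quoted in the abstract for term variant resolution.
variant_f1 = f_score(precision=0.69, recall=0.89)
print(f"{variant_f1:.3f}")  # prints 0.777
```

Under this assumption, 89% recall and 69% precision correspond to an F1 of about 0.777. This is a derived value, not a result reported in the abstract.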