Title: 生物语料中蛋白质名称之自动辨识
Automatic Protein Entities Recognition from PubMed Corpus
Authors: 施并格
Ping-Ke Shih
梁婷
Tyne Liang
Institute of Computer Science and Engineering
Keywords: Named Entity Extraction; Biomedical; Hidden Markov Model
Issue Date: 2003
Abstract: Named Entity Recognition (NER) is a basic and essential task in automating the construction of domain knowledge bases, and it has recently been widely applied to biomedical entity extraction. NER methods fall into two groups, rule-based and statistical, and this thesis examines how each performs on automatic protein entity recognition in the biomedical domain. The rule-based approach relies on core terms, function terms, predefined terms, and Part-of-Speech tags to identify protein names, and then applies six rules to boost performance. In experiments on the GENIA and SwissProt Reference corpora, the rule-based system yields F-scores of 52% and 51%, respectively. The statistical approach is based on a concise Hidden Markov Model, with back-off probability models to overcome the data sparseness problem. It uses extracted internal, external, and global features, together with the output of the rule-based approach, to identify protein entities. On the same GENIA and SwissProt Reference corpora, the statistical system yields a 77% F-score on both. In addition, we use heuristic rules to uncover elided terms hidden in coordination variants and expand them into full named entities. The term variant resolution system yields 89% recall and 69% precision.
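The abstract reports F-scores for the two recognizers but only recall and precision for the term variant resolution step. As a reading aid, the sketch below shows how an F-score combines the two, assuming the balanced F1 measure (the abstract does not say which weighting the thesis uses). The function name `f_score` and the `beta` parameter are illustrative and are not taken from the thesis.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    # Define the score as 0 when either input is 0, avoiding division by zero.
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Term variant resolution figures quoted in the abstract.
# Assumption: the balanced F1 is the measure of interest.
variant_f1 = f_score(precision=0.69, recall=0.89)
print(f"{variant_f1:.3f}")
```

Under that assumption, the reported 89% recall and 69% precision correspond to an F1 of roughly 0.78. This figure is derived here for orientation only and is not a result stated in the record.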
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009123535
http://hdl.handle.net/11536/52902
Appears in Collections: Thesis


Files in This Item:

  1. 353501.pdf
