Title: | 生物文獻中同指涉問題處理之研究 (Coreference Resolution in Biomedical Literature) |
Authors: | 林裕祥; 梁婷; Institute of Computer Science and Engineering |
Keywords: | coreference; anaphora; pronominal; sortal; abbreviation |
Issue date: | 2004 |
Abstract: | Coreference resolution involves both anaphora resolution and abbreviation linkage. To handle abbreviations, we use a rule-based method comprising seven rules, with the help of an NP-chunker, to identify each abbreviation and its long form. Our abbreviation resolution achieves 97% precision and 88% recall. We also address pronominal and sortal anaphora, which are common in biomedical texts. This resolution employs the UMLS ontology and SA/AO (subject-action/action-object) patterns mined from a biomedical corpus. Sortal anaphora involving unknown words is tackled using headwords collected from UMLS and patterns mined from PubMed. The final set of antecedents is decided by a salience grading mechanism, tuned by a genetic algorithm that selects the best input features. Compared with previous approaches on the same MEDLINE abstracts, the presented resolution achieves a 92% F-score on pronominal anaphora and a 78% F-score on sortal anaphora. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT009123534 http://hdl.handle.net/11536/52891 |
Appears in collections: | Theses |