Title: | Chinese Anaphora Resolution Based on Weight Learning and Knowledge Acquisition |
Authors: | 吳典松 Wu, Dian-Song 梁婷 Liang, Tyne Institute of Computer Science and Engineering |
Keywords: | Anaphora Resolution;Feature Weight Learning;Knowledge Acquisition;Web Mining |
Issue Date: | 2010 |
Abstract: | Anaphora is a common linguistic phenomenon used to avoid repeating the same expression within a discourse. Anaphora resolution is the process of identifying the antecedent of an anaphor in context. It plays an essential role in many natural language processing applications, such as machine translation, document summarization, and information extraction.
Previous anaphora resolution methods have relied on syntactic rules and on semantic or pragmatic clues to identify the antecedent; more recent work has focused on statistical and classification-based approaches. However, rule-based approaches usually select the antecedent with a salience score whose feature weights are assigned manually, so errors can arise from intuitive observations and subjective bias. Classification-based approaches, on the other hand, consider the candidates for a given anaphor independently of one another and therefore cannot capture the relative preference among competing candidates. To overcome these problems, we propose Chinese anaphora resolution methods based on weight learning and knowledge acquisition.
In this thesis, we address pronominal, zero, and definite anaphora in Chinese texts and present a different approach for each according to its properties. We resolve pronominal anaphora with lexical knowledge acquisition and salience measurement. Lexical knowledge acquisition extracts semantic features such as gender, number, and collocational compatibility; salience measurement selects the antecedent using entropy-based weight assignment. In experiments on 1343 anaphoric instances, the proposed approach achieves a success rate of 82.5%, a 7% improvement over a rule-based approach.
For zero anaphora, we apply case-based reasoning and pattern conceptualization to overcome the difficulty of constructing a reasoning mechanism and the insufficiency of lexical features. In experiments on 1051 anaphoric instances, the approach achieves an F-score of 79%, a 13% improvement over a centering-theory-based approach.
For definite anaphora, we use two strategies. The first is an adaptive-weight salience measurement that evaluates the entire set of candidates simultaneously. The second is a Web-based knowledge acquisition model that, with the help of external dictionaries, determines semantic compatibility. In experiments on 426 anaphoric instances, the approach achieves a success rate of 72.5%, a 4.7% improvement over a classifier-based approach. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT079223807 http://hdl.handle.net/11536/40415 |
Appears in Collections: | Thesis |