Title: | 以权重学习与知识撷取为基础之中文指代消解研究 Chinese Anaphora Resolution Based on Weight Learning and Knowledge Acquisition |
Authors: | 吴典松 (Wu, Dian-Song); 梁婷 (Liang, Tyne); Institute of Computer Science and Engineering |
Keywords: | Anaphora Resolution; Feature Weight Learning; Knowledge Acquisition; Web Mining |
Issue Date: | 2010 |
Abstract: | Anaphora is a common linguistic phenomenon used to avoid repeating the same expression within a discourse. Anaphora resolution is the process of identifying the antecedent of an anaphor in context. It plays an essential role in many natural language processing applications, such as machine translation, document summarization, and information extraction. Previous anaphora resolution methods have relied on syntactic rules and on semantic or pragmatic clues to identify the antecedent, while more recent work has focused on statistical or classification-based approaches. However, rule-based approaches usually select the antecedent with a salience score whose feature weights are assigned manually, so errors can arise from intuitive observation and subjective bias. Classification-based approaches, on the other hand, treat the candidates for the same anaphor independently and therefore cannot capture the relative preference among competing candidates. To overcome these problems, we propose Chinese anaphora resolution methods based on weight learning and knowledge acquisition.
In this thesis, we address pronominal, zero, and definite anaphora in Chinese texts and present a different approach for each according to its properties. For pronominal anaphora, we use lexical knowledge acquisition and salience measurement. Lexical knowledge acquisition extracts additional semantic features such as gender, number, and collocation compatibility, and the salience measurement selects the antecedent using entropy-based feature weights. In experiments on 1343 anaphoric instances, the proposed approach achieves an 82.5% success rate, a 7% improvement over a general rule-based approach.
For zero anaphora, we apply case-based reasoning and pattern conceptualization to overcome the difficulty of constructing a proper reasoning mechanism and the insufficiency of lexical features. In experiments on 1051 anaphoric instances, the proposed approach achieves a 79% F-score, a 13% improvement over a general rule-based approach founded on centering theory.
For definite anaphora, we use two strategies. The first is an adaptive-weight salience measurement that evaluates the entire set of candidates simultaneously. The second is a Web-based knowledge acquisition model that combines Web search with external dictionaries to judge semantic compatibility. In experiments on 426 anaphoric instances, the proposed approach achieves a 72.5% success rate, a 4.7% improvement over a conventional classifier-based approach. |
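The abstract mentions entropy-based weighting and the simultaneous evaluation of all antecedent candidates but does not give the formula. As a rough illustration only, the sketch below swaps in the generic entropy weight method from multi-criteria decision making: a feature whose values differ more across candidates receives a larger weight, and every candidate is then scored with the same weights. The function names, the three features, and the example values are hypothetical and are not taken from the thesis, which presumably learns its weights from annotated data rather than from a single candidate set.

```python
import math


def entropy_weights(matrix):
    """Entropy weight method (a stand-in, not the thesis's formula).

    matrix[i][j] is the non-negative value of feature j for candidate i.
    Returns one weight per feature; the weights sum to 1.
    """
    n = len(matrix)        # number of candidates
    m = len(matrix[0])     # number of features
    diversities = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        if total == 0 or n < 2:
            diversities.append(0.0)  # feature cannot separate the candidates
            continue
        entropy = 0.0
        for value in column:
            if value > 0:
                p = value / total
                entropy -= p * math.log(p)
        entropy /= math.log(n)  # normalise to the range [0, 1]
        # Clamp to guard against tiny negative values from rounding.
        diversities.append(max(0.0, 1.0 - entropy))
    norm = sum(diversities)
    if norm == 0:
        return [1.0 / m] * m  # fall back to uniform weights
    return [d / norm for d in diversities]


def rank_candidates(names, matrix):
    """Score all candidates together and return the best one plus all scores."""
    weights = entropy_weights(matrix)
    scores = {
        name: sum(w * v for w, v in zip(weights, row))
        for name, row in zip(names, matrix)
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical candidates for the pronoun "she".
# Assumed feature columns: gender agreement, number agreement, recency.
names = ["Mary", "John", "the report"]
matrix = [
    [1.0, 1.0, 0.5],
    [0.0, 1.0, 1.0],
    [0.0, 1.0, 0.8],
]
best, scores = rank_candidates(names, matrix)
print(best, scores)
```

In this toy input, number agreement is identical for every candidate and so receives essentially no weight, while gender agreement separates the candidates sharply and dominates the score. That is the intuition behind replacing hand-assigned weights with entropy-derived ones.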
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT079223807 http://hdl.handle.net/11536/40415 |
Appears in Collections: | Thesis |