Title: 摘录式多文件自动化摘要方法之研究 (A Study on Extraction-based Multidocument Summarization)
Authors: 叶镇源 (Jen-Yuan Yeh)
杨维邦 (Wei-Pang Yang)
柯皓仁 (Hao-Ren Ke)
Institute of Computer Science and Engineering
Keywords: multidocument summarization; generic summary; query-focused summary; sentence ranking; sentence extraction; redundancy filtering
Issue Date: 2007
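Abstract: With the rapid development of information technology, the volume and availability of online information have grown substantially. This information explosion causes information overload, and obtaining the needed information efficiently and using it effectively has become a pressing problem in everyday life. Text summarization has a computer analyze document content, extract the important information, and present it as a summary. The technique helps people process information and grasp a document's content quickly, as a reference for decision making.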

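This thesis investigates methods for multidocument summarization. The research topics are (1) multidocument summarization and (2) query-focused multidocument summarization. Multidocument summarization produces a single summary from multiple topically related documents; query-focused multidocument summarization extracts, from multiple topically related documents, the content relevant to the user's interests and produces a single summary from it. The thesis adopts sentence extraction: it judges the importance of sentences and extracts the important ones verbatim to produce an extractive summary. Its focus is on measuring sentence importance and on sentence ranking methods.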

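For multidocument summarization, this thesis proposes a graph-based sentence ranking method, iSpreadRank. The method builds a sentence similarity network as the model for analyzing the documents and applies spreading activation theory to derive sentence importance as the basis for ranking. Important sentences are then selected in order to form the summary; during selection, preference is given to sentences with lower information redundancy with those already selected. In the experiments, the method is applied to the DUC 2004 data set. The evaluation shows that, compared with the systems that competed at DUC 2004, the proposed method performs well on ROUGE measures.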

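For query-focused multidocument summarization, this thesis proposes a measure of sentence importance that combines (1) the similarity between a sentence and the query topic and (2) the informativeness of the sentence. Latent semantic analysis is used to compute the similarity between a sentence and the query topic in the semantic space, and surface-level features, which traditional summarization methods use to assess sentence representativeness, are used to estimate the informativeness of a sentence. Building on maximum marginal relevance and taking information redundancy into account, the thesis also proposes a sentence extraction method suited to query-focused multidocument summarization. In the experiments, the method is applied to the DUC 2005 data set. The evaluation shows that, compared with the systems that competed at DUC 2005, the proposed method performs well on ROUGE measures.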
The rapid development of information technology over the past decades has dramatically increased the amount and availability of online information. This explosion of information has led to information overload: finding the information people really need efficiently, and using it effectively, has become a pressing practical problem in daily life. Text summarization, which automatically condenses the content of one or more documents while preserving the main points, is an obvious technique for helping people interact with information.

This thesis discusses work on two summarization tasks: (1) multidocument summarization, and (2) query-focused multidocument summarization. The first produces a generic summary of a set of topically related documents. The second, a particular case of the first, generates a query-focused summary that reflects the points relevant to the user’s topic(s) of interest. Both tasks are addressed using the most common summarization technique, sentence extraction: important sentences are identified, extracted verbatim from the documents, and composed into an extractive summary. The first step in sentence extraction is to score and rank sentences in order of importance, which is the major focus of this thesis.

In the first task, a novel graph-based sentence ranking method, iSpreadRank, is proposed to rank sentences according to their likelihood of being part of the summary. The input documents are modeled as a sentence similarity network, and iSpreadRank applies spreading activation to infer the relative importance of sentences from the network structure. The method then iteratively extracts one sentence at a time into the summary, choosing a sentence that not only has high importance but also has low redundancy with the sentences extracted before it. The proposed summarization method is evaluated on the DUC 2004 data set and performs well on various ROUGE measures. Experimental results show that it is competitive with the top systems at DUC 2004.
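The pipeline described in this paragraph can be illustrated with a minimal Python sketch. It is not the thesis's iSpreadRank implementation: the bag-of-words cosine similarity, the edge threshold, the uniform initial activation, the decay factor, and the function names `build_network`, `spread_activation`, and `select_sentences` are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def build_network(sentences, threshold=0.1):
    """Symmetric weight matrix; an edge exists when similarity >= threshold.

    The threshold value is an assumption, not taken from the thesis.
    """
    vecs = [Counter(s.lower().split()) for s in sentences]
    n = len(vecs)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(vecs[i], vecs[j])
            if sim >= threshold:
                w[i][j] = w[j][i] = sim
    return w


def spread_activation(w, initial, decay=0.5, iterations=20):
    """Iterate act[j] = initial[j] + decay * sum_i act[i] * norm[i][j].

    Rows are normalised so each node passes on a share of its activation;
    with decay < 1 the iteration converges.
    """
    n = len(w)
    norm = []
    for row in w:
        total = sum(row)
        norm.append([x / total if total else 0.0 for x in row])
    act = list(initial)
    for _ in range(iterations):
        act = [
            initial[j] + decay * sum(act[i] * norm[i][j] for i in range(n))
            for j in range(n)
        ]
    return act


def select_sentences(w, scores, k=2, max_sim=0.5):
    """Greedy pick by score, skipping sentences too similar to chosen ones."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    chosen = []
    for i in order:
        if all(w[i][j] <= max_sim for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return chosen
```

Calling `build_network` on a list of sentences, passing the result to `spread_activation` with an initial score per sentence, and then calling `select_sentences` yields the indices of the extracted sentences. In a fuller system the initial activation would presumably come from sentence features rather than a constant.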

In the second task, a new scoring method is proposed to measure the likelihood of a sentence being part of the summary. It combines (1) the degree of relevance of the sentence to the query, and (2) the informativeness of the sentence. Query relevance is assessed as the similarity between the sentence and the query, computed in a latent semantic space, while informativeness is estimated using surface-level features. Moreover, a novel sentence extraction method, inspired by maximal marginal relevance (MMR), is developed to iteratively extract one sentence at a time into the summary, provided it is not too similar to any sentence already extracted. The proposed summarization method is evaluated on the DUC 2005 data set and performs well on various ROUGE measures. Experimental results show that it is competitive with the top systems at DUC 2005.
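The scoring and extraction steps in this paragraph can be sketched as follows. This is a hedged illustration, not the thesis's method: a plain bag-of-words cosine stands in for the latent-semantic-space similarity, the informativeness values are supplied by the caller rather than derived from surface-level features, the linear combination with weight `alpha` is an assumption, and the selection loop uses the classic MMR trade-off (weight `lam`) rather than whatever variant the thesis develops. The names `combined_score` and `mmr_extract` are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Lower-cased bag-of-words vector (stand-in for an LSA projection)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def combined_score(relevance, informativeness, alpha=0.7):
    """Assumed linear mix of query relevance and informativeness."""
    return alpha * relevance + (1 - alpha) * informativeness


def mmr_extract(sentences, query, informativeness, k=2, alpha=0.7, lam=0.5):
    """Pick k sentence indices, trading off score against redundancy."""
    vecs = [bow(s) for s in sentences]
    q = bow(query)
    base = [
        combined_score(cosine(v, q), info, alpha)
        for v, info in zip(vecs, informativeness)
    ]
    selected = []
    candidates = set(range(len(sentences)))
    while candidates and len(selected) < k:

        def mmr(i):
            # Redundancy is the highest similarity to any chosen sentence.
            redundancy = max(
                (cosine(vecs[i], vecs[j]) for j in selected), default=0.0
            )
            return lam * base[i] - (1 - lam) * redundancy

        best = max(sorted(candidates), key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` the redundancy penalty disappears and the function reduces to plain ranking by the combined score, which makes the effect of the MMR term easy to compare on the same input.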
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009123805
http://hdl.handle.net/11536/53713
Appears in Collections: Thesis


Files in This Item:

  1. 380501.pdf
