A Study on Extraction-based Multidocument Summarization

Title:	A Study on Extraction-based Multidocument Summarization (摘錄式多文件自動化摘要方法之研究)
Authors:	Jen-Yuan Yeh (葉鎮源); Wei-Pang Yang (楊維邦); Hao-Ren Ke (柯皓仁); Institute of Computer Science and Engineering (資訊科學與工程研究所)
Keywords:	multidocument summarization; generic summary; query-focused summary; sentence ranking; sentence extraction; redundancy filtering
Issue Date:	2007
Abstract:	The rapid development of information technology over the past decades has dramatically increased the amount and availability of online information. The resulting information overload has made finding and using the information people actually need, efficiently and effectively, a pressing practical problem in daily life. Text summarization, in which a computer analyzes document content, extracts the important information, and presents it as a summary that preserves the main points, helps people grasp documents quickly and supports decision making. This thesis studies two tasks: (1) multidocument summarization and (2) query-focused multidocument summarization. The first produces a single generic summary from a set of topically related documents.
The second, a special case of the first, generates a query-focused summary that reflects the points relevant to the user's topic(s) of interest. Both tasks are addressed with sentence extraction, the most common summarization technique: important sentences are identified, extracted verbatim from the documents, and composed into an extractive summary. The first step in sentence extraction is to score and rank sentences by importance, which is the main focus of this thesis. For the first task, a novel graph-based sentence ranking method, iSpreadRank, is proposed to rank sentences by their likelihood of belonging to the summary. The input documents are modeled as a sentence similarity network, and iSpreadRank applies spreading activation to infer the relative importance of sentences from the network structure. The method then iteratively extracts one sentence at a time into the summary, choosing a sentence that has both high importance and low redundancy with the sentences already extracted. Evaluated on the DUC 2004 data set, the proposed method performs well on various ROUGE measures and is competitive with the top systems at DUC 2004. For the second task, a new scoring method is proposed that measures the likelihood of a sentence being part of the summary by combining (1) the relevance of the sentence to the query and (2) the informativeness of the sentence. Query relevance is assessed as the similarity between the sentence and the query in a latent semantic space obtained with latent semantic analysis (LSA), while informativeness is estimated from surface-level features.
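The graph-based ranking step described above can be illustrated with a minimal sketch. It is not the thesis's iSpreadRank implementation: the bag-of-words cosine similarity, the edge threshold, the decay factor, the uniform initial activation, and all function names are assumptions chosen only to show how activation can spread over a sentence similarity network.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_network(sentences, threshold=0.1):
    """Symmetric weight matrix; an edge links two sentences whose
    similarity reaches the (assumed) threshold."""
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(bags[i], bags[j])
            if sim >= threshold:
                w[i][j] = w[j][i] = sim
    return w


def spread_activation(w, initial, decay=0.5, iterations=50):
    """Each round, every node keeps its initial activation and receives a
    decayed share of each neighbour's activation, split by edge weight."""
    n = len(w)
    row_sums = [sum(row) for row in w]
    act = list(initial)
    for _ in range(iterations):
        nxt = list(initial)
        for i in range(n):
            if row_sums[i] == 0:
                continue  # isolated sentence: nothing to spread
            for j in range(n):
                if w[i][j]:
                    nxt[j] += decay * act[i] * w[i][j] / row_sums[i]
        act = nxt
    return act


def rank_sentences(sentences):
    """Return sentence indices ordered from most to least activated."""
    w = build_network(sentences)
    scores = spread_activation(w, [1.0] * len(sentences))
    return sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
```

In this sketch a sentence that resembles many other sentences accumulates activation from them, while a sentence with no neighbours keeps only its initial value and therefore ranks last.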
Moreover, a novel sentence extraction method, inspired by maximal marginal relevance (MMR), is developed to iteratively extract one sentence at a time into the summary, provided it is not too similar to any sentence already extracted. Evaluated on the DUC 2005 data set, the proposed method performs well on various ROUGE measures and is competitive with the top systems at DUC 2005.
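The redundancy-aware extraction described above can be sketched as a greedy loop in the style of standard MMR. This is an illustration under stated assumptions, not the thesis's method: the Jaccard word-overlap similarity, the trade-off weight `lam`, the cut-off `max_sim`, and the function names are all hypothetical choices.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences (a stand-in measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def mmr_select(sentences, scores, max_sentences=2, lam=0.7, max_sim=0.6):
    """Greedily pick the candidate maximising
    lam * score - (1 - lam) * redundancy, where redundancy is the highest
    similarity to any sentence already selected. Candidates whose
    redundancy exceeds max_sim are skipped (assumed cut-off)."""
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < max_sentences:
        best, best_value = None, None
        for i in candidates:
            redundancy = max(
                (jaccard(sentences[i], sentences[j]) for j in selected),
                default=0.0,
            )
            if redundancy > max_sim:
                continue  # too similar to something already in the summary
            value = lam * scores[i] - (1 - lam) * redundancy
            if best_value is None or value > best_value:
                best, best_value = i, value
        if best is None:
            break  # every remaining candidate is redundant
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]
```

The `scores` argument stands for whatever importance score the ranking stage produces; the loop only decides which high-scoring sentences add new information to the summary.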
URI:	http://140.113.39.130/cdrfb3/record/nctu/#GT009123805 http://hdl.handle.net/11536/53713
Appears in Collections:	Thesis

Files in This Item:

380501.pdf
