Title: | Automatic Summarization System based on AdaBoost (自動摘要系統基於AdaBoost) |
Authors: | 鍾喻安 Chung, Yu-An; 李嘉晃 Lee, Chia-Hoang; Institute of Computer Science and Engineering |
Keywords: | Summarization (摘要); AdaBoost |
Date issued: | 2009 |
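Abstract: | With the development of technology and the Internet, the information available online is growing exponentially. Search engines can help us find relevant information, but the results matching a query often still number in the thousands. Automatic summarization can extract the information readers want from large volumes of documents and data, making that material easier to browse.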
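Machine learning approaches to summarization mostly rely on a single learning method. These methods mainly use features such as the title, similarity, or term importance to select the key sentences of an article. Unlike these traditional machine learning methods, ensemble learning can be viewed as a meta-algorithm: it can combine several different algorithms at once to produce better classification results.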
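This thesis applies the AdaBoost algorithm to the summarization problem. AdaBoost is an ensemble learning algorithm that provides a design framework in which the system can define multiple weak classifiers, each of which must have an accuracy above 50%. The goal is to use a classification problem with known labels to train a good set of weak classifiers, together with a weight for each classifier in the set, and then to combine them into a single strong classifier. In our system, we design weak classifiers from important feature information in the documents and apply AdaBoost to Chinese document summarization. For summaries at compression rates of 15%, 20%, and 30%, the accuracy is 50.0182264%, 48.1455086%, and 49.5370552%, respectively, which is close to 50%.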
For English, the ROUGE-1, ROUGE-2, and ROUGE-SU4 scores are 41.201%, 10.003%, and 14.845%, respectively.
Due to the rapid advancement of digital technology over the last two decades, an increasingly large amount of digital content has become available on the Web. This enormous and continuously growing volume of data necessitates efficient and effective text summarization systems. In this thesis, we propose using AdaBoost for the news summarization task. AdaBoost allows the system to incorporate many rules of thumb and adaptively adjusts the weights of these rules. Once training is complete, the system constructs a strong classifier as a weighted linear combination of the weak classifiers. We design the weak classifiers from several features. To evaluate the system, we collected 200 news articles from 4 categories as the data set and ran experiments under different compression rates. The results show that our system performs stably, with average F-values of 0.5002, 0.4815, and 0.4954 under 15%, 20%, and 30% compression rates, respectively. They also show that AdaBoost with weak classifiers can outperform systems using SVMs and SVR for news summarization. We also apply our system to the English text summarization corpus DUC2002, where the recall for ROUGE-1, ROUGE-2, and ROUGE-SU4 is 41.201%, 10.003%, and 14.845%, respectively. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT079755628 http://hdl.handle.net/11536/45973 |
Appears in collections: | Thesis |
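The training procedure outlined in the abstract (weak classifiers better than chance, a learned weight per classifier, a weighted combination as the strong classifier) can be sketched as follows. This is a minimal illustration of standard discrete AdaBoost, not the thesis's implementation: the feature names (`title_overlap`, `length`, `keyword_score`), the thresholds, the toy data, and the reading of compression rate as the fraction of sentences kept are all assumptions made for the example.

```python
import math

# Hypothetical weak classifiers: each maps a sentence's features to +1
# (summary sentence) or -1 (not a summary sentence). Thresholds are invented.
def h_title(x):
    return 1 if x["title_overlap"] >= 0.4 else -1

def h_length(x):
    return 1 if x["length"] >= 15 else -1

def h_keyword(x):
    return 1 if x["keyword_score"] >= 0.5 else -1

def train_adaboost(X, y, weak_classifiers, rounds):
    """Discrete AdaBoost; returns a list of (alpha, weak_classifier) pairs."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        # Pick the weak classifier with the lowest weighted error.
        best_h, best_err = None, float("inf")
        for h in weak_classifiers:
            err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
            if err < best_err:
                best_h, best_err = h, err
        if best_err >= 0.5:
            break  # nothing is better than chance on the reweighted data
        err = max(best_err, 1e-10)  # avoid division by zero for a perfect rule
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, best_h))
        # Raise the weight of misclassified samples, lower the rest, renormalize.
        w = [wi * math.exp(-alpha * yi * best_h(xi)) for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def score(ensemble, x):
    """Weighted linear combination of the weak classifiers' votes."""
    return sum(alpha * h(x) for alpha, h in ensemble)

def predict(ensemble, x):
    return 1 if score(ensemble, x) >= 0 else -1

def summarize(ensemble, X, compression_rate):
    """Return indices of the top-scoring sentences, in document order."""
    k = max(1, round(len(X) * compression_rate))
    ranked = sorted(range(len(X)), key=lambda i: score(ensemble, X[i]), reverse=True)
    return sorted(ranked[:k])

# Toy document of six sentences; each weak classifier errs on exactly one of them.
X = [
    {"title_overlap": 0.8, "length": 20, "keyword_score": 0.9},
    {"title_overlap": 0.1, "length": 5, "keyword_score": 0.6},
    {"title_overlap": 0.6, "length": 18, "keyword_score": 0.7},
    {"title_overlap": 0.0, "length": 7, "keyword_score": 0.1},
    {"title_overlap": 0.2, "length": 25, "keyword_score": 0.2},
    {"title_overlap": 0.3, "length": 22, "keyword_score": 0.8},
]
y = [1, -1, 1, -1, -1, 1]

ensemble = train_adaboost(X, y, [h_title, h_length, h_keyword], rounds=3)
print([predict(ensemble, x) for x in X])
print(summarize(ensemble, X, 0.5))
```

On this toy data no single rule labels every sentence correctly, but the weighted vote of the three does, which is the effect the abstract attributes to combining weak classifiers. Ranking by the combined score and keeping the top fraction is one plausible way to turn the classifier into an extractive summarizer at a given compression rate.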