Title: Topic Recognition and Its Application to Chinese Texts (中文主題詞辨識與其應用)
Authors: Shan-Chun Pan (潘善均)
Tyne Liang (梁婷)
Institute of Computer Science and Engineering
Keywords: Chinese topic recognition (中文主題辨識); topic feature analysis (主題特徵); off-topic detection (離題偵測); coherence evaluation (連貫性評量)
Issue Date: 2006
Abstract: Topic recognition is an essential part of text understanding: it identifies the core statements of a text and can be applied to topic detection and essay scoring. This thesis first uses a Centering Theory-based method to obtain the center of each clause. It then takes these clause centers as candidates and identifies the topic of each long sentence according to a set of topic features; the method requires no training corpus. Finally, the sentence topics are applied to off-topic detection in student essays, and the clause centers are applied to coherence evaluation. Topic recognition was validated on 11 newspaper editorials averaging about 1,500 characters each, testing experimental modules that cover the various topic features; on the editorials it reached 86.84% precision and 68.51% recall. A further 95 student essays of about 400 characters each were collected for experiments on topic recognition, off-topic detection, and coherence evaluation; topic recognition on the student essays reached 80.86% precision and 71.36% recall. In off-topic detection, identifying off-topic essays reached 63.36% precision and 77.77% recall. Using sentence topics for off-topic detection avoids the difficulty of detecting off-topic writing from the full vocabulary of an essay (a bag-of-words approach), but some problems remain unsolved: the system cannot recognize essays that are off-topic at the level of the student's conceptual understanding, and it can misjudge an essay that cites novel examples as off-topic.
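The precision and recall figures quoted in the abstract can be read against the standard set-based definitions. The sketch below is illustrative only: it assumes that evaluation compares a set of system-predicted topic words with a set of human-annotated ones, which the abstract does not spell out, and the function name `precision_recall` and the toy word sets are hypothetical, not data from the thesis.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for predicted vs. gold topic words.

    Assumption: both arguments are collections of topic words and a
    prediction counts as correct when it also appears in the gold set.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data, not taken from the thesis corpus.
predicted_topics = {"教育", "學生", "政府", "環境"}
gold_topics = {"教育", "學生", "環境", "經濟", "社會"}

p, r = precision_recall(predicted_topics, gold_topics)
print(f"precision={p:.2%} recall={r:.2%}")
```

With these toy sets, three of the four predicted words are correct and three of the five annotated words are found, which shows how a system can have higher precision than recall, the same pattern as the topic recognition figures reported in the abstract.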
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009455511
http://hdl.handle.net/11536/82039
Appears in Collections: Thesis


Files in This Item:

  1. 551101.pdf

If the file is a zip archive, download and extract it, then open index.html in the extracted folder with a browser to view the full text.