Title: Chinese Textual Analysis Techniques and Their Applications (中文文本分析技術的開發和應用)
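Author: LIANG TYNE (梁婷)
Affiliation: Department (and Graduate Institute) of Computer Science, National Chiao Tung University (國立交通大學資訊工程學系(所))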
Keywords: natural language processing (自然語言處理); Chinese text (中文); textual analysis (文本分析); discourse analysis (語篇分析); essay scoring (作文評量)
Publication date: 2008
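Abstract (translated from Chinese): Chinese Textual Analysis Techniques and Their Applications (II) (III)
As the learning and use of Chinese receive growing attention around the world, the development of Chinese language processing techniques is becoming increasingly important. Current Chinese text analysis concentrates mostly on processing at the phrase or clause level; semantic processing across sentence groups and paragraphs has not yet been widely discussed. In this project, therefore, besides continuing the anaphora resolution research carried out in the first phase, we will in the second phase propose effective computational modules for analyzing the discourse coherence structure of written Chinese. In the third phase, we will apply the text analysis techniques developed to build and deploy a system for scoring and analyzing Chinese essays.
In general, effective discourse analysis helps clarify the topic or logical structure of a text. In this project we will propose effective tagging procedures for nine common discourse coherence structures, such as coordination (並列), succession (承接), progression (遞進), and transition (轉折), and build a discourse-tagged corpus that can serve as a reference for evaluation. The proposed tagging procedures will rest on corpus-based analysis, mining surface features such as cue words, consecutive parts of speech, and special punctuation marks. Using corpora of different genres (narrative, expository, argumentative, and so on), we will test the performance of rule-based and machine-learning tagging procedures separately, and analyze in depth the dependency structure and sequence distribution of coherence relations in texts of different genres.
In the third phase, drawing on experts' research on essay scoring criteria, we propose an essay analysis and assessment system intended to benefit both writers and assessors. The main modules of the system cover corpus collection and organization, text processing, content analysis, structural organization analysis, language ability analysis, scoring mechanism analysis, and the construction of external knowledge bases and auxiliary query tools. The content analysis module processes the topic of the essay title and checks whether the essay content is on topic. The structural organization analysis module handles the coherence of the essay content, the completeness and diversity of its discourse rhetorical structure, and the surface structure of the essay. The language use analysis module analyzes the writer's command of the language, from word choice and phrase co-occurrence to syntactic analysis. In addition, we will design both machine-learning and rule-based scoring mechanisms and, for each assessment item, build classifiers for each score level and corresponding tables of comments with reference to experts. We will also integrate previously built auxiliary tools, including co-occurring phrase extraction, similar sentence retrieval, and discourse coherence structure retrieval, so that learners of Chinese can look up examples.
We believe that carrying out this project will not only yield an in-depth study of Chinese text structure and novel rules, but also benefit the development of information automation technology and learners of Chinese.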
Chinese Textual Analysis Techniques and Their Applications (II) (III)
As learning Chinese becomes popular around the world, developing efficient Chinese processing techniques is more important than ever. However, most Chinese processing techniques proposed in the recent literature focus on analysis at the term or clause level, and textual structure has not been widely discussed. This project therefore addresses discourse structure analysis: it will propose efficient procedures for recognizing different discourse rhetorical structures and present a well-tagged evaluation corpus for discourse structure research. In addition, the project will propose an automatic essay scoring and analysis system that applies the proposed Chinese textual analysis techniques.
Appropriate discourse analysis helps in understanding the logical structure of an essay. This project will therefore address nine types of discourse structure that are common in Chinese texts, for example "elaboration", "cause-effect", and "similarity/contrast". We will present both rule-based and machine learning approaches for tagging the nine types of relation structure, using features extracted from a balanced corpus. We will also investigate the relation sequences and their dependencies in various discourse structures.
The proposed automatic essay scoring and analysis system is intended to help both Chinese learners and essay evaluators. It will be composed of modules for essay corpus construction, basic textual processing, content analysis of the agreement between title and essay, organization analysis of coherence structure and layout structure, language capability analysis from terms to sentences, essay scoring, and the construction of external resources and assistant tools.
We believe that carrying out this project will not only help explore better and novel techniques for analyzing Chinese texts but also advance Chinese information automation.
Official document #: NSC96-2221-E009-168-MY2
URI: http://hdl.handle.net/11536/102777
https://www.grb.gov.tw/search/planDetail?id=1592492&docId=273169
Appears in collections: Research Plans (研究計畫)