Title: 中文斷詞器之改進
An Improvement on Chinese Parser
Authors: 江振宇
Chen Yu Chiang
陳信宏
Sin-Horng Chen
Institute of Communications Engineering (電信工程研究所)
Keywords: Chinese word tagger (中文斷詞器); Text-to-Speech (語音合成); Word identification (斷詞單元); Compound words (構詞單元); POS tagging (詞類標記); Text normalization (文字正規化)
Issue date: 2003
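Abstract: In this thesis, we design the basic architecture of a Chinese word segmenter and implement it. The modular design makes the overall architecture more systematic, so that the segmenter can serve as a software component of a speech synthesis system, and it resolves the architectural problems of the earlier Chinese word segmenter. The core word identification unit segments text with a rule-based method and uses a lexicon trie to speed up dictionary lookup. For the word combination unit, we apply the word-formation rules provided by Academia Sinica together with rules we compiled ourselves, and we optimize the processing efficiency of this unit. To handle the spoken reading of special symbols, we design a text normalization unit. To evaluate the segmenter, we use the Academia Sinica Balanced Corpus, version 3.0 (中研院平衡語料庫3.0版) as test data. The results show a recall of 0.78 and a precision of 0.87 for word segmentation, and a precision of 0.96 for POS tagging. Finally, we analyze the segmentation results and discuss possible ways to correct segmentation errors.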
In this thesis, a Chinese word tagger for text-to-speech (TTS) is implemented. It contains four basic modules: a word identification module, a word combination module, a POS (part-of-speech) tagging module, and a text normalization module. In the word identification module, we adopt a word matching algorithm with 6 heuristic rules proposed by the Chinese Knowledge Information Processing group (CKIP), Academia Sinica, to identify words in the input Chinese character string. The word combination module groups words into compounds using 95 determinative-measure (DM) compound rules and 10 reduplication rules. The POS tagging module assigns POS tags to the words identified by the word identification module. The text normalization module converts text from its written form to its spoken form. Finally, the Sinica Corpus published by CKIP is used to evaluate the performance of our system. We achieve a recall of 0.78 and a precision of 0.87 in word identification, and a precision of 0.96 in POS tagging. We also analyze the word identification results to suggest directions for future work.
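As an illustration of the lexicon-trie lookup and the recall and precision figures mentioned in the abstract, the following is a minimal sketch, not the thesis implementation. It substitutes plain forward longest matching for the 6 CKIP heuristic rules, which this record does not list, and every name in it (`TrieNode`, `build_trie`, `segment`, `precision_recall`) and the toy lexicon are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set, Tuple


class TrieNode:
    """One node of a character trie over lexicon entries (hypothetical structure)."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


def build_trie(words: Iterable[str]) -> TrieNode:
    """Insert every lexicon word into a trie, one character per edge."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def segment(root: TrieNode, text: str) -> List[str]:
    """Forward longest matching; an unknown character becomes a one-character word."""
    words: List[str] = []
    pos = 0
    while pos < len(text):
        node = root
        end = pos + 1  # fallback when no lexicon word starts at pos
        for i in range(pos, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                end = i + 1  # remember the longest lexicon word seen so far
        words.append(text[pos:end])
        pos = end
    return words


def spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence into a set of (start, end) character offsets."""
    result: Set[Tuple[int, int]] = set()
    pos = 0
    for word in words:
        result.add((pos, pos + len(word)))
        pos += len(word)
    return result


def precision_recall(predicted: List[str], reference: List[str]) -> Tuple[float, float]:
    """A predicted word counts as correct only if both of its boundaries match."""
    pred, ref = spans(predicted), spans(reference)
    if not pred or not ref:
        return 0.0, 0.0
    correct = len(pred & ref)
    return correct / len(pred), correct / len(ref)


# Toy lexicon, assumed for illustration only.
trie = build_trie(["中文", "斷詞", "斷詞器", "語音", "合成"])
print(segment(trie, "中文斷詞器"))  # ['中文', '斷詞器']
```

In this sketch precision is the share of predicted words whose boundaries both agree with the reference segmentation, and recall is the share of reference words that were recovered. Out-of-lexicon words split into single characters lower both scores, precision more sharply.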
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009113504
http://hdl.handle.net/11536/45868
Appears in Collections: Thesis


Files in This Item:

  1. 350401.pdf

If the file is a zip archive, download and extract it, then open index.html in the extracted folder with a browser to view the full text.