Title: | 統計式中文語言模型之初步探討 (A First Study on Statistical Chinese Language Models) |
Authors: | Lee, An-Chi (李安琪); Chen, Sin-Horng (陳信宏); Institute of Communication Engineering (電信工程研究所) |
Keywords: | language model; bigram; part-of-speech; word-class; corpus; lexicon |
Issue Date: | 1995 |
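Abstract: | This thesis studies language models for the syllable-to-character conversion stage of a Chinese speech recognition system, covering three models: word bigram, part-of-speech (POS) bigram, and word-class bigram. From the standpoint of a working system, we evaluate how the parameters of each model are trained and how efficiently the syllable-to-character system runs, and we establish a preliminary syllable-to-character conversion framework. In our experiments, using a lexicon of 58362 words, a training corpus of about 7 million words, and a test corpus of 760,000 characters, the average syllable-to-character accuracy was 94.7% for the word bigram model, 93.25% for the word-class bigram model, and 91.3% for the POS bigram model; adding statistical information on polyphonic characters (破音字) raised accuracy by a further 0.23%. In addition, to accommodate tone changes in spoken speech, we designed a decision rule that makes the syllable-to-character system more tolerant. In this thesis, a first study of Chinese language models for syllable-to-character conversion is presented. Three statistical models are discussed. First, a word bigram model with 58326 word entries is constructed using a large corpus of about 5 million words. A POS bigram model with 46 POS entries is then constructed using a manually tagged corpus of about 2 million words. Lastly, a scheme using a word-class bigram model is studied. An algorithm aiming at minimizing mutual information is employed to automatically generate all word classes from the first corpus of 5 million words. A test database of about 760,000 characters was used to examine the models' performance. Character accuracy rates of 94.7%, 93.25%, and 91.3% were obtained by the word bigram, word-class bigram, and POS bigram models, respectively. Further improvements that consider the Tone 3 sandhi rule and the problem of Po-in (polyphonic) characters in monosyllabic words are also studied. Experimental results showed that slight performance improvements were achieved. |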
URI: | http://140.113.39.130/cdrfb3/record/nctu/#NT840435009 http://hdl.handle.net/11536/60759 |
Appears in Collections: | Thesis |