A First Study on Statistical Chinese Language Models

Title:	統計式中文語言模型之初步探討 (A First Study on Statistical Chinese Language Models)
Authors:	李安琪 (Lee, An-Chi); 陳信宏 (Chen, Sin-Horng); Institute of Communications Engineering (電信工程研究所)
Keywords:	language model; bigram; part-of-speech; word-class; corpus; lexicon
Issue Date:	1995
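Abstract:	This thesis studies language models for the syllable-to-character conversion stage of a Mandarin Chinese speech recognition system, covering three models: a word bigram, a part-of-speech (POS) bigram, and a word-class bigram. From the standpoint of a practical system, we evaluate how the parameters of each model are trained and how efficiently the conversion system runs, and we establish a preliminary syllable-to-character conversion framework. In our experiments, using a lexicon of 58,362 words, a training corpus of about 7 million words, and a test corpus of 760,000 characters, the average character accuracy is 94.7% for the word bigram model, 93.25% for the word-class bigram model, and 91.3% for the POS bigram model. Adding statistical information on polyphonic characters (破音字) raises accuracy by a further 0.23%. In addition, to accommodate tone changes in spoken speech, we design a decision rule that makes the conversion system more tolerant.
This thesis presents a first study of Chinese language models for syllable-to-character conversion. Three statistical models are discussed. First, a word bigram model with 58,326 word entries is constructed from a large corpus of about 5 million words. Second, a POS bigram model with 46 POS entries is constructed from a manually tagged corpus of about 2 million words. Lastly, a word-class bigram model is studied, in which an algorithm aiming at minimizing mutual information automatically generates all word classes from the first corpus of 5 million words. A test database of about 760,000 characters was used to evaluate the models. The word bigram, word-class bigram, and POS bigram models achieved character accuracy rates of 94.7%, 93.25%, and 91.3%, respectively. Further improvements that account for the Tone 3 sandhi rule and for polyphonic (Po-in, 破音字) characters in monosyllabic words are also studied; experimental results show that they yield slight performance gains.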
URI:	http://140.113.39.130/cdrfb3/record/nctu/#NT840435009 http://hdl.handle.net/11536/60759
Appears in Collections:	Thesis
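
The word bigram approach summarized in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy lexicon, the toy corpus, the pinyin-with-tone-number syllable notation, the add-one smoothing, and the function names `train_bigram`, `log_prob`, and `decode` are all assumptions made for illustration. The sketch trains bigram counts on a segmented corpus and then uses Viterbi search to pick the most probable word sequence for a syllable sequence.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy lexicon: word -> syllable sequence (pinyin with tone digits).
LEXICON = {
    "語音": ("yu3", "yin1"),
    "雨": ("yu3",),
    "音": ("yin1",),
    "辨認": ("bian4", "ren4"),
    "便": ("bian4",),
    "任": ("ren4",),
}

# Hypothetical toy training corpus of word-segmented sentences.
CORPUS = [["語音", "辨認"], ["語音", "辨認"], ["雨", "音"]]

START = "<s>"  # sentence-start marker used as the first history word


def train_bigram(corpus):
    """Count history words and (history, word) pairs."""
    history, pairs = Counter(), Counter()
    for sentence in corpus:
        prev = START
        for word in sentence:
            history[prev] += 1
            pairs[(prev, word)] += 1
            prev = word
    return history, pairs


def log_prob(prev, word, history, pairs, vocab_size):
    """Add-one smoothed log P(word | prev); smoothing choice is an assumption."""
    return math.log((pairs[(prev, word)] + 1) / (history[prev] + vocab_size))


def decode(syllables, lexicon, history, pairs):
    """Viterbi search for the most probable word sequence, or None if no parse."""
    by_syllables = defaultdict(list)
    for word, syls in lexicon.items():
        by_syllables[syls].append(word)
    max_len = max(len(s) for s in lexicon.values())
    vocab_size = len(lexicon)
    n = len(syllables)
    # best[i][w] = (score, (j, prev)): best path covering syllables[:i] that
    # ends in word w, whose previous word prev ends at position j.
    best = [dict() for _ in range(n + 1)]
    best[0][START] = (0.0, None)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            for word in by_syllables.get(tuple(syllables[j:i]), []):
                for prev, (score, _) in best[j].items():
                    cand = score + log_prob(prev, word, history, pairs, vocab_size)
                    if word not in best[i] or cand > best[i][word][0]:
                        best[i][word] = (cand, (j, prev))
    if not best[n]:
        return None
    # Trace back from the best-scoring final word to the start marker.
    word = max(best[n], key=lambda w: best[n][w][0])
    i, words = n, []
    while word != START:
        words.append(word)
        _, (i, word) = best[i][word]
    return words[::-1]


history, pairs = train_bigram(CORPUS)
print(decode(["yu3", "yin1", "bian4", "ren4"], LEXICON, history, pairs))
```

On this toy data the search prefers the two-word reading 語音 辨認 over the single-character alternatives because both of its bigrams were seen in training. A class-based variant would, under the usual formulation, replace the word-to-word probability with a class-to-class probability multiplied by the probability of the word given its class.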