Title: 統計式中文語言模型之初步探討
A First Study on Statistical Chinese Language Models
Authors: 李安琪
Lee, An-Chi
陳信宏
Chen, Sin-Horng
Institute of Communications Engineering
Keywords: 語言模型;雙連文;詞類;詞群;語料庫;詞庫;language model;bigram;part-of-speech;word-class;corpus;lexicon
Issue Date: 1995
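Abstract: This thesis studies language models for syllable-to-character
conversion in Chinese speech recognition systems, covering three models:
word bigram, part-of-speech (POS) bigram, and word-class bigram. From the
viewpoint of building a practical system, we evaluate how the language
parameters of each model are trained and how efficiently the
syllable-to-character system runs, and we establish a preliminary
syllable-to-character framework. In our experiments, using a lexicon of
58362 words, a training corpus of about 7 million words, and a test corpus
of 760,000 characters, the average syllable-to-character accuracy was
94.7% for the word bigram model, 93.25% for the word-class bigram model,
and 91.3% for the POS bigram model. Adding statistical information on
polyphonic characters (破音字) raised accuracy by a further 0.23%. In
addition, to accommodate tone changes in spoken language, we designed a
decision rule that makes the syllable-to-character system more tolerant.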
In this thesis, a first study on Chinese language models for
syllable-to-character conversion is presented. Three statistical
models are discussed. First, a word bigram model with 58326
word entries is constructed using a large corpus
containing about 5 million words. A POS bigram model with 46 POS
entries is then constructed using a manually tagged corpus
containing about 2 million words. Lastly, a scheme using a
word-class bigram model is studied. An algorithm aiming at
minimizing mutual information is employed to automatically
generate all word classes from the first corpus of 5 million
words. A testing database containing about 760,000 characters
was used to examine their performance. Character accuracy
rates of 94.7%, 91.3% and 93.25% were obtained by the word bigram,
POS bigram and word-class bigram models, respectively. Further
improvements that consider the sandhi rule of Tone 3 change and the
problem of polyphonic (Po-in) characters for monosyllabic words are
also studied. Experimental results showed that slight performance
improvements were achieved.
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT840435009
http://hdl.handle.net/11536/60759
Appears in Collections: Thesis