統計式中文語言模型之初步探討

Full metadata record

DC Field	Value	Language
dc.contributor.author	李安琪	en_US
dc.contributor.author	Lee, An-Chi	en_US
dc.contributor.author	陳信宏	en_US
dc.contributor.author	Chen, Sin-Horng	en_US
dc.date.accessioned	2014-12-12T02:15:42Z	-
dc.date.available	2014-12-12T02:15:42Z	-
dc.date.issued	1995	en_US
dc.identifier.uri	http://140.113.39.130/cdrfb3/record/nctu/#NT840435009	en_US
dc.identifier.uri	http://hdl.handle.net/11536/60759	-
dc.description.abstract	在本論文中，主要是研究中文語音辨認系統中音轉字的語言模型，包括詞雙連文、詞類雙連文和詞群雙連文三種語言模型。我們以實作系統的觀點，分別對其語言參數的訓練方式及音轉字系統運作的效率進行評估，並建立了初步的音轉字架構。在我們的實驗中，使用58362詞的詞庫，約700萬詞的訓練語料庫和76萬字的測試語料庫，詞雙連文模型的平均音轉字正確率為 94.7％，詞群雙連文模型為93.25％，而詞類雙連文則達91.3％，並在加入破音字統計資訊後，正確率也有0.23％的提昇。另外，配合口語語音聲調上的改變，我們也設計一判別法則，使音轉字系統更具包容性。 This thesis presents a first study of Chinese language models for syllable-to-character conversion. Three statistical models are discussed. First, a word bigram model with 58326 word entries is constructed from a large corpus of about 5 million words. Second, a part-of-speech (POS) bigram model with 46 POS entries is constructed from a manually tagged corpus of about 2 million words. Lastly, a word-class bigram model is studied: an algorithm aiming at minimizing mutual information is employed to automatically generate all word classes from the first corpus of 5 million words. A test database of about 760,000 characters was used to evaluate the models. Character accuracy rates of 94.7%, 93.25%, and 91.3% were obtained by the word bigram, word-class bigram, and POS bigram models, respectively. Further improvements that account for the Tone 3 sandhi rule and for polyphonic (Po-in) characters in monosyllabic words are also studied. Experimental results showed that slight performance improvements were achieved.	zh_TW
dc.language.iso	zh_TW	en_US
dc.subject	語言模型	zh_TW
dc.subject	雙連文	zh_TW
dc.subject	詞類	zh_TW
dc.subject	詞群	zh_TW
dc.subject	語料庫	zh_TW
dc.subject	詞庫	zh_TW
dc.subject	language model	en_US
dc.subject	bigram	en_US
dc.subject	part-of-speech	en_US
dc.subject	word-class	en_US
dc.subject	corpus	en_US
dc.subject	lexicon	en_US
dc.title	統計式中文語言模型之初步探討	zh_TW
dc.title	A First Study on Statistical Chinese Language Models	en_US
dc.type	Thesis	en_US
dc.contributor.department	電信工程研究所	zh_TW
Appears in Collections:	Thesis