A First Study on Mandarin Word-class Bigram Language Model

Title:	中文詞群雙連文語言模型之初步研究 A First Study on Mandarin Word-class Bigram Language Model
Authors:	楊育菁 Yang, Yu-Ching; 陳信宏 Chen, Sin-Horng; 電信工程研究所 Institute of Communications Engineering
Keywords:	中文詞群 (Mandarin word class); 雙連文 (bigram)
Issue Date:	1997
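Abstract:	This thesis focuses on the language model used for linguistic decoding in a Mandarin speech recognition system. From the standpoint of building a working system, we study both the training of the word-class bigram language model parameters and the operation of the linguistic decoding system. For model construction, we first design a word-class bigram model that respects grammatical structure while incorporating the distinctive word-formation properties of Chinese, using the syntactic variation between a word and the words that precede and follow it. We also build a preliminary linguistic decoding system, training the language model with a lexicon of 111,246 words and a text corpus of about 9 million words. This is combined with an acoustic decoding system and evaluated on a speech database consisting of a set of balanced-corpus sentences plus long and short sentences excerpted from newspaper articles. Conventional HMM recognition of the test speech yields base-syllable strings with a syllable recognition rate of 81% and produces a syllable lattice, which is then passed to the linguistic decoding system for final recognition. The baseline recognition rate is 57.69%; after adding a proper-noun lexicon, numeral word-formation rules, and part-of-speech considerations, the recognition rate reaches 64.40%. In this thesis, a word-class bigram model of Chinese is discussed for speech-to-text conversion. An algorithm is first proposed to partition all words of a large lexicon containing 111,246 word entries into several hundred word classes. It considers many linguistic features of words, including part of speech, prefix, suffix, and length, so that words with the same characteristics are clustered together. Then a word-class bigram model is constructed using a text corpus containing 9 million words. The performance of the proposed word-class bigram model was examined by simulation, combining it with an HMM-based base-syllable recognizer to convert speech into text. The base-syllable accuracy rate of the HMM recognizer was 81%. A character accuracy rate of 57.7% was achieved for the baseline system. By further including all proper nouns and some word-formation rules for compound words, the accuracy rate rose to 64.4%.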
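The abstract does not spell out the model's equations, so the following is only a minimal sketch of the standard class-based bigram factorization, P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) · P(w_i | c_i), assuming the thesis's word-class bigram takes roughly this form. The function names, the toy corpus, and the three-class mapping are all hypothetical; the thesis instead derives several hundred classes from part of speech, prefix, suffix, and word length, and no smoothing is applied here.

```python
from collections import Counter


def train_class_bigram(sentences, word_to_class):
    """Collect maximum-likelihood counts for a class-based bigram model."""
    class_bigrams = Counter()   # (previous class, current class) counts
    class_context = Counter()   # how often a class appears as a history
    class_totals = Counter()    # total word tokens per class
    word_counts = Counter()     # token counts per word
    for sentence in sentences:
        classes = ["<s>"] + [word_to_class[w] for w in sentence]
        for prev_c, cur_c in zip(classes, classes[1:]):
            class_bigrams[(prev_c, cur_c)] += 1
            class_context[prev_c] += 1
        for w in sentence:
            word_counts[w] += 1
            class_totals[word_to_class[w]] += 1
    return class_bigrams, class_context, class_totals, word_counts


def bigram_prob(prev_word, word, model, word_to_class):
    """P(word | prev_word) ~= P(class | prev class) * P(word | class)."""
    class_bigrams, class_context, class_totals, word_counts = model
    prev_c = "<s>" if prev_word is None else word_to_class[prev_word]
    cur_c = word_to_class[word]
    if class_context[prev_c] == 0 or class_totals[cur_c] == 0:
        return 0.0  # unseen history or empty class; no smoothing in this sketch
    p_class = class_bigrams[(prev_c, cur_c)] / class_context[prev_c]
    p_word = word_counts[word] / class_totals[cur_c]
    return p_class * p_word


# Hypothetical toy data, for illustration only.
word_to_class = {
    "i": "PRON", "you": "PRON",
    "eat": "VERB", "drink": "VERB",
    "rice": "NOUN", "tea": "NOUN",
}
sentences = [
    ["i", "eat", "rice"],
    ["you", "drink", "tea"],
    ["i", "drink", "tea"],
]
model = train_class_bigram(sentences, word_to_class)

# "eat tea" never occurs in the toy corpus, yet it receives a nonzero
# probability because the VERB -> NOUN class transition was observed.
p_eat_tea = bigram_prob("eat", "tea", model, word_to_class)
```

Sharing statistics across a class is what lets a model of this kind cover a 111,246-word lexicon with a 9-million-word corpus, where most individual word pairs would never be observed.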
URI:	http://140.113.39.130/cdrfb3/record/nctu/#NT863435024 http://hdl.handle.net/11536/63468
Appears in Collections:	Thesis