Improvement on Language Modeling for Large-vocabulary Mandarin Speech Recognition

Title:	中文大詞彙語音辨認之語言模型改進 Improvement on Language Modeling for Large-vocabulary Mandarin Speech Recognition
Authors:	周建邦 Chou, Chien-Pang; 陳信宏 Chen, Sin-Horng; Institute of Communications Engineering
Keywords:	Large-vocabulary Mandarin; Speech Recognition; Language Model
Issue Date:	2009
Abstract:	This research investigates improvements to language modeling for large-vocabulary Mandarin speech recognition. Conventional large-vocabulary recognizers mostly rely on statistical language models that estimate bigram or trigram probabilities over a lexicon of tens of thousands of words. This approach has a drawback: it cannot recognize out-of-vocabulary (OOV) words, which include determiner-measure compounds, proper nouns, and infrequent affixed words. We therefore study a statistical language model that mixes words and subwords, with the aim of increasing lexicon coverage and reducing the number of words that cannot be recognized.
The thesis has three parts. First, we preprocess the text corpora and assess whether their contents are suitable for building a language model: we remove unsuitable content (English text, article titles, and the like), correct erroneous characters, and perform word segmentation and text normalization. Second, we build the hybrid word/subword statistical language model, examining the strategy for selecting lexicon entries, the method for decomposing words not in the lexicon into subwords, and the construction of the mixed model. Finally, we adopt a two-stage recognition framework, describe the recognition method and the experimental results, compare the recognition results of the proposed model with those of the conventional model, and analyze in depth the benefit for the three word-formation types considered in this work (personal names, affixed words, and determiner-measure compounds).
To evaluate the proposed method, we use the TCC300 microphone speech corpus as the speech data, and train the language models on the Taiwan Panorama magazine (台灣光華雜誌) and NTCIR3.0 (中文檢索標竿) text corpora. The experimental results show that, compared with the conventional statistical language model, the proposed hybrid model improves the performance of the large-vocabulary recognition system: overall word accuracy rises from 60.86% to 62.85%. Further analysis shows that the two-stage recognition method does help with personal names, affixed words, and determiner-measure compounds; the increase in the number of correctly recognized items in these three classes supports the effectiveness of the proposed method.
URI:	http://140.113.39.130/cdrfb3/record/nctu/#GT079613545 http://hdl.handle.net/11536/41982
Appears in Collections:	Thesis
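
The hybrid word/subword idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's actual method: the abstract does not say how out-of-lexicon words are decomposed, so the sketch assumes the simplest possible rule (split them into single characters), and it uses add-alpha smoothing as a stand-in for whatever smoothing the thesis applied. The function names `to_hybrid_tokens`, `train_bigram`, and `bigram_prob`, and the example words, are hypothetical.

```python
from collections import Counter


def to_hybrid_tokens(words, lexicon):
    """Keep in-lexicon words whole; split other words into characters.

    Assumption: single characters serve as the subword units.
    """
    tokens = []
    for word in words:
        if word in lexicon:
            tokens.append(word)
        else:
            tokens.extend(word)  # each character becomes one subword token
    return tokens


def train_bigram(sentences, lexicon):
    """Count histories and bigrams over the mixed word/subword token stream."""
    history_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        tokens = ["<s>"] + to_hybrid_tokens(words, lexicon) + ["</s>"]
        history_counts.update(tokens[:-1])  # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))
    return history_counts, bigram_counts


def bigram_prob(history_counts, bigram_counts, prev, cur, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(cur | prev); smoothing is an assumption."""
    numerator = bigram_counts[(prev, cur)] + alpha
    denominator = history_counts[prev] + alpha * vocab_size
    return numerator / denominator


if __name__ == "__main__":
    lexicon = {"我", "喜歡"}            # toy lexicon, hypothetical
    corpus = [["我", "喜歡", "王小明"]]  # "王小明" is out of lexicon
    print(to_hybrid_tokens(corpus[0], lexicon))
    history_counts, bigram_counts = train_bigram(corpus, lexicon)
    print(bigram_prob(history_counts, bigram_counts, "喜歡", "王", vocab_size=7))
```

Because the personal name is not in the toy lexicon, it enters the model as a sequence of character tokens, so a recognizer using this model can still hypothesize it instead of treating it as unrecognizable.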

Files in This Item:

354501.pdf
