Title: 中文連續語音辨認後處理之進一步研究
A Further Study on Post-Processing of Continuous Mandarin Speech Recognition
Author: 張志豪
Zhi Hao Zhang
Xin Hong Chen
Keywords: Continuous Mandarin Speech Recognition (中文連續語音辨認)
Issue date: 2008
Abstract: This thesis has two parts. The first part examines the suitability of the text corpus used to build the language model. We inspect whether the corpus content is appropriate for language modeling, delete unsuitable content, and correct erroneous text, with the aim of improving the overall recognition rate. The second part concerns the recognition results. We take meaningful long words as the target, rather than matching only against the words in the recognition lexicon. To this end we additionally consider two kinds of word formation: determiner-measure compounds and personal names. Scored this way, the recognition rate drops considerably, which shows that the original recognizer output many meaningful long words of these two types as short words whose meaning is incomplete or wrong. Because the recognition lexicon cannot contain every compound word, we try adding to the lexicon the single-character words or subwords that are frequently used to form these compounds. The intent is for such compounds to be recognized as correct strings of short words, so that later post-processing can produce the correct compounds. The experimental results show that subwords work better than single-character words as compounding components.
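The post-processing idea described in the abstract, joining a correctly recognized string of short words into one compound, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the numeral set, and the measure-word set below are all assumptions chosen for illustration, and only determiner-measure compounds are handled.

```python
# Hypothetical inventories; the thesis does not list its numeral or measure-word sets.
NUMERALS = set("零一二三四五六七八九十百千萬億兩")
MEASURES = {"個", "位", "年", "公斤", "元"}


def is_numeral(token):
    """Return True if the token is non-empty and made only of numeral characters."""
    return len(token) > 0 and all(ch in NUMERALS for ch in token)


def merge_dm_compounds(tokens):
    """Join each run of numeral tokens plus the following measure word into one token.

    Numeral runs that are not followed by a known measure word are left unmerged.
    """
    merged = []
    i = 0
    while i < len(tokens):
        # Find the end of the numeral run starting at position i.
        j = i
        while j < len(tokens) and is_numeral(tokens[j]):
            j += 1
        if j > i and j < len(tokens) and tokens[j] in MEASURES:
            # Numeral run followed by a measure word: emit one compound token.
            merged.append("".join(tokens[i:j + 1]))
            i = j + 1
        else:
            merged.append(tokens[i])
            i += 1
    return merged


recognized = ["他", "買", "了", "三", "十", "五", "個", "蘋果"]
print(merge_dm_compounds(recognized))
```

In this sketch the recognizer only needs to output the short pieces correctly; the compound itself never has to appear in the recognition lexicon, which is the motivation the abstract gives for adding compounding components to the lexicon.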


  1. 355901.pdf
