Title: Initial-Final Modeling and Recognition of Mandarin Speech
Authors: Chou, Chia-Shyan (周嘉賢); Liu, Chi-Min (劉啟民)
Institute of Computer Science and Engineering
Keywords: Initial-Final; Mandarin; Duration; Speaker-Independent
Issue Date: 1995
Abstract: In this thesis we discuss several problems in Mandarin speech recognition in three parts. First, we examine the factors that must be considered when building INITIAL and FINAL models for Mandarin, including the treatment of the INITIAL-FINAL transition area and the effect of the number of states in a hidden Markov model (HMM) on recognition performance. Most importantly, in this part we use a large amount of test data to define the confusion sets of Mandarin INITIALs, and we support the definitions with existing knowledge of articulator movements and with experimental results. We also introduce the notion of acceptable errors: errors arising from regional accents, articulation gestures, or speaking habits that cannot be avoided and can only be reduced once a language model is added. Second, among the confusion sets defined in the first part we select three, the palatals (ㄐ, ㄑ, ㄒ; pinyin j, q, x), the retroflexes (ㄓ, ㄔ, ㄕ; zh, ch, sh), and the dental sibilants (ㄗ, ㄘ, ㄙ; z, c, s). These three sets share a common property: the elements within each set differ greatly in duration. Exploiting this property, we introduce duration features into the conventional Viterbi decoding; this reduces the errors within these sets by about 47% for the speaker-dependent system and raises the overall recognition rate by about 0.7%, while for the speaker-independent system we obtain a gain of about 1%. In the last part we try adding nasalized FINAL models in an attempt to reduce the confusions involving nasal INITIALs, but unfortunately the idea does not succeed. At the end of the thesis we build a speaker-independent system; the experiments show that its overall confusion pattern does not differ much from that of the speaker-dependent system, and that the duration constraint does lower the error rate.

This thesis focuses on three issues in Mandarin speech recognition. First, we consider the modeling of Mandarin speech, including the basic modeling units, the coarticulation effect between INITIALs and FINALs, and the number of states in an HMM. Most importantly, we use a large amount of speech from two speakers to define the confusion sets of Mandarin INITIALs, and we support the definition with knowledge of articulator gestures and with experimental results. From the experiments we also introduce the concept of acceptable errors: an acceptable error is an utterance error caused by improper or customary articulation habits shared by a large number of speakers. Such errors can be tolerated in syllable or word recognition and can be overcome with the help of language models. Second, we focus on three of the previously defined confusion sets: the palatals (ㄐ, ㄑ, ㄒ), the retroflexes (ㄓ, ㄔ, ㄕ), and the dental sibilants (ㄗ, ㄘ, ㄙ). Their common property is that the elements within each set differ greatly in duration. We develop algorithms that incorporate the duration information into the conventional Viterbi algorithm. The experiments show an error reduction of 45% for the three sets and 0.7% for the total errors in the speaker-dependent system. The third issue concerns the confusions of nasal consonants. We try to solve the problem by introducing nasalized FINAL models; however, this does not work, owing to the variation in nasalization level. Finally, all three issues are examined through speaker-independent experiments. The results show that the overall confusion pattern does not change and that introducing duration information improves performance.
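The abstract states that duration information is folded into conventional Viterbi decoding but does not give the exact formulation. The minimal sketch below illustrates one plausible reading: rescoring competing INITIAL hypotheses from a confusion set by adding a weighted Gaussian duration log-likelihood to each hypothesis's HMM Viterbi score. All names (DURATION_STATS, duration_log_prob, rescore_with_duration), the Gaussian duration model, and the numbers are illustrative assumptions, not taken from the thesis.

import math

# Assumed per-INITIAL duration statistics (mean, std) in frames for one
# confusion set, e.g. the retroflexes zh/ch/sh; the numbers are placeholders.
DURATION_STATS = {
    "zh": (12.0, 3.0),
    "ch": (18.0, 4.0),
    "sh": (24.0, 5.0),
}

def duration_log_prob(initial, num_frames):
    """Log-likelihood of the observed INITIAL length under a Gaussian model."""
    mean, std = DURATION_STATS[initial]
    z = (num_frames - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))

def rescore_with_duration(candidates, weight=1.0):
    """
    candidates: list of (initial, viterbi_log_score, initial_frame_count).
    Returns the candidate that maximizes
    acoustic score + weight * duration log-likelihood.
    """
    best_label, best_score = None, None
    for initial, acoustic, frames in candidates:
        total = acoustic + weight * duration_log_prob(initial, frames)
        if best_score is None or total > best_score:
            best_label, best_score = initial, total
    return best_label

# Usage: three competing hypotheses from the same confusion set,
# each with its Viterbi log score and the frame count assigned to the INITIAL.
hyps = [("zh", -105.2, 14), ("ch", -105.9, 14), ("sh", -106.4, 14)]
print(rescore_with_duration(hyps))

In the thesis the duration term may instead be applied inside the dynamic-programming recursion itself (for example as state- or model-level duration penalties); the post-hoc rescoring above is only the simplest way to combine the acoustic and duration scores.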
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT840392031
http://hdl.handle.net/11536/60374
Appears in Collections: Thesis