Title: | 使用韻律信息之中文自發性語音辨認 (A Prosody-Assisted Mandarin Spontaneous Speech Recognition) |
Authors: | 黃仰駿 (Huang, Yang-Chun); 陳信宏 (Chen, Sin-Horng); Institute of Communications Engineering |
Keywords: | spontaneous speech; speech recognition |
Issue Date: | 2014 |
Abstract: | Recognition of read Mandarin speech has reached fairly good performance in recent years, but spontaneous speech remains difficult to recognize because of its faster speaking rate, irregular grammar, and disfluent speech flow. This thesis studies Mandarin spontaneous speech recognition, focusing on language model (LM) construction and on a recognition process that incorporates prosodic information. For the LM, two special word types are added to the vocabulary to model disfluency: particles (interjections that speakers use when hesitating) and markers (meaningless habitual fillers). LM adaptation is also applied to compensate for the shortage of spontaneous-speech text and for the differences in grammar and speech flow between spontaneous and read speech. Recognition proceeds in two stages. In the first stage, a conventional acoustic model and a bigram LM are used to generate a word lattice. In the second stage, the lattice is first expanded so that the bigram LM can be replaced with a factored LM; break-related models and prosodic-state-related models of a hierarchical prosodic model are then added in sequence to rescore all search paths, which yields the best word sequence and decodes the associated information at the same time. In experiments on the Academia Sinica MCDC corpus, the system achieved word, character, and base-syllable accuracy rates of 58.29%, 64.94%, and 68.89%, absolute improvements of 4.43%, 4.6%, and 3.06% over the baseline that uses only the first stage. Error analysis shows that, for fluent speech, prosodic information helps resolve word segmentation ambiguity and tone recognition errors, whereas for disfluent speech the improvement is very limited. |
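The two-stage idea in the abstract (generate a word lattice, then rescore its paths with extra knowledge sources) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the `Edge` fields, the `best_path` function, the weights, and every score in the toy lattice are hypothetical, and the factored LM and hierarchical prosodic model are reduced to one precomputed log score per edge. The toy lattice encodes a word segmentation ambiguity, one two-syllable word "AB" against the two words "A" and "B".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One word hypothesis in a toy lattice (all fields are illustrative)."""
    start: int        # source node id
    end: int          # target node id
    word: str
    acoustic: float   # log acoustic score
    lm: float         # log language-model score
    prosody: float    # log prosodic score (stand-in for break/prosodic-state models)


def best_path(edges, start, final, lm_weight=1.0, prosody_weight=0.0):
    """Return (score, words) of the highest-scoring path from start to final.

    Assumes node ids are topologically ordered (every edge goes from a
    smaller id to a larger one), so edges can be relaxed in order of start.
    """
    best = {start: (0.0, [])}
    for edge in sorted(edges, key=lambda e: e.start):
        assert edge.start < edge.end, "node ids must be topologically ordered"
        if edge.start not in best:
            continue  # node not reachable from the start node
        score, words = best[edge.start]
        candidate = (score + edge.acoustic
                     + lm_weight * edge.lm
                     + prosody_weight * edge.prosody)
        if edge.end not in best or candidate > best[edge.end][0]:
            best[edge.end] = (candidate, words + [edge.word])
    return best[final]


# Toy lattice with made-up scores: "AB" competes with "A" followed by "B".
LATTICE = [
    Edge(0, 2, "AB", acoustic=-4.0, lm=-1.0, prosody=-3.0),
    Edge(0, 1, "A", acoustic=-2.0, lm=-1.5, prosody=-0.5),
    Edge(1, 2, "B", acoustic=-2.0, lm=-1.0, prosody=-0.5),
    Edge(2, 3, "C", acoustic=-1.0, lm=-0.5, prosody=-0.2),
]

# Stage 1 analogue: acoustic and LM scores only.
stage1_score, stage1_words = best_path(LATTICE, 0, 3)
# Stage 2 analogue: the same lattice rescored with the prosodic score added.
stage2_score, stage2_words = best_path(LATTICE, 0, 3, prosody_weight=1.0)

print(stage1_words)
print(stage2_words)
```

With these made-up numbers the first pass prefers the single word "AB", while the prosodic term tips the rescored pass toward "A" followed by "B". That is the kind of segmentation change the error analysis attributes to prosodic information, although real systems combine many more score components and tune the weights on held-out data.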
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT070160268 http://hdl.handle.net/11536/75676 |
Appears in Collections: | Thesis |