Title: 一種韻律輔助中文語音辨認系統及其應用
A New Prosody-Assisted Mandarin ASR System and Its Application
Authors: 楊智合
Yang, Jyh-Her
陳信宏
廖元甫
Chen, Sin-Horng
Liao, Yuan-Fu
Institute of Communications Engineering
Keywords: Prosody-assisted ASR; Prosody modeling; Prosody-hierarchy structure
Issue Date: 2011
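Abstract: This dissertation proposes a new prosody-assisted Mandarin speech recognition system. Unlike earlier, simpler approaches, it uses a refined four-layer Mandarin prosody-hierarchy model to aid Mandarin speech recognition. Using previously developed prosody labeling and prosody modeling algorithms, 12 prosodic models are trained automatically from a large corpus that has no manual labels. These models are added to the automatic speech recognition system in a two-stage manner: the word lattice produced in the first stage by a conventional hidden Markov model (HMM) recognizer is rescored, which yields a more accurate recognized word sequence. In addition, the second stage simultaneously decodes several kinds of information, including part of speech (POS), punctuation marks (PM), and two types of prosodic tags used to construct the hierarchical prosodic structure of the test speech. The experiments use long read-speech utterances from the TCC300 corpus and introduce a factored language model, which describes the relationships among words, POS, and PM, as the baseline. With all prosodic information incorporated, the word, character, and syllable error rates were 20.1%, 13.6%, and 9.4%, respectively, which are absolute error reductions of 4.1%, 4.0%, and 2.4% over the baseline (relative reductions of 16.9%, 22.6%, and 20.6%). Analysis of the results shows that many of the recognition errors the system corrects stem from word segmentation errors and tone errors. As an application, we use this recognition method to build a new model-based Mandarin speech prosody coding system. At the encoder, the prosody-assisted speech recognizer produces linguistic parameters and prosodic tags from the input speech, which are then encoded. At the decoder, the linguistic parameters and prosodic tags are decoded and used to reconstruct syllable pitch contours, syllable durations, syllable energy levels, and inter-syllable pause durations; an HMM-based speech synthesizer then combines these with the spectral parameters of the speech to synthesize the speech signal. Experiments on the TCC300 corpus confirm that the synthesized speech retains high quality at a low data rate of 543 bits/sec.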
This dissertation presents a new prosody-assisted automatic speech recognition (ASR) system for Mandarin speech. It differs from the conventional approach of using simple prosodic cues in that it employs a sophisticated prosody modeling approach based on a 4-layer prosody-hierarchy structure: 12 prosodic models are generated automatically from a large unlabeled speech database by the previously proposed joint prosody labeling and modeling (PLM) algorithm. These 12 prosodic models are incorporated into a two-stage ASR system to rescore the word lattice generated in the first stage by a conventional hidden Markov model (HMM) recognizer, which yields a better recognized word string. In addition, other information can be decoded, including part of speech (POS), punctuation marks (PM), and two types of prosodic tags that can be used to construct the prosody-hierarchy structure of the test speech. Experimental results on the TCC300 database, which consists of long paragraph-level utterances, showed that the proposed system significantly outperformed the baseline scheme, an HMM recognizer with a factored language model that models word, POS, and PM. Word, character, and base-syllable error rates of 20.7%, 14.4%, and 9.6% were obtained, corresponding to absolute error reductions of 3.7%, 3.7%, and 2.4% (relative reductions of 15.2%, 20.4%, and 20%). An error analysis showed that many word segmentation errors and tone recognition errors were corrected. Building on the prosody-assisted ASR system, we apply it to speech coding and propose a new model-based Mandarin speech coding system. In the encoder, it employs the prosody-assisted ASR with the hierarchical prosodic model (HPM) to generate enriched transcriptions from the input speech, including linguistic features, prosodic tags, and spectral parameters.
These features are sent to the decoder, which first uses the HPM with the linguistic features and prosodic tags to reconstruct the prosodic-acoustic features (syllable pitch contour, syllable duration, syllable energy level, and inter-syllable pause duration), and then combines them with the spectral parameters in an HMM-based speech synthesizer to reconstruct the input speech signal. Experimental results show that the reconstructed speech has good quality at a low data rate of 543 bits/s.
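The relation between the absolute and relative error reductions quoted in the English abstract can be checked with a short calculation. The sketch below is illustrative only: it assumes each baseline error rate equals the reported rate plus the reported absolute reduction (the baseline rates themselves are not stated in this record), and the function name `relative_reduction` is hypothetical.

```python
def relative_reduction(final_rate, absolute_gain):
    """Return (inferred baseline, relative reduction in %).

    Assumption: baseline = final_rate + absolute_gain. The baseline
    is inferred here; it is not given in the abstract.
    """
    baseline = final_rate + absolute_gain
    return baseline, 100.0 * absolute_gain / baseline


# (final error rate %, absolute reduction %) from the English abstract
reported = {
    "word": (20.7, 3.7),
    "character": (14.4, 3.7),
    "base-syllable": (9.6, 2.4),
}

for unit, (final_rate, gain) in reported.items():
    baseline, rel = relative_reduction(final_rate, gain)
    print(f"{unit}: inferred baseline {baseline:.1f}%, "
          f"relative reduction {rel:.1f}%")
```

Under that assumption, the computed relative reductions round to the 15.2%, 20.4%, and 20% figures quoted in the abstract.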
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT079313824
http://hdl.handle.net/11536/40521
Appears in Collections: Theses


Files in This Item:

  1. 382401.pdf
