Title: | 以階層式韻律模型為基礎之中文半隱藏式馬可夫模型語音合成器 A HSMM-based Mandarin Speech Synthesizer Based on Hierarchical Prosody Model |
Authors: | 吳文良 陳信宏 Institute of Communications Engineering |
Keywords: | speech synthesis; prosody model; synthesis |
Issue Date: | 2011 |
Abstract: | This thesis introduces a hierarchical prosody model to further improve the performance of an HMM-based speech synthesizer. First, two types of prosodic tags derived from the prosody model, syllable-boundary break types and syllable prosody states, are applied to spectral model training. In the decision-tree clustering stage, these mid-level prosodic tags, which lie between high-level syntactic information and low-level syllable information, replace the conventional linguistic features used to cluster the context-dependent models. Because the tags are labeled using acoustic as well as linguistic information, they are expected to correlate with the spectrum more closely than linguistic features alone. Experiments confirm that the prosodic tags cluster better than linguistic features and yield a better spectral model.
Second, the thesis considers how to use the prosody model at synthesis time. Only text is available at that stage, whereas labeling the prosodic tags with the prosody model requires both acoustic and linguistic information. The thesis therefore proposes conditional random field (CRF) models that predict both types of prosodic tags from the input text. Because a CRF takes the entire observation sequence into account and learns the dependencies between neighboring output states, it should be well suited to estimating time-dependent parameters. Experimental results show that most transitions between predicted prosody states are consistent with the transition characteristics of the corresponding syllable-boundary breaks.
Finally, the trained spectral model, the prosody model, and the predicted prosodic tags are combined into a complete synthesis system, whose advantage is rich prosodic variation. However, prosodic break prediction is still not accurate enough, which degrades the naturalness of some of the synthesized speech; improving break prediction is left as future work. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT079813506 http://hdl.handle.net/11536/46989 |
Appears in Collections: | Thesis |