Title: 結合情緒合成之語音轉換系統 (Voice Conversion System Integrated with Emotional Speech Synthesis)
Authors: 李旻軒 (Lee, Min-Hsuan); 黃志方 (Huang, Chih-Fang); 成維華 (Chieng, Wei-Hua)
Department: Master Program of Sound and Music Innovative Technologies, College of Engineering
Keywords: Voice conversion; Emotional speech synthesis; Gaussian mixture model; Artificial neural network; Linear modification model
Issue Date: 2014
Abstract: Voice conversion (VC) is the process of transforming the utterance of a source speaker so that it sounds as if it were spoken by a specified target speaker. In this thesis, voice conversion involves a training phase and a transformation phase. In the training phase, features characterizing speaker identity are extracted from the source and target speakers individually, and the relationships between the two speakers' features are captured by a Gaussian mixture model (GMM) and an artificial neural network (ANN) separately. In the transformation phase, the features are extracted from the input speech and transformed by the ANN-based and the GMM-based conversion functions respectively. Spectral, excitation, and prosodic features are used in this thesis to represent speaker identity. To enhance the expressiveness of the VC system, an emotional speech synthesis module is also integrated into the transformation phase. In this module, a linear modification model (LMM) is adopted to modify the prosodic parameters, and the modified prosodic parameters are used to synthesize the emotional utterance. The evaluation results show that the ANN-based VC system outperforms the GMM-based one, and that most of the synthesized emotional utterances can be correctly identified. However, the synthesized utterances are still less natural than real speech; improving their naturalness is left for future work.
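The two stages described above can be sketched in code. The following is a minimal illustration under the common joint-density GMM formulation of voice conversion, together with a linear rescaling of a prosodic contour in the spirit of the LMM; it is not the thesis's actual implementation, and all function names, component counts, and scale/shift values are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_joint_gmm(src_feats, tgt_feats, n_components=4, seed=0):
    """Training phase: fit a GMM on joint [source; target] feature vectors."""
    joint = np.hstack([src_feats, tgt_feats])          # (frames, 2*d)
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full", random_state=seed)
    gmm.fit(joint)
    return gmm

def convert(gmm, x, d):
    """Transformation phase: minimum mean-square-error mapping E[y | x]."""
    means_x, means_y = gmm.means_[:, :d], gmm.means_[:, d:]
    cov_xx = gmm.covariances_[:, :d, :d]
    cov_yx = gmm.covariances_[:, d:, :d]
    # Posterior component responsibilities p(m | x).
    liks = np.array([gmm.weights_[m] * multivariate_normal.pdf(x, means_x[m], cov_xx[m])
                     for m in range(gmm.n_components)])
    post = liks / liks.sum(axis=0)
    y = np.zeros((x.shape[0], d))
    for m in range(gmm.n_components):
        # mu_y + Sigma_yx Sigma_xx^{-1} (x - mu_x), per component.
        delta = np.linalg.solve(cov_xx[m], (x - means_x[m]).T).T
        y += post[m][:, None] * (means_y[m] + delta @ cov_yx[m].T)
    return y

def lmm_modify(contour, scale, shift):
    """Linear modification model: linearly rescale a prosodic contour
    (e.g. an F0 trajectory) toward an emotion-specific target."""
    return scale * np.asarray(contour, dtype=float) + shift
```

In a full system the converted spectral features would drive a vocoder, and the LMM scale/shift values would come from emotion-specific rules rather than being fixed constants.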
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT070151905
http://hdl.handle.net/11536/76309
Appears in Collections: Thesis