Title: Prosody Hierarchy Construction for Mixed Chinese-English Spelling Speech and its Application to TTS
Original title: 中英夾雜語音之階層式韻律架構建立與語音合成之應用
Authors: Tsai, Cheng-Yeh (蔡承燁)
Chen, Sin-Horng
Keywords: mixed Chinese-English; prosody model; speech synthesis; prosody labeling
Issue date: 2010
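Abstract: This thesis addresses mixed Chinese-English sentences, that is, sentences that are primarily Chinese but contain English letters. Using the relationships between linguistic features and acoustic features, it builds a prosody model for mixed Chinese-English speech and performs automatic prosody labeling. Two kinds of prosodic tags are labeled: break tags and prosodic states. Break tags mark the boundaries of prosodic units, while the sequence of prosodic states represents the variation of higher-level prosodic units. By analyzing the trained model parameters, the thesis examines the relationships among break tags, acoustic features, linguistic features, and higher-level prosodic states. The experimental results show that the higher-level prosodic state of English letters rises and falls with the prosodic variation of the overall Chinese utterance, whereas the break tags show stronger prosodic breaks at code-switch points. The prosodic hierarchy of noun phrases is also found to be highly correlated with their syntactic structure. Finally, two prosody generation methods based on this model are proposed. The first predicts break tags to produce contextual information on the prosodic hierarchy and then generates prosodic parameters with HTS. The second applies the prosody model directly to predict prosodic parameters. Objective evaluation shows that the first method improves on the prosodic parameters generated by conventional HTS, while the second is notably effective at predicting syllable duration. Subjective evaluation also shows that the first method achieves the best perceived naturalness, indicating that the break tags predicted in this study capture more natural variation in prosodic rhythm.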
In this thesis, an unsupervised joint prosody labeling and modeling (PLM) method for mixed Chinese-English word spelling speech is proposed. It labels an unlabeled corpus with two types of prosodic tags (the break type of each inter-syllable juncture and the prosodic state of each syllable) and builds four prosodic models simultaneously. The break tags delimit the prosodic constituents of a hierarchical prosody structure, and the prosodic states are used to construct the prosodic feature patterns of those constituents. The four prosodic models describe the relationships among the acoustic prosodic features, the prosodic tags of utterances, and the linguistic features of the associated texts. The experimental results showed that prosodic variation in English word spelling was influenced both by the prosodic state, which describes the underlying intonation, and by a Chinese tone borrowing effect. The relationship between hierarchical noun phrase structure and the corresponding break types was also analyzed; the analysis suggested that the magnitude of the break type was highly correlated with the syntactic hierarchy within a noun phrase. Lastly, we propose two PLM-based prosody generation methods for a mixed Chinese-English word spelling Text-to-Speech (TTS) system. In the first method, a break predictor is constructed by the CART method, and the related linguistic features together with the predicted break tags are then used for HMM-based Text-to-Speech system (HTS) training. In the second method, PLM is used directly as a prosody generator. Experimental results confirmed that the first proposed method was superior, in both objective and subjective tests, to the conventional HTS that uses only linguistic features. In addition, the second proposed method was significantly better than the conventional HTS method at syllable duration prediction. We therefore conclude that the proposed PLM method was successful in prosody labeling and modeling for constructing a mixed Chinese-English word spelling TTS.
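The way break tags delimit nested prosodic constituents can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the integer break strengths, the function name `segment`, and the example utterance are all hypothetical, chosen only to show how cutting a syllable sequence at progressively weaker breaks could yield a hierarchy of units.

```python
from typing import List, Sequence


def segment(syllables: Sequence[str],
            breaks: Sequence[int],
            min_strength: int) -> List[List[str]]:
    """Split a syllable sequence into units at sufficiently strong breaks.

    breaks[i] is a hypothetical integer strength for the juncture between
    syllables[i] and syllables[i + 1]; a unit boundary is placed wherever
    the strength is at least min_strength.
    """
    if not syllables:
        return []
    if len(breaks) != len(syllables) - 1:
        raise ValueError("need exactly one break tag per inter-syllable juncture")

    units: List[List[str]] = []
    current = [syllables[0]]
    for syllable, strength in zip(syllables[1:], breaks):
        if strength >= min_strength:
            # Strong enough juncture: close the current unit, start a new one.
            units.append(current)
            current = [syllable]
        else:
            current.append(syllable)
    units.append(current)
    return units


# Illustrative data only (not taken from the thesis corpus): a Chinese
# sentence with an embedded spelled English word "MSN".
syllables = ["wo", "yong", "M", "S", "N", "liao", "tian"]
breaks = [0, 2, 1, 1, 2, 0]  # assumed strengths: 0 = none, 1 = minor, 2 = major

higher_level_units = segment(syllables, breaks, min_strength=2)
lower_level_units = segment(syllables, breaks, min_strength=1)
```

Because every juncture that qualifies at `min_strength=2` also qualifies at `min_strength=1`, the lower-level units always nest inside the higher-level ones, which is the sense in which a single sequence of break tags can encode a hierarchy.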



