Title: An Improvement on the Mandarin Text-to-Speech System (中文文句翻語音系統之改進)
Authors: 盧鵬任
Lu, Peng-Ren
Sin-Horng Chen
Keywords: Speech synthesis; Text-to-Speech; TTS
Publication date: 1996
Abstract: This thesis improves a Mandarin text-to-speech (TTS) system developed previously in the Speech Processing Lab of NCTU. The system consists of four main parts: a text analyzer, an RNN-based prosodic information generator, a waveform table of 417 base syllables, and a PSOLA synthesizer. Input text is first parsed by the text analyzer, and linguistic features are extracted from its output. The RNN prosody generator uses these features to generate the prosodic information, while the corresponding waveform template sequence is retrieved from the waveform table. Finally, the PSOLA synthesizer produces the output speech by adjusting the prosody of the waveform template sequence.

Several aspects of the system are improved in this study. First, the lexicon of the text analyzer is extended from 80,000 to 110,000 words, which greatly increases its coverage. A word pronunciation tree (詞典樹, a tree-structured lexicon) is constructed to speed up text analysis, and some simple word-formation rules (構詞法則; "phonological rules" in the original English abstract) are added to assist it. Second, the number of part-of-speech (POS) types used by the RNN prosody generator is reduced from 44 to 22, which lowers its computational complexity without degrading the naturalness of the synthesized speech. Third, the waveform table of 417 base syllables is now produced from recordings of isolated syllables. This improves the quality of the synthesized speech and greatly simplifies the process of adding a new speaker's voice to the system.

Finally, the system is moved from DOS to Windows 95, and its software architecture is restructured as a dynamic library, which makes new applications easier to develop.
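The abstract does not describe how the lexicon tree is laid out or searched. As a rough illustration only, the sketch below shows one common way a tree-structured lexicon can speed up lookup: a character trie that maps each word to a pronunciation string and returns the longest entry starting at a given position. The class name `LexiconTrie`, the function `segment`, the greedy longest-match strategy, and the three sample entries are all assumptions made for this example and are not taken from the thesis.

```python
class LexiconTrie:
    """Character trie mapping words to a pronunciation string (hypothetical layout)."""

    def __init__(self):
        self.root = {}

    def add(self, word, pronunciation):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = pronunciation  # the None key marks the end of a word

    def longest_match(self, text, start=0):
        """Return (word, pronunciation) for the longest entry starting at `start`, or None."""
        node = self.root
        best = None
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break  # no lexicon word continues with this character
            if None in node:
                best = (text[start:i + 1], node[None])
        return best


def segment(text, trie):
    """Greedy longest-match segmentation (an assumed strategy, not the thesis's).

    Characters not covered by the lexicon become single-character tokens
    with pronunciation None.
    """
    tokens = []
    i = 0
    while i < len(text):
        hit = trie.longest_match(text, i)
        if hit is None:
            tokens.append((text[i], None))
            i += 1
        else:
            tokens.append(hit)
            i += len(hit[0])
    return tokens


# Illustrative entries only; the real lexicon holds about 110,000 words.
trie = LexiconTrie()
trie.add("語音", "yu3 yin1")
trie.add("語音合成", "yu3 yin1 he2 cheng2")
trie.add("系統", "xi4 tong3")

for word, pron in segment("語音合成的系統", trie):
    print(word, pron)
```

With a trie, each lookup walks at most as many nodes as the longest word has characters, regardless of how many words the lexicon contains, which is one plausible reason a tree structure would shorten text processing as the lexicon grows.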