Title: | An Improvement on the Mandarin Text-to-Speech System (中文文句翻语音系统之改进) |
Authors: | Lu, Peng-Ren (卢鹏任); Chen, Sin-Horng (陈信宏); Institute of Communications Engineering |
Keywords: | Speech synthesis; Text-to-Speech; TTS |
Issue Date: | 1996 |
Abstract: | This thesis improves the Mandarin text-to-speech (TTS) system previously developed in the Speech Processing Lab of National Chiao Tung University (NCTU). The system consists of four main parts: a text analyzer, an RNN-based prosodic information generator, a waveform table of 417 base-syllables, and a PSOLA synthesizer. Input text is first parsed by the text analyzer, which extracts linguistic features. The RNN prosody generator then maps these features to the corresponding prosodic parameters, while the matching sequence of waveform templates is retrieved from the waveform table. The PSOLA synthesizer then generates the output speech by adjusting the prosody of the waveform template sequence according to the prosodic and linguistic parameters. The system is improved in several respects. First, the lexicon of the text analyzer is extended from 80,000 to 110,000 words, which greatly increases its coverage. A word pronunciation tree (lexicon tree) is then constructed to speed up text analysis, and simple word-formation (morphological) rules are incorporated to assist the analysis. Next, the number of part-of-speech (POS) categories used by the RNN prosody generator is reduced from 44 to 22, lowering its computational complexity without degrading the naturalness of the synthesized speech. A new method is also proposed for building the 417-base-syllable waveform table from recordings of isolated syllables; it improves the quality of the synthesized speech and greatly simplifies the processing needed to add a new speaker's voice to the system.
Finally, the system is ported from DOS to Windows 95, and its software architecture is restructured as a dynamic link library (DLL), which makes it easier to develop new applications on top of it. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#NT850436005 http://hdl.handle.net/11536/62077 |
Appears in Collections: | Thesis |
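The abstract states that a word pronunciation tree is built to speed up text analysis but does not describe its structure. The Python sketch below assumes it is a character trie whose word-final nodes store pronunciations, combined with greedy forward maximum matching for word segmentation. This is a common technique and may differ from the thesis's actual algorithm; the names `TrieNode`, `LexiconTree` and `segment`, and the toned-pinyin lexicon entries, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TrieNode:
    # Children keyed by the next character of a word.
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    # Pronunciation, set only if a lexicon word ends at this node.
    pron: Optional[str] = None


class LexiconTree:
    """Hypothetical character trie over a word lexicon."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str, pron: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.pron = pron

    def longest_match(self, text: str, start: int) -> Tuple[int, Optional[str]]:
        """Length and pronunciation of the longest lexicon word at text[start:]."""
        node = self.root
        best_len, best_pron = 0, None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break  # no lexicon word continues with this character
            if node.pron is not None:
                best_len, best_pron = i - start + 1, node.pron
        return best_len, best_pron


def segment(text: str, lex: LexiconTree) -> List[Tuple[str, Optional[str]]]:
    """Greedy forward maximum matching segmentation."""
    words = []
    i = 0
    while i < len(text):
        length, pron = lex.longest_match(text, i)
        if length == 0:
            length = 1  # out-of-vocabulary character: emit it alone, without pronunciation
        words.append((text[i:i + length], pron))
        i += length
    return words


lex = LexiconTree()
for word, pron in [("语音", "yu3 yin1"), ("合成", "he2 cheng2"),
                   ("语音合成", "yu3 yin1 he2 cheng2"), ("系统", "xi4 tong3")]:
    lex.insert(word, pron)
print(segment("语音合成系统", lex))
# [('语音合成', 'yu3 yin1 he2 cheng2'), ('系统', 'xi4 tong3')]
```

Each lookup walks the trie once from the current position, so its cost is bounded by the length of the longest matching prefix rather than by the size of the lexicon.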
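The abstract only names the PSOLA synthesizer. As a rough illustration of how the pitch of a base-syllable template can be changed, the sketch below implements basic time-domain PSOLA (TD-PSOLA) pitch scaling in Python. It assumes the analysis pitch marks of a voiced segment are already known, keeps the original duration, and omits duration and energy modification and unvoiced-segment handling, all of which a full synthesizer would need. The function names are hypothetical.

```python
import math
from typing import List


def hann(n: int) -> List[float]:
    """Symmetric Hann window of length n (n >= 2), zero at both ends."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def local_period(marks: List[int], j: int) -> int:
    """Pitch period (in samples) around analysis mark j."""
    if j + 1 < len(marks):
        return marks[j + 1] - marks[j]
    return marks[j] - marks[j - 1]


def td_psola_pitch(x: List[float], marks: List[int], factor: float) -> List[float]:
    """Scale the pitch of a voiced segment by `factor` while keeping its duration.

    x      : waveform samples, e.g. one voiced base-syllable template
    marks  : strictly ascending analysis pitch marks (sample indices), at least 2
    factor : pitch scaling factor (> 1 raises pitch, < 1 lowers it)
    """
    y = [0.0] * len(x)
    t = float(marks[0])  # current synthesis pitch mark
    while t <= marks[-1]:
        # Reuse the analysis pitch mark closest in time to this synthesis mark.
        j = min(range(len(marks)), key=lambda i: abs(marks[i] - t))
        p = local_period(marks, j)
        w = hann(2 * p + 1)
        center = round(t)
        # Overlap-add a two-period, Hann-windowed grain taken around marks[j] at t.
        for k in range(-p, p + 1):
            src, dst = marks[j] + k, center + k
            if 0 <= src < len(x) and 0 <= dst < len(y):
                y[dst] += x[src] * w[k + p]
        # Synthesis marks are spaced by the local period divided by the factor.
        t += p / factor
    return y


# Example: an impulse train with a 100-sample period, raised one octave.
x = [0.0] * 1000
marks = list(range(100, 901, 100))
for m in marks:
    x[m] = 1.0
y = td_psola_pitch(x, marks, 2.0)  # pulses of y now occur every 50 samples, 100 to 900
```

Because each grain is a windowed two-period segment, placing the grains closer together (factor > 1) raises the pitch and spacing them further apart lowers it, while the spectral envelope carried within each grain is largely preserved.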