Title: 中文文句翻语音系统之改进
An Improvement on the Mandarin Text-to-Speech System
Authors: 卢鹏任
Lu, Peng-Ren
陈信宏
Sin-Horng Chen
Institute of Communications Engineering
Keywords: speech synthesis; Text-to-Speech; TTS
Issue Date: 1996
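Abstract: This thesis improves the Mandarin text-to-speech system
previously developed at the Speech Signal Processing Laboratory of
National Chiao Tung University. The system has four main parts: a
text analyzer, an RNN prosodic-information generator, a waveform
table of 417 base syllables, and a PSOLA speech synthesizer. The
text analyzer parses the input text and extracts linguistic
parameters; the prosody generator derives the corresponding
prosodic parameters from them; finally, the PSOLA synthesizer
produces the desired speech waveform from the prosodic and
linguistic parameters. In this study we made a number of
improvements. First, we enlarged the lexicon from 80,000 to
110,000 words, built a lexicon tree to speed up text processing,
and added simple word-formation rules to assist text analysis.
For the prosodic-information generator, to lower computational
complexity, we reduced the number of part-of-speech classes from
44 to 22 without affecting the naturalness of the synthesized
speech. The waveform table of 417 base syllables was obtained by
recording isolated syllables, which both simplifies processing and
yields better sound quality. Finally, we moved the system from DOS
to Windows 95 and restructured it as a dynamic-link library to
ease the development of new applications.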
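The abstract does not describe how the lexicon tree is laid out or searched. As a minimal sketch of why a tree speeds up lexicon lookup, the Python below assumes a character-level trie combined with greedy forward maximum matching; the class name `LexiconTrie`, the function `segment`, and the four sample entries are hypothetical and are not taken from the thesis.

```python
class LexiconTrie:
    """Character-level trie over lexicon entries (assumed structure, not the thesis's)."""

    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True  # the None key marks the end of a lexicon entry

    def longest_match(self, text, start):
        """Return the end index of the longest entry beginning at `start`.

        Returns `start` itself when no entry begins there.
        """
        node = self.root
        best = start
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break  # no entry continues with this character
            if None in node:
                best = i + 1  # a complete entry ends here; keep looking for a longer one
        return best


def segment(text, trie):
    """Greedy forward maximum matching; unknown characters become one-character tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        end = trie.longest_match(text, pos)
        if end == pos:
            end = pos + 1  # fall back to a single character
        tokens.append(text[pos:end])
        pos = end
    return tokens


# Toy lexicon for illustration only.
trie = LexiconTrie()
for w in ["语音", "语音合成", "合成", "系统"]:
    trie.add(w)

tokens = segment("语音合成系统", trie)
print(tokens)
```

With a trie, each lookup walks at most as many nodes as the longest entry has characters, regardless of whether the lexicon holds 80,000 or 110,000 words, which is one plausible reason a tree would help after the lexicon was enlarged.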
This thesis improves a Mandarin TTS system previously developed
in the Speech Processing Lab of NCTU. The system consists of four
main parts: a text analyzer, an RNN-based prosodic information
generator, a waveform table of 417 base syllables, and a PSOLA
synthesizer. Input text is first analyzed by the text analyzer.
The RNN prosody generator then generates prosodic information
from linguistic features extracted from the output of text
analysis. Meanwhile, the corresponding waveform template sequence
is extracted from the waveform table. Lastly, the PSOLA
synthesizer generates the synthesized speech by adjusting the
prosody of the waveform template sequence. In this study, several
aspects of the system are improved. First, the lexicon of the
text analyzer is extended from 80,000 to 110,000 words, which
greatly increases its coverage. Then, a word pronunciation tree
is constructed to speed up text analysis. Some simple
phonological rules are also incorporated into the text analyzer.
The number of POS types used in the RNN prosody generator is
reduced from 44 to 22 to lower its computational complexity
without degrading the naturalness of the synthesized speech. A
new method of producing the waveform table of 417 base syllables
from utterances of isolated syllables is also proposed. This not
only improves the quality of the synthesized speech but also
greatly simplifies the process of adding a new speaker's speech
to the system. Lastly, the operating environment is changed from
DOS to Windows 95, and the software architecture is changed to a
dynamic library form, which makes developing new applications
easier.
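The data flow among the four parts can be sketched as follows. This is only an illustration of the stage ordering stated in the abstract: the names `Syllable`, `Prosody`, and `synthesize`, the fields they carry, and all `toy_*` stand-ins are hypothetical. In particular, `toy_modify` merely truncates a template and is not PSOLA, and `toy_prosody` is not an RNN.

```python
from dataclasses import dataclass


@dataclass
class Syllable:
    base_id: int    # index into the base-syllable waveform table (assumed field)
    pos_class: int  # coarse part-of-speech class of the containing word (assumed field)


@dataclass
class Prosody:
    # Placeholder parameters; the abstract does not list the actual prosodic features.
    pitch_scale: float
    duration_scale: float


def synthesize(text, analyze, predict_prosody, waveform_table, modify):
    """Run the four stages in the order given in the abstract."""
    syllables = analyze(text)                                   # 1. text analyzer
    prosody = predict_prosody(syllables)                        # 2. prosody generator
    templates = [waveform_table[s.base_id] for s in syllables]  # 3. waveform table lookup
    samples = []
    for template, p in zip(templates, prosody):
        samples.extend(modify(template, p))                     # 4. prosody modification
    return samples


# Toy stand-ins so the sketch runs end to end.
def toy_analyze(text):
    return [Syllable(base_id=i % 3, pos_class=0) for i, _ in enumerate(text)]


def toy_prosody(syllables):
    return [Prosody(pitch_scale=1.0, duration_scale=1.0) for _ in syllables]


def toy_modify(template, prosody):
    n = max(1, round(len(template) * prosody.duration_scale))
    return list(template[:n])


table = {0: [0.0, 0.1], 1: [0.2, 0.3], 2: [0.4, 0.5]}
samples = synthesize("你好", toy_analyze, toy_prosody, table, toy_modify)
print(samples)
```

Passing the stages in as callables mirrors the idea of packaging the system as a library: an application only calls one entry point and does not need to know how each stage is implemented.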
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT850436005
http://hdl.handle.net/11536/62077
Appears in Collections: Thesis