Title: 中文文句翻語音系統之改進
An Improvement on the Mandarin Text-to-Speech System
Authors: 盧鵬任
Lu, Peng-Ren
陳信宏
Sin-Horng Chen
Institute of Communication Engineering
Keywords: Speech synthesis; Text-to-Speech; TTS
Issue Date: 1996
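Abstract: This thesis improves the Mandarin text-to-speech system
previously developed by the Speech Signal Processing Laboratory
at National Chiao Tung University. The system consists of four
main parts: a text analyzer, an RNN prosodic information
generator, a waveform table of 417 base syllables, and a PSOLA
speech synthesizer. The text analyzer parses the input text and
extracts linguistic parameters; the prosody generator derives the
corresponding prosodic parameters from these linguistic
parameters; finally, the PSOLA synthesizer synthesizes the desired
speech waveform from the prosodic and linguistic parameters. In
this study we made a number of improvements to the system. First,
we enlarged the lexicon from 80,000 to 110,000 words, built a
lexicon tree to speed up text processing, and added simple
word-formation rules to assist text analysis. For the prosodic
information generator, to reduce computational complexity, we
reduced the number of part-of-speech classes from 44 to 22 without
affecting the naturalness of the synthesized speech. The waveform
table of 417 base syllables was obtained by recording isolated
syllables, which both simplifies processing and yields better
sound quality. Finally, we ported the system from DOS to Windows
95 and restructured it as a dynamic library to ease the
development of new applications.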
This thesis improves a Mandarin TTS system previously developed
in the Speech Processing Lab of NCTU. The system consists of four
main parts: a text analyzer, an RNN-based prosodic information
generator, a waveform table of 417 base-syllables, and a PSOLA
synthesizer. Input texts are first analyzed by the text analyzer.
The RNN prosody generator then generates the prosodic information
from linguistic features extracted from the output of text
analysis. Meanwhile, the corresponding waveform template sequence
is extracted from the waveform table. Lastly, the PSOLA
synthesizer generates the output synthesized speech by adjusting
the prosody of the waveform template sequence. In this study, we
improve the system in several respects. We first
extend the lexicon size of the text analyzer from 80,000 words
to 110,000 words, which greatly increases the coverage of the
lexicon. Then, a word pronunciation tree is constructed to
speed up the text-analysis process. Some simple word-formation
rules are also incorporated into the text analyzer. The number
of POS types used in the RNN prosody generator is then reduced
from 44 to 22 to lower its computational complexity without
degrading the naturalness of the synthesized speech. Next, a new
method of producing the waveform table of 417 base-syllables from
utterances of isolated syllables is proposed. This not only
improves the quality of the synthesized speech but also greatly
simplifies the process of adding a new speaker's speech to the
system. Lastly, we move the system's operating environment from
DOS to Windows 95 and restructure the software as a dynamic
library, which makes developing new applications easier.
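
The abstract states that a tree built over the lexicon speeds up
text analysis but gives no implementation details. The following
is a minimal illustrative sketch in Python, not the thesis's
method: it assumes a character-level trie and greedy forward
maximum matching for word lookup. The names LexiconTrie,
longest_match, and segment are hypothetical, and the sample
lexicon is invented for the example. The tree described in the
abstract would also carry pronunciations, which this sketch omits.

```python
class LexiconTrie:
    """Character-level trie over lexicon entries (hypothetical structure)."""

    def __init__(self):
        self.children = {}    # maps one character to the next trie node
        self.is_word = False  # True if the path from the root spells an entry

    def insert(self, word):
        """Add one lexicon entry to the trie."""
        node = self
        for ch in word:
            node = node.children.setdefault(ch, LexiconTrie())
        node.is_word = True

    def longest_match(self, text, start):
        """Length of the longest entry beginning at text[start], or 0 if none."""
        node = self
        best = 0
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                best = i - start + 1
        return best


def segment(text, trie):
    """Greedy forward maximum matching (an assumed strategy).

    Characters that start no lexicon entry become one-character tokens.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        length = trie.longest_match(text, pos) or 1
        tokens.append(text[pos:pos + length])
        pos += length
    return tokens


# Usage with a tiny invented lexicon.
trie = LexiconTrie()
for entry in ["中文", "語音", "語音合成", "系統"]:
    trie.insert(entry)
print(segment("中文語音合成系統", trie))
```

With a trie, each lookup walks at most as many nodes as the
longest entry has characters, regardless of whether the lexicon
holds 80,000 or 110,000 words. That is one plausible reason a
tree structure would help as the lexicon grows.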
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT850436005
http://hdl.handle.net/11536/62077
Appears in Collections: Thesis