Title: 基于多频带分析与类神经模糊网路的噪音下语音切割与补偿
Noisy Speech Segmentation/Enhancement with Multiband Analysis and Neural Fuzzy Networks
Authors: 吴俊德
Gin-Der Wu
林进灯
Chin-Teng Lin
Institute of Electrical and Control Engineering
Keywords: speech segmentation; speech enhancement; multiband; mel-scale filter bank; spectrum analysis; word boundary detection; self-constructing neural fuzzy inference network; recurrent self-organizing neural fuzzy inference network
Issue Date: 1999
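Abstract: This thesis addresses the degradation of speech segmentation and enhancement caused by background noise. First, we propose a new speech segmentation method (the ATF-based SONFIN algorithm) for environments with a fixed noise level. The method includes our proposed adaptive time-frequency (ATF) parameter, which extracts the important time-domain and spectral features of noisy speech simultaneously. The ATF parameter extends the TF parameter proposed by Junqua et al. from single-band to multiband spectrum analysis, and this multiband analysis helps isolate the speech signal under background noise. The ATF parameter uses the mel-scale frequency bank to adaptively select appropriate bands and produce useful spectral information. It raises the recognition rate of the TF-based robust algorithm by about 3%; earlier studies regarded that TF-based robust algorithm as the most effective algorithm for segmenting noisy speech. The ATF parameter also lowers the recognition error rate caused by speech segmentation to 20%. Building on the ATF parameter, we further propose a speech segmentation algorithm that uses a feedforward self-constructing neural fuzzy inference network (SONFIN) to locate speech signals correctly under noise. Because SONFIN is self-learning, the ATF-based SONFIN algorithm avoids choosing thresholds and ambiguous segmentation rules by trial and error. Compared with ordinary neural networks, SONFIN consistently learns quickly and finds its own most economical network size. Our experiments also show that SONFIN's performance is not noticeably affected by the size of the training data. The ATF-based SONFIN algorithm achieves a recognition rate about 5% higher than the TF-based robust algorithm. It also lowers the recognition error rate caused by speech segmentation to 10%, compared with about 30% for the TF-based robust algorithm and 50% for the modified speech segmentation algorithm of Lamel et al.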
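Speech segmentation algorithms in common use always assume a fixed background noise level. In practice, the background noise level may change abruptly during recording. The speech signal is further disturbed by nonstationary environmental noise such as movement, engine running, speed changes, braking, and impacts. This is why ordinary speech segmentation algorithms do not work well when the background noise level varies. To solve this problem, we propose the minimum mel-scale frequency band (MiMSB) parameter, which estimates the varying background noise level by adaptively selecting the minimum-energy band from the multiband analysis of the mel-scale frequency bank. With the MiMSB parameter, some of the preset thresholds used for segmentation decisions are no longer fixed over the whole recording; they are adjusted according to the MiMSB parameter. We also propose the enhanced time-frequency (ETF) parameter, which likewise uses multiband analysis to extract useful spectral information in environments with a varying noise level. Based on the MiMSB and ETF parameters, we finally propose a new speech segmentation algorithm (the MiMSB-ETF-based algorithm) for environments with a varying noise level. Under a varying background noise level, the MiMSB-ETF-based algorithm achieves a recognition rate about 5% higher than the TF-based robust algorithm. It also lowers the recognition error rate caused by speech segmentation to 25%, compared with an average of 34% for the TF-based robust algorithm.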
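In addition, we propose the refined time-frequency (RTF) parameter to improve on the ETF parameter. The RTF parameter extracts more useful spectral information than the ETF parameter. Based on the RTF parameter, we further propose using a recurrent self-constructing neural fuzzy inference network (RSONFIN) for speech segmentation. Because RSONFIN can handle temporal relations, the RTF-based RSONFIN algorithm can detect changes in the background noise level and segment speech correctly when that level varies. Compared with ordinary neural networks, RSONFIN consistently learns quickly and finds its own most economical network size. Because RSONFIN is self-learning, the RTF-based RSONFIN algorithm avoids the trial-and-error choice of ambiguous segmentation rules required by ordinary speech segmentation algorithms. Experiments with white noise at a varying background noise level show that this speech segmentation algorithm achieves a recognition rate about 12% higher than the TF-based robust algorithm.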
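Single-channel subtractive-type speech enhancement algorithms in common use always assume that the background noise level is fixed or changes slowly. In practice, the background noise level may change quickly. This often causes incorrect speech segmentation, which in turn causes an incorrect speech enhancement procedure. To solve this problem, we propose a new speech enhancement procedure. The new procedure uses our previously proposed RTF-based RSONFIN algorithm, which segments speech correctly under a varying background noise level. In addition, we propose a new MiFre parameter to improve on the MiMSB parameter; it extracts more useful background noise level information than MiMSB. With the MiFre parameter, the noise level information used by the new subtractive-type speech enhancement procedure can be estimated not only when no speech is present but also while speech is present. In tests under a varying background noise level, the new speech enhancement procedure performs better than the conventional speech enhancement procedure.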
This thesis addresses the problem that background noise acoustically added to speech degrades the performance of speech segmentation and enhancement, and it develops new methods to improve both. First, we proposed a new speech segmentation method (the ATF-based SONFIN algorithm) for fixed noise-level environments. This method contains an adaptive time-frequency (ATF) parameter that extracts both the time and frequency features of noisy speech signals. The ATF parameter extends the TF parameter proposed by Junqua et al. from single-band to multiband spectrum analysis, where the frequency bands help to make the distinction between speech and noise clear. The ATF parameter extracts useful frequency information by adaptively choosing proper bands of the mel-scale frequency bank. It increased the recognition rate by about 3% over a TF-based robust algorithm, which has been shown to outperform several commonly used algorithms for word boundary detection in the presence of noise. The ATF parameter also reduced the recognition error rate due to endpoint detection to about 20%. Based on the ATF parameter, we further proposed a new word boundary detection algorithm that uses a self-constructing neural fuzzy inference network (SONFIN) to identify islands of word signals in noisy environments. Owing to the self-learning ability of SONFIN, this ATF-based SONFIN algorithm avoids the need to determine thresholds and ambiguous rules empirically, as normal word boundary detection algorithms require. Compared with normal neural networks, SONFIN can always find an economical network size for itself at high learning speed. Our results also showed that SONFIN's performance is not significantly affected by the size of the training set. The ATF-based SONFIN algorithm achieved a recognition rate about 5% higher than the TF-based robust algorithm. It also reduced the recognition error rate due to endpoint detection to about 10%, compared with an average of approximately 30% obtained with the TF-based robust algorithm and 50% obtained with the modified version of the Lamel et al. algorithm.
Commonly used robust word boundary detection algorithms always assume that the background noise level is fixed. In fact, the background noise level may vary during recording. The speech signal is further complicated by nonstationary backgrounds, where concurrent noises may arise from movements, engine running, speed changes, braking, slams, etc. This is the major reason that most robust word boundary detection algorithms do not work well when the background noise level varies. To solve this problem, we proposed a minimum mel-scale frequency band (MiMSB) parameter, which estimates the varying background noise level by adaptively choosing the band with minimum energy from the mel-scale frequency bank. With the MiMSB parameter, some preset thresholds used to find the boundaries of the word signal are no longer fixed over the whole recording interval; they are tuned according to the MiMSB parameter. We also proposed an enhanced time-frequency (ETF) parameter, which extracts useful frequency information from multiband analysis in variable noise-level environments. Based on the MiMSB and ETF parameters, we finally proposed a new method (the MiMSB-ETF-based algorithm) for word boundary detection in variable noise-level environments. Under a variable background noise level, this MiMSB-ETF-based algorithm achieved a recognition rate about 5% higher than the TF-based robust algorithm. It also reduced the recognition error rate due to endpoint detection to 25%, compared with an average of 34% obtained with the TF-based robust algorithm.
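As a rough illustration of the minimum-band idea described above, the following Python sketch assumes each frame has already been reduced to a list of mel-scale band energies. The function names, the log-energy form, the per-frame selection, and the fixed threshold offset are all illustrative assumptions; the abstract does not give the thesis's actual formulas.

```python
import math

FLOOR = 1e-12  # avoids log10(0) on silent bands


def mimsb(band_energies):
    """Return the log energy of the weakest mel band in one frame.

    Assumption: the quietest band is dominated by background noise,
    so its energy tracks the current noise level.
    """
    return min(math.log10(max(e, FLOOR)) for e in band_energies)


def adaptive_thresholds(frames, offset=0.5):
    """Per-frame threshold = noise estimate + fixed offset (hypothetical rule)."""
    return [mimsb(frame) + offset for frame in frames]


def detect_speech(frames, offset=0.5):
    """Flag a frame as speech when its strongest band exceeds the threshold."""
    flags = []
    for frame, threshold in zip(frames, adaptive_thresholds(frames, offset)):
        peak = math.log10(max(max(frame), FLOOR))
        flags.append(peak > threshold)
    return flags


# Noise level jumps from 1 to 10 in the second frame; only the third
# frame contains a band that stands out above the tracked noise level.
frames = [
    [1.0, 1.0, 1.0, 1.0],
    [10.0, 10.0, 10.0, 10.0],
    [10.0, 1000.0, 10.0, 10.0],
    [1.0, 1.0, 1.0, 1.0],
]
print(detect_speech(frames))
```

Because the threshold follows the minimum-band estimate, the rise in noise level in the second frame does not trigger a detection on its own, which a single fixed threshold tuned for the first frame would do.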
In addition, we proposed a refined time-frequency (RTF) parameter to improve on the ETF parameter. The RTF parameter extracts more useful frequency information than the ETF parameter. Based on the RTF parameter, we further proposed a new word boundary detection algorithm that uses a recurrent self-organizing neural fuzzy inference network (RSONFIN). Since RSONFIN can process temporal relations, the proposed RTF-based RSONFIN algorithm can track the variation of the background noise level and detect correct word boundaries when that level varies. Compared with normal neural networks, RSONFIN can always find an economical network size for itself with high learning speed. Owing to the self-learning ability of RSONFIN, the RTF-based RSONFIN algorithm avoids the need to determine ambiguous decision rules empirically, as normal word boundary detection algorithms require. Experiments in white noise show that this new algorithm achieves a recognition rate about 12% higher than the TF-based robust algorithm under a variable background noise level.
Commonly used single-channel subtractive-type speech enhancement algorithms always assume that the background noise level is fixed or slowly varying. In fact, the background noise level may vary quickly. This usually leads to wrong speech/noise detection, which in turn leads to a wrong speech enhancement process. To solve this problem, we proposed a new subtractive-type speech enhancement scheme. The new scheme uses our previously developed RTF-based RSONFIN algorithm to detect word boundaries under a variable background noise level. In addition, a new parameter (MiFre) is used to extract more useful background noise level information than the MiMSB parameter. Based on the MiFre parameter, the noise level information used by the new subtractive-type speech enhancement scheme can be estimated not only during speech pauses but also during speech segments. Finally, the new subtractive-type speech enhancement scheme was tested and found to perform well under both variable and fixed background noise levels.
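The subtractive step itself can be sketched in a few lines of Python. This is a generic magnitude spectral subtraction with a spectral floor, not the thesis's scheme: the MiFre parameter is not specified in this abstract, so the per-frame noise estimate is simply passed in, and the function names and the `alpha` and `beta` defaults are hypothetical.

```python
def spectral_subtract(noisy_mag, noise_mag, alpha=1.0, beta=0.02):
    """Subtract a noise magnitude estimate from one frame's magnitude spectrum.

    alpha is an over-subtraction factor and beta a spectral floor that keeps
    every bin non-negative (both are hypothetical defaults).
    """
    return [max(x - alpha * n, beta * x) for x, n in zip(noisy_mag, noise_mag)]


def enhance(frames, noise_estimates, alpha=1.0, beta=0.02):
    """Apply the subtraction frame by frame with a per-frame noise estimate."""
    return [
        spectral_subtract(frame, noise, alpha, beta)
        for frame, noise in zip(frames, noise_estimates)
    ]


# One frame with three frequency bins; the second bin is below the noise
# estimate, so it is clamped to the spectral floor instead of going negative.
print(spectral_subtract([1.0, 0.5, 2.0], [0.25, 0.75, 0.5]))
```

Supplying a fresh noise estimate for every frame, rather than one frozen during the last speech pause, is what lets a scheme of this kind follow a quickly varying noise level.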
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT880591096
http://hdl.handle.net/11536/66329
Appears in Collections: Thesis