Title: Noisy Speech Segmentation/Enhancement with Multiband Analysis and Neural Fuzzy Networks
Authors: Gin-Der Wu (吳俊德)
Chin-Teng Lin (林進燈)
Institute of Electrical and Control Engineering (電控工程研究所)
Keywords: speech segmentation; speech enhancement; multiband; mel-scale filter bank; spectrum analysis; word boundary detection; self-constructing neural fuzzy inference network; recurrent self-organizing neural fuzzy inference network
Issue Date: 1999
Abstract: The main purpose of this thesis is to address the degradation of speech segmentation and enhancement caused by background noise. First, we propose a new speech segmentation method (the ATF-based SONFIN algorithm) for fixed noise-level environments. It is built on our adaptive time-frequency (ATF) parameter, which simultaneously extracts the important time-domain and spectral features of noisy speech. The ATF parameter extends the TF parameter proposed by Junqua et al. from single-band to multiband spectrum analysis, and this multiband analysis helps segment the speech signal under background noise interference. The ATF parameter adaptively selects appropriate bands from the mel-scale frequency bank to extract useful spectral information. It raises the recognition rate of the TF-based robust algorithm, previously regarded as the best-performing algorithm for noisy speech segmentation, by about 3%, and reduces the recognition error rate caused by segmentation to about 20%. Based on the ATF parameter, we further propose a segmentation algorithm that uses a feedforward self-constructing neural fuzzy inference network (SONFIN) to locate speech signals in noise. Because the SONFIN learns by itself, the ATF-based SONFIN algorithm avoids determining thresholds and ambiguous segmentation rules by trial and error. Compared with ordinary neural networks, the SONFIN learns quickly and finds an economical network size on its own, and our experiments show that its performance is not significantly affected by the size of the training set. The ATF-based SONFIN algorithm achieves a recognition rate about 5% higher than the TF-based robust algorithm and reduces the recognition error rate caused by segmentation to about 10%, compared with about 30% for the TF-based robust algorithm and 50% for the modified segmentation algorithm of Lamel et al.
Commonly used segmentation algorithms assume a fixed background noise level. In fact, the background noise level may change abruptly during recording, and the speech signal is further disturbed by nonstationary environmental noises such as movements, engine running, speed changes, braking, and impacts. This is why ordinary segmentation algorithms do not work well when the background noise level varies. To solve this problem, we propose the minimum mel-scale frequency band (MiMSB) parameter, which estimates the varying background noise level by adaptively selecting the band with minimum energy from the multiband mel-scale frequency bank analysis. With the MiMSB parameter, the preset thresholds used for segmentation are no longer fixed over the whole recording interval; they are tuned according to the MiMSB parameter. We also propose the enhanced time-frequency (ETF) parameter, which extracts useful spectral information from multiband analysis in variable noise-level environments. Based on the MiMSB and ETF parameters, we finally propose a new segmentation algorithm (the MiMSB-ETF-based algorithm) for variable noise-level environments. Under variable background noise levels, it achieves a recognition rate about 5% higher than the TF-based robust algorithm and reduces the recognition error rate caused by segmentation to 25%, compared with an average of 34% for the TF-based robust algorithm.
In addition, we propose the refined time-frequency (RTF) parameter to improve the ETF parameter; it extracts more useful spectral information than the ETF parameter. Based on the RTF parameter, we further propose a segmentation algorithm that uses a recurrent self-organizing neural fuzzy inference network (RSONFIN). Because the RSONFIN handles temporal relations, the RTF-based RSONFIN algorithm can follow the variation of the background noise level and segment speech correctly when that level varies. Compared with ordinary neural networks, the RSONFIN learns quickly and finds an economical network size on its own, and its self-learning ability lets the RTF-based RSONFIN algorithm avoid the trial-and-error determination of ambiguous segmentation rules required by ordinary segmentation algorithms. Experiments with white noise under variable background noise levels show that this segmentation algorithm achieves a recognition rate about 12% higher than the TF-based robust algorithm.
Commonly used single-channel subtractive-type speech enhancement algorithms assume that the background noise level is fixed or varies slowly. In fact, it may vary rapidly, which often causes wrong speech segmentation, and wrong segmentation in turn corrupts the enhancement process. To solve this problem, we propose a new enhancement scheme that uses our previously proposed RTF-based RSONFIN algorithm to segment speech correctly under variable background noise levels. We also propose the new MiFre parameter to improve the MiMSB parameter; it extracts more useful background noise level information than the MiMSB parameter. With the MiFre parameter, the noise level information used by the new subtractive-type enhancement scheme can be estimated not only during speech pauses but also during speech segments. Tests under variable background noise levels show that the new enhancement scheme outperforms conventional enhancement schemes.
This thesis addresses the problem that background noise acoustically added to speech degrades the performance of speech segmentation and enhancement, and develops new methods to improve these applications. First, we proposed a new speech segmentation method (the ATF-based SONFIN algorithm) for fixed noise-level environments. This method is based on an adaptive time-frequency (ATF) parameter that extracts both the time and frequency features of noisy speech signals. The ATF parameter extends the TF parameter proposed by Junqua et al. from single-band to multiband spectrum analysis, where the multiple frequency bands help make the distinction between speech and noise clear. The ATF parameter extracts useful frequency information by adaptively choosing proper bands of the mel-scale frequency bank. It increased the recognition rate of a TF-based robust algorithm, which had been shown to outperform several commonly used word boundary detection algorithms in the presence of noise, by about 3%, and reduced the recognition error rate due to endpoint detection to about 20%. Based on the ATF parameter, we further proposed a new word boundary detection algorithm that uses a self-constructing neural fuzzy inference network (SONFIN) to identify islands of word signals in noisy environments. Owing to the self-learning ability of the SONFIN, the ATF-based SONFIN algorithm avoids the empirically determined thresholds and ambiguous rules of conventional word boundary detection algorithms. Compared with ordinary neural networks, the SONFIN always finds an economical network size by itself at high learning speed, and our results showed that its performance is not significantly affected by the size of the training set. The ATF-based SONFIN algorithm achieved a recognition rate about 5% higher than the TF-based robust algorithm and reduced the recognition error rate due to endpoint detection to about 10%, compared with an average of approximately 30% for the TF-based robust algorithm and 50% for the modified version of the algorithm of Lamel et al. Commonly used robust word boundary detection algorithms assume that the background noise level is fixed; in fact, it may vary during recording. The speech signal is further complicated by nonstationary backgrounds with concurrent noises due to movements, engine running, speed changes, braking, slams, and so on. This is the major reason that most robust word boundary detection algorithms do not work well under a variable background noise level. To solve this problem, we proposed the minimum mel-scale frequency band (MiMSB) parameter, which estimates the varying background noise level by adaptively choosing the band with minimum energy from the mel-scale frequency bank. With the MiMSB parameter, the preset thresholds used to find the boundaries of word signals are no longer fixed over the whole recording interval; they are tuned according to the MiMSB parameter. We also proposed an enhanced time-frequency (ETF) parameter, which extracts useful frequency information from multiband analysis in variable noise-level environments. Based on the MiMSB and ETF parameters, we finally proposed a new word boundary detection method (the MiMSB-ETF-based algorithm) for variable noise-level environments.
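The abstract does not give the exact formula of the ATF parameter, so the sketch below only illustrates the underlying idea: frame the noisy signal, compute a mel-scale filter-bank spectrum, adaptively keep the bands whose log-energy rises most above an initial noise estimate, and combine that frequency term with the time-domain log-energy. The band-selection rule, the weighting, and all function names (`mel_filter_bank`, `atf_like_parameter`) are illustrative assumptions, not the thesis' definitions.

```python
import numpy as np

def mel_filter_bank(n_filters=20, n_fft=256, sr=8000):
    """Triangular mel-scale filters (rows) over positive FFT bins (columns)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    m_pts = np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft // 2 + 1) * inv_mel(m_pts) / (sr / 2.0)).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fbank[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising edge
        if r > c:
            fbank[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling edge
    return fbank

def atf_like_parameter(frames, fbank, noise_frames=5):
    """Illustrative time+frequency feature per frame (not the thesis' ATF formula).

    frames: (n_frames, frame_len) windowed frames; frame_len must equal n_fft.
    """
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2            # power spectrum
    band_energy = np.log10(spec @ fbank.T + 1e-10)             # mel-band log-energies
    noise_floor = band_energy[:noise_frames].mean(axis=0)      # leading frames assumed noise-only
    lift = band_energy - noise_floor                           # rise above the noise floor
    k = max(1, fbank.shape[0] // 3)                            # assumed adaptive band selection:
    freq_term = np.sort(lift, axis=1)[:, -k:].sum(axis=1)      # keep the k most prominent bands
    time_term = np.log10(np.sum(frames ** 2, axis=1) + 1e-10)  # time-domain log-energy
    return time_term + freq_term                               # combined time-frequency feature
```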
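The same paragraph also introduces the MiMSB parameter, which tracks a varying background noise level from the minimum-energy mel band and lets the boundary-detection thresholds follow it. Below is a minimal sketch of that idea, assuming mel-band log-energies like those above; the smoothing constant, the threshold offset, and the function names are assumed values for illustration, not the thesis' settings.

```python
import numpy as np

def track_noise_level_mimsb(band_energy, alpha=0.9):
    """Per-frame noise-level estimate from the minimum-energy mel band (MiMSB idea).

    band_energy: (n_frames, n_bands) mel-band log-energies.
    alpha: recursive smoothing constant (assumed value).
    """
    min_band = band_energy.min(axis=1)            # minimum-energy band per frame
    level = np.empty_like(min_band)
    level[0] = min_band[0]
    for t in range(1, len(min_band)):             # smooth the estimate over time
        level[t] = alpha * level[t - 1] + (1 - alpha) * min_band[t]
    return level

def detect_boundaries(feature, noise_level, offset=3.0):
    """Label frames as speech when the feature exceeds an adaptive threshold.

    The threshold is no longer a fixed preset value: it follows the tracked
    noise level. `offset` is an assumed margin that would be tuned in practice.
    """
    is_speech = feature > (noise_level + offset)
    edges = np.flatnonzero(np.diff(is_speech.astype(int)))   # boundary frame indices
    return is_speech, edges
```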
This MiMSB-ETF-based algorithm achieved a recognition rate about 5% higher than the TF-based robust algorithm under variable background noise levels. It also reduced the recognition error rate due to endpoint detection to 25%, compared with an average of 34% for the TF-based robust algorithm. In addition, we proposed a refined time-frequency (RTF) parameter to improve the ETF parameter; the RTF parameter extracts more useful frequency information than the ETF parameter. Based on the RTF parameter, we further proposed a new word boundary detection algorithm that uses a recurrent self-organizing neural fuzzy inference network (RSONFIN). Since the RSONFIN can process temporal relations, the proposed RTF-based RSONFIN algorithm can follow the variation of the background noise level and detect correct word boundaries under a variable background noise level. Compared with ordinary neural networks, the RSONFIN always finds an economical network size by itself at high learning speed. Owing to its self-learning ability, the RTF-based RSONFIN algorithm avoids the empirically determined ambiguous decision rules of conventional word boundary detection algorithms. Experiments in white noise show that this new algorithm achieves a recognition rate about 12% higher than the TF-based robust algorithm under a variable background noise level. Commonly used single-channel subtractive-type speech enhancement algorithms assume that the background noise level is fixed or slowly varying; in fact, it may vary quickly. This condition usually results in wrong speech/noise detection, which in turn corrupts the speech enhancement process. To solve this problem, we proposed a new subtractive-type speech enhancement scheme. This scheme uses the RTF-based RSONFIN algorithm developed above to detect word boundaries under a variable background noise level. In addition, a new parameter (MiFre) is used to extract more useful background noise level information than the MiMSB parameter. Based on the MiFre parameter, the noise level information used by the new subtractive-type speech enhancement scheme can be estimated not only during speech pauses but also during speech segments. This new subtractive-type speech enhancement scheme has been tested and performs well under both variable and fixed background noise level conditions.
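The abstract does not describe RSONFIN's structure, and no attempt is made to reproduce it here. The sketch below only illustrates the general point that a frame decision with temporal memory is more stable than a memoryless threshold when the background level jumps; a hand-written two-state hysteresis with a hangover counter stands in for the recurrent processing, and all constants are assumptions.

```python
import numpy as np

def segment_with_memory(feature, noise_level, on_margin=4.0, off_margin=2.0,
                        hangover=5):
    """Frame-wise speech/non-speech decision with temporal memory.

    A memoryless threshold flips state on every noise burst; carrying the
    previous decision and a hangover counter (a crude stand-in for temporal
    processing, not RSONFIN itself) yields smoother word boundaries.
    """
    in_speech, hang = False, 0
    labels = np.zeros(len(feature), dtype=bool)
    for t, (x, n) in enumerate(zip(feature, noise_level)):
        if not in_speech:
            if x > n + on_margin:          # enter speech only on a clear rise
                in_speech, hang = True, hangover
        else:
            if x < n + off_margin:         # leave speech only after `hangover`
                hang -= 1                  # consecutive low frames
                if hang == 0:
                    in_speech = False
            else:
                hang = hangover
        labels[t] = in_speech
    return labels
```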
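For the enhancement part, here is a minimal sketch of a single-channel subtractive-type loop in the spirit described: the noise spectrum estimate is refreshed in frames labeled non-speech and is also allowed to drift inside speech segments via a running-minimum update, so a fast change of the background level is not missed. The over-subtraction factor, spectral floor, and update constants are assumptions, and the boolean labels (e.g. from `segment_with_memory` above) together with the running-minimum update merely stand in for the thesis' RTF-based RSONFIN detector and MiFre parameter.

```python
import numpy as np

def enhance_spectral_subtraction(frames, is_speech, alpha_pause=0.8,
                                 alpha_speech=0.98, over=2.0, floor=0.02):
    """Subtractive-type enhancement with a continuously updated noise estimate.

    frames: (n_frames, frame_len) windowed signal frames.
    is_speech: boolean frame labels from a word boundary detector.
    """
    spec = np.fft.rfft(frames, axis=1)
    power = np.abs(spec) ** 2
    noise = power[0].copy()                      # first frame assumed noise-only
    clean = np.empty_like(power)
    for t in range(len(power)):
        if not is_speech[t]:
            # speech pause: track the noise spectrum quickly
            noise = alpha_pause * noise + (1 - alpha_pause) * power[t]
        else:
            # inside speech: still let the estimate follow downward drifts
            noise = np.minimum(noise,
                               alpha_speech * noise + (1 - alpha_speech) * power[t])
        # over-subtraction with a spectral floor to limit musical noise
        clean[t] = np.maximum(power[t] - over * noise, floor * power[t])
    # resynthesize frames with the original phase (overlap-add omitted here)
    enhanced = np.fft.irfft(np.sqrt(clean) * np.exp(1j * np.angle(spec)),
                            n=frames.shape[1], axis=1)
    return enhanced
```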
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT880591096
http://hdl.handle.net/11536/66329
Appears in Collections: Thesis