Title: | Spectro-Temporal Modulation Based Ideal Binary Mask and Speech Intelligibility (時頻域調變之理想二元遮罩和語音理解度) |
Authors: | Huang, Ying-An (黃盈諳); Chi, Tai-Shih (冀泰石); Institute of Communications Engineering |
Keywords: | ideal binary mask; speech intelligibility; spectro-temporal modulation |
Issue Date: | 2015 |
Abstract: | This thesis examines how ideal binary masks (IBMs) constructed from different types of energy affect speech intelligibility. The energy types considered are total energy, temporal modulation energy, spectral modulation energy, and joint spectro-temporal modulation energy. Modulation energy has been shown to be highly related to speech intelligibility. We systematically apply modulation low-pass filters with different cutoff frequencies to the spectrogram and use the modulation SNR to construct each type of IBM. Psychoacoustic experiments are then conducted to investigate the effect of these modulation-based IBMs on Mandarin speech intelligibility.
Next, we use a simple deep neural network (DNN) as the classifier that estimates the IBM for monaural speech separation. The known IBM is the prediction target of the DNN, and the original spectrogram and the spectrograms filtered by different modulation filters are its input features. The objective speech intelligibility scores of the separated speech follow a trend similar to the subjective scores from the psychoacoustic experiments. Because a DNN can automatically extract appropriate features from the observations, features do not need to be defined and selected by hand, so the spectrograms themselves are used as the DNN input. Five frames of each spectrogram are concatenated into the input vector, and the resulting estimation performance is taken as the baseline. To compare with [47], we also estimate the IBM from a single frame of each modulation spectrogram.
Finally, we examine how estimation performance changes when inputs covering different time spans are used for the different modulation energy spectrograms. |
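The two building blocks named in the abstract, an energy-based IBM and a multi-frame DNN input vector, can be sketched as follows. This is only an illustration under stated assumptions, not the thesis implementation: the function names `ideal_binary_mask` and `stack_context_frames`, the 0 dB local criterion, the total-energy (rather than modulation-SNR) mask rule, and the edge padding are all hypothetical choices that the record does not specify.

```python
import numpy as np


def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0):
    """Return 1 where the local SNR in dB exceeds lc_db, else 0.

    speech_energy, noise_energy: non-negative arrays of shape (freq, time)
    holding the premixed energy of each time-frequency unit.
    lc_db: local criterion in dB (0 dB here is an assumption).
    """
    eps = 1e-12  # guards against log(0) and division by zero
    snr_db = 10.0 * np.log10((speech_energy + eps) / (noise_energy + eps))
    return (snr_db > lc_db).astype(np.uint8)


def stack_context_frames(spectrogram, n_frames=5):
    """Concatenate n_frames neighbouring frames around each time index.

    spectrogram: array of shape (freq, time).
    Returns an array of shape (time, freq * n_frames).
    Edges are padded by repeating the first/last frame (an assumption).
    """
    if n_frames % 2 != 1:
        raise ValueError("n_frames must be odd")
    half = n_frames // 2
    padded = np.pad(spectrogram, ((0, 0), (half, half)), mode="edge")
    n_time = spectrogram.shape[1]
    # Each window is flattened frame by frame: oldest frame first.
    windows = [padded[:, t:t + n_frames].T.reshape(-1) for t in range(n_time)]
    return np.stack(windows)


# Toy example: 2 frequency channels, 3 time frames (made-up values).
speech = np.array([[4.0, 1.0, 0.0], [1.0, 9.0, 2.0]])
noise = np.array([[1.0, 1.0, 1.0], [2.0, 3.0, 2.0]])
mask = ideal_binary_mask(speech, noise)                # training target
features = stack_context_frames(speech, n_frames=5)    # DNN input vectors
```

Setting `n_frames=1` gives the single-frame input mentioned in the abstract. A modulation-based IBM would apply the same thresholding to modulation-filtered energies instead of the total energy used here.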
URI: | http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070260264 http://hdl.handle.net/11536/143247 |
Appears in Collections: | Thesis |