Title: | Spectro-Temporal Modulation Based Ideal Binary Mask and Speech Intelligibility (时频域调变之理想二元遮罩和语音理解度) |
Authors: | Huang, Ying-An (黄盈谙); Chi, Tai-Shih (冀泰石); Institute of Communications Engineering |
Keywords: | ideal binary mask; speech intelligibility; spectro-temporal modulation |
Issue date: | 2015 |
Abstract: | This thesis examines how ideal binary masks (IBMs) constructed from different kinds of energy affect speech intelligibility. The energies considered are the total energy, temporal modulation energy, spectral modulation energy, and joint spectro-temporal modulation energy. Modulation energy has been shown to be highly correlated with speech intelligibility. We apply modulation low-pass filters with different cutoff frequencies to the modulations of the spectrogram and use the modulation SNR to construct the different types of IBMs. Psychoacoustic experiments are then conducted to investigate the effect of these modulation-based IBMs on Mandarin speech intelligibility. Next, we use a simple deep neural network (DNN) as a classifier that estimates the IBM for monaural speech separation. The known IBM serves as the training target, and the original spectrogram and the spectrograms filtered by different modulation filters serve as the input features. Objective intelligibility scores of the separated speech follow a trend similar to that of the subjective scores from the psychoacoustic experiments. Because a DNN can learn appropriate features from the observations automatically, no hand-crafted feature selection is needed, so we feed the spectrograms to the DNN directly. Five frames of each spectrogram are concatenated into the input vector, and the resulting estimation performance is taken as the baseline. For comparison with [47], we also estimate the IBM from a single frame of each modulation spectrogram. Finally, we examine how the estimation performance changes when different time spans are used for the different modulation energy spectrograms at the DNN input. |
URI: | http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070260264 http://hdl.handle.net/11536/143247 |
Appears in collections: | Thesis |
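
As a rough illustration of two ideas named in the abstract, the sketch below builds a conventional energy-based ideal binary mask and concatenates neighbouring frames into one input vector. It is not the thesis's implementation. The function names, the 0 dB local criterion `lc_db`, the symmetric context window, and the edge padding by frame repetition are all assumptions made for this example; the thesis itself builds its masks from modulation-filtered energies and a modulation SNR, which this sketch does not reproduce.

```python
import math


def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0):
    """Return a 0/1 mask that is 1 where the local SNR in dB exceeds lc_db.

    speech_energy, noise_energy: lists indexed [frame][channel] holding
    non-negative energies of the premixed speech and noise (hypothetical
    inputs; an ideal mask needs both signals separately).
    """
    eps = 1e-12  # keeps the ratio and the logarithm finite for zero energy
    mask = []
    for s_frame, n_frame in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(s_frame, n_frame):
            snr_db = 10.0 * math.log10((s + eps) / (n + eps))
            row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask


def stack_context(frames, context=2):
    """Concatenate each frame with `context` neighbours on each side.

    Assumption: the window is centred and edge frames are padded by
    repeating the first or last frame. context=2 yields five frames per
    input vector; context=0 yields the single-frame case.
    """
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        vec = []
        for k in range(t - context, t + context + 1):
            vec.extend(frames[min(max(k, 0), last)])
        stacked.append(vec)
    return stacked
```

With `context=2`, a spectrogram of C channels per frame gives input vectors of length 5C, and the mask row for the centre frame would be the natural training target under these assumptions.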