Title: | Voice Activity Detection with Aid of Images for Digital Hearing Aids System |
Author: | Chen, Yi-An (陳翊安); Jou, Shyh-Jye (周世傑); Institute of Biomedical Engineering |
Keywords: | voice activity detection (VAD); image; audio; machine learning; support vector machine (SVM); audio-visual |
Issue Date: | 2015 |
Abstract: | A hearing aid must feed the processed speech signal back to the wearer in real time, so its computation cannot be too complex. Current hearing-aid systems face two main problems. First, speech intelligibility degrades in noisy environments, especially when the background noise is other people talking. Second, acoustic feedback between the hearing aid's microphone and speaker limits the gain a prescription can provide and can also cause oscillation of the output. Voice activity detection (VAD) is a key component of both feedback cancellation and noise reduction. When the signal-to-noise ratio (SNR) is low or the background noise resembles speech (e.g., babble noise), the accuracy of audio-only VAD drops. In this thesis, visual features of the speaker's lips during talking are used to help decide whether speech is present, enabling better noise reduction.
Visual features are not disturbed by acoustic noise, so they can be extracted and used even in noisy environments. Because extracting them requires an extra device and additional computation, it is only worthwhile to enable this function when the noise level is high. The proposed method detects the talker's lips, combines the extracted visual features with audio features, feeds them through a trained support vector machine (SVM) model, and then applies a "keeper" judgment that smooths the SVM outputs into the final VAD decision.
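The audio-visual fusion step can be sketched with an off-the-shelf SVM. This is a minimal illustration assuming scikit-learn and synthetic stand-in features; the thesis's actual audio and lip features, kernel choice, and training setup are not reproduced here.

```python
# Sketch of audio-visual feature fusion with an SVM (scikit-learn assumed).
# The feature values are synthetic stand-ins, not the thesis's real
# audio (e.g., energy/spectral) or lip-shape features.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

n_frames = 200
# Hypothetical per-frame features: 3 audio dimensions + 2 visual (lip) dimensions.
audio_feats = rng.normal(size=(n_frames, 3))
visual_feats = rng.normal(size=(n_frames, 2))
labels = (audio_feats[:, 0] + visual_feats[:, 0] > 0).astype(int)  # 1 = speech

# Fusion: concatenate audio and visual features for each frame.
X = np.hstack([audio_feats, visual_feats])

clf = SVC(kernel="rbf")  # kernel choice is an assumption
clf.fit(X, labels)

frame_decisions = clf.predict(X)  # raw per-frame VAD, before any smoothing
```

In a real system the model would be trained offline and only `predict` would run per frame, keeping the wearable-side computation small.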
Empirically, a VAD accuracy above 80% is considered good enough, so this thesis takes 80% as the target even in low-SNR environments. Experimental results show that, when the visual signal is good (sufficient white-light illumination), the image-aided VAD, with the SVM trained and tested across different speakers, reaches 89% accuracy on continuous speech and 84% on intermittent speech. In addition, a desktop PC system combining the audio and visual VAD is built to run real-time noise-reduction simulation. |
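The thesis does not spell out the keeper's rule beyond saying it smooths the SVM outputs. One common way to smooth frame-wise VAD is a hangover scheme that holds a "speech" decision for a few frames after the classifier goes silent; the sketch below assumes that form, and the hangover length is arbitrary.

```python
def keeper_smooth(raw_vad, hangover=5):
    """Hold a 'speech' decision for `hangover` frames after the raw
    classifier output drops to 0, removing short spurious gaps.
    This is an assumed smoothing rule; the thesis's keeper may differ."""
    smoothed = []
    hold = 0
    for frame in raw_vad:
        if frame == 1:
            hold = hangover      # refill the hold counter on detected speech
            smoothed.append(1)
        elif hold > 0:
            hold -= 1            # keep marking speech during the hangover
            smoothed.append(1)
        else:
            smoothed.append(0)
    return smoothed

raw = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(keeper_smooth(raw, hangover=2))
# → [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
```

The two short silent gaps after speech are bridged, while the long trailing silence is still reported as non-speech, which is the behavior a smoother VAD output needs for stable noise reduction.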
URI: | http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070256727 http://hdl.handle.net/11536/138629 |
Appears in Collections: | Thesis |