Title: Multi-Task Learning based Deep Belief Network for Speech Emotion Recognition using Spectro-Temporal Modulations
Author: Wheng, Ko-Cheng
Chi, Tai-Shih
Institute of Communications Engineering
Keywords: emotion recognition; multi-task learning; deep belief network; spectro-temporal modulations
Issue Date: 2014
Abstract: Speech emotion recognition has been a popular research topic over the past decade. Meanwhile, since the revival of deep learning in 2007, deep learning methods have been adopted in many research fields. In this thesis, we use a deep belief network (DBN) as the classifier and examine its performance in recognizing the emotional states of noisy speech using joint rate-scale features (RS features) extracted from an auditory perception model. The noisy speech is derived by adding white noise and babble noise, at various signal-to-noise ratio (SNR) levels, to clean utterances from the Berlin Emotional Speech database. The official feature set of the INTERSPEECH 2009 Emotion Challenge (Inter384) and a conventional support vector machine (SVM) classifier serve as baselines for the RS feature set and the DBN classifier, respectively. Furthermore, we propose an extended DBN architecture based on the concept of multi-task learning (MTL): a task of recognizing emotions in a second language (the English eNTERFACE 2005 Emotional Database) is added to the system, producing a single recognizer trained on both languages. We postulate that the classifier can extract more information from the two language tasks jointly than from either task alone, thereby improving recognition rates. Simulation results demonstrate that (1) RS features yield higher recognition rates than Inter384 features; (2) the DBN outperforms the SVM when using RS features; and (3) the MTL-based DBN achieves higher recognition rates than the original single-language DBN.
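The noisy test data described above are created by adding white or babble noise to clean utterances at fixed SNR levels. A minimal sketch of such mixing is shown below; the function name and implementation details are our own illustration, not taken from the thesis:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that adding it to `clean` yields the requested
    signal-to-noise ratio in dB, then return the noisy mixture."""
    # Tile or truncate the noise to match the clean signal's length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    # SNR(dB) = 10 * log10(P_clean / P_noise); solve for the noise gain.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```

Because the gain is computed from the measured signal powers, the achieved SNR of the mixture matches the requested value exactly; the same routine works for both white noise (random samples) and babble noise (a recorded waveform).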
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT070160236
http://hdl.handle.net/11536/76543
Appears in Collections: Thesis