Title: | 鑑別式訓練法於語者驗證之研究 (Discriminative Training Methods for Speaker Verification) |
Author: | Chao, Yi-Hsiang (趙怡翔); Chang, Ruei-Chuan (張瑞川); Wang, Hsin-Min (王新民); Institute of Computer Science and Engineering |
Keywords: | speaker verification; hypothesis testing; likelihood ratio; minimum verification error training; Kernel Discriminant Analysis; Discriminative Feedback Adaptation |
Issue Date: | 2008 |
Abstract: | Speaker verification is usually formulated as a statistical hypothesis testing problem and solved by a likelihood ratio (LR) test. A speaker verification system's performance depends heavily on how well it models the target speaker's voice (the null hypothesis) and characterizes non-target speakers' voices (the alternative hypothesis). However, because the alternative hypothesis involves unknown impostors, it is usually difficult to characterize a priori. In this dissertation, we propose a framework that better characterizes the alternative hypothesis, with the goal of optimally distinguishing the target speaker from impostors. The framework is built on a weighted arithmetic combination (WAC) or a weighted geometric combination (WGC) of useful information extracted from a set of pre-trained background models. The parameters associated with WAC or WGC are optimized using two discriminative training methods, namely the minimum verification error (MVE) training method and the proposed evolutionary MVE (EMVE) training method, so that both the false acceptance probability and the false rejection probability are minimized. We also propose two new decision functions based on WAC and WGC, which can be regarded as nonlinear discriminant classifiers. To solve for the weight vector w, we propose using two kernel-based discriminant techniques, namely the Kernel Fisher Discriminant (KFD) and the Support Vector Machine (SVM), because of their ability to separate samples of target speakers from those of non-target speakers effectively.
In recent years, the GMM-UBM system has been the predominant approach to text-independent speaker verification. Its advantage is that both the target speaker model and the universal background model (UBM), which serves as the impostor model, have generalization ability. However, because the two models are trained separately under different criteria, neither training procedure takes the discriminability between the target speaker model and the UBM into account, so the target speaker cannot be distinguished from background speakers optimally. To improve the GMM-UBM approach, we propose a discriminative feedback adaptation (DFA) framework that allows generalization and discrimination to be considered jointly. The framework not only preserves the generalization ability of the GMM-UBM approach, but also reinforces the discriminability between the target speaker model and the UBM. Under DFA, rather than using a single unified UBM, we construct a discriminative anti-model exclusively for each target speaker.
The results of speaker verification experiments conducted on three speech corpora, the Extended M2VTS Database (XM2VTSDB), the ISCSLP2006-SRE database, and the NIST2001-SRE database, show that the proposed methods outperform all of the conventional LR-based approaches. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT009223812 http://hdl.handle.net/11536/76688 |
Appears in Collections: | Thesis |
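The likelihood ratio test described in the abstract, with the alternative hypothesis formed as a WAC or WGC of background-model likelihoods, might be sketched as follows. This is only an illustration under stated assumptions, not the dissertation's implementation: the function names (`log_wac`, `log_wgc`, `accept`), the log-domain formulation, the non-negative weights, and the default threshold of 0 are all hypothetical choices made for the example. In the thesis, the weights would be learned by MVE, EMVE, KFD, or SVM training, which is not shown here.

```python
import math


def log_wac(log_likes, weights):
    """Log of a weighted arithmetic combination: log(sum_i w_i * p_i).

    Assumption: weights are non-negative and at least one is positive.
    Computed with the log-sum-exp trick for numerical stability.
    """
    terms = [math.log(w) + ll for w, ll in zip(weights, log_likes) if w > 0]
    if not terms:
        raise ValueError("at least one weight must be positive")
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


def log_wgc(log_likes, weights):
    """Log of a weighted geometric combination: sum_i w_i * log(p_i)."""
    return sum(w * ll for w, ll in zip(weights, log_likes))


def accept(target_ll, background_lls, weights, threshold=0.0, combo="wac"):
    """Accept the claimed speaker if the log-likelihood ratio meets the threshold.

    target_ll:      log-likelihood of the utterance under the target model
                    (null hypothesis).
    background_lls: log-likelihoods under each pre-trained background model.
    combo:          "wac" or "wgc", selecting how the alternative hypothesis
                    is formed (hypothetical parameter name).
    """
    if combo == "wac":
        alt_ll = log_wac(background_lls, weights)
    else:
        alt_ll = log_wgc(background_lls, weights)
    return (target_ll - alt_ll) >= threshold
```

For example, with background log-likelihoods `[-10.0, -12.0]` and equal weights `[0.5, 0.5]`, a call such as `accept(-9.0, [-10.0, -12.0], [0.5, 0.5])` compares the target score against the combined background score. For weights that sum to 1, the WAC score is never smaller than the WGC score (the arithmetic mean is at least the geometric mean), so the two combinations can give different decisions for the same utterance and threshold.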