Title: Deep Variational Manifold Learning for Speaker Recognition
Authors: Hsu, Cheng-Wei
Chien, Jen-Tzung
Department: Department of Electrical Engineering
Keywords: manifold learning; nonlinear dimension reduction; deep neural network; variational inference; discriminative learning; speaker recognition
Issue Date: 2016
Abstract: Traditionally, a speaker recognition system that uses the i-vector as the speaker feature and probabilistic linear discriminant analysis (PLDA) as the scoring function has achieved state-of-the-art performance in many tasks. PLDA is a linear model trained under the assumption that each speaker shares a common low-dimensional latent variable space in which the i-vectors of all speakers are linearly transformed and represented. PLDA is estimated with the expectation-maximization (EM) algorithm, which maximizes the likelihood over the whole set of training data without class information, so no discriminative learning is performed. A speaker model based on PLDA is therefore constrained by its linear assumption, shallow representation, high dimensionality, non-discriminative objective, and batch learning. In this study, we propose a deep manifold neural network to overcome these constraints. A deep latent variable model based on the variational auto-encoder (VAE) is incorporated to conduct discriminative manifold learning and scoring for speaker recognition.

Manifold learning aims to learn a low-dimensional representation from high-dimensional observations, e.g. i-vectors, by optimizing a neighbor-embedding objective. Speaker labels from the training utterances are introduced so that observations in the low-dimensional space attract each other within the same speaker and repel each other across different speakers. To further strengthen system performance, such supervised manifold learning is realized as a deep latent variable model for two reasons. First, a deep neural network can represent the complicated intra-speaker and inter-speaker characteristics. Second, a latent variable model can explore the latent structure and compensate for the uncertainty of a deep model trained via stochastic gradient descent over mini-batches of speaker utterances.

To address these considerations, we develop deep variational manifold learning for speaker recognition, where variational inference is implemented to estimate the latent variable speaker model. Importantly, we develop a new manifold learning framework based on the VAE, which consists of an encoder that transforms the observed data into latent variables and a decoder that projects the latent variables back to the original data space. The nonlinear mapping between the high-dimensional observations and the low-dimensional latent variables is learned to faithfully reflect the intra-speaker and inter-speaker characteristics, including speaker and channel variability. We mimic the PLDA model by imposing a linear transformation in the decoding step of the VAE. The means and variances of the latent variables are estimated from the training i-vectors by maximizing the variational lower bound of the log likelihood, and a single neural network is shared across all speakers. In particular, we introduce a Bernoulli variable to indicate the class information of each pair of i-vectors and use it to express the attraction between low-dimensional samples of the same speaker and the repulsion between samples of different speakers. Accordingly, we build a stochastic neighbor embedding that uses a neural network as the encoder and a linear transform as the decoder, yielding a hybrid generative and discriminative model for deep manifold learning in i-vector based speaker recognition. The proposed method is evaluated by experiments on speaker recognition based on the NIST i-vector Machine Learning Challenge.
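For context, the PLDA baseline criticized in the abstract is commonly written as a linear-Gaussian latent variable model; a standard formulation for i-vectors (our notation, not necessarily the symbols used in the thesis) is

\[
\mathbf{w} = \mathbf{m} + \mathbf{V}\mathbf{h} + \boldsymbol{\epsilon}, \qquad
\mathbf{h} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \quad
\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}),
\]

where \(\mathbf{w}\) is an i-vector, \(\mathbf{m}\) the global mean, \(\mathbf{V}\) a low-rank loading matrix shared by all speakers, \(\mathbf{h}\) the per-speaker latent variable, and \(\boldsymbol{\epsilon}\) a residual. EM maximizes \(\sum_n \log p(\mathbf{w}_n)\) over the whole training set, which is what the abstract means by batch, non-discriminative learning.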
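The training criterion sketched in the abstract couples a per-sample variational lower bound with a pairwise discriminative term. One plausible rendering (the weight \(\lambda\) and the sigmoid link \(\sigma\) are our assumptions) is

\[
\mathcal{L} = \sum_i \Big( \mathbb{E}_{q_\phi(\mathbf{z}_i \mid \mathbf{x}_i)}\big[\log p_\theta(\mathbf{x}_i \mid \mathbf{z}_i)\big]
- D_{\mathrm{KL}}\big(q_\phi(\mathbf{z}_i \mid \mathbf{x}_i) \,\|\, p(\mathbf{z}_i)\big) \Big)
+ \lambda \sum_{i \neq j} \log p\big(y_{ij} \mid \mathbf{z}_i, \mathbf{z}_j\big),
\]

where \(p_\theta(\mathbf{x} \mid \mathbf{z}) = \mathcal{N}(\mathbf{x}; \mathbf{A}\mathbf{z}, \boldsymbol{\Sigma})\) uses a linear decoder \(\mathbf{A}\) to mimic PLDA, \(q_\phi(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}_\phi(\mathbf{x}), \mathrm{diag}\,\boldsymbol{\sigma}^2_\phi(\mathbf{x}))\) is given by the encoder network, and \(y_{ij} \in \{0,1\}\) is the Bernoulli indicator of whether \(\mathbf{x}_i\) and \(\mathbf{x}_j\) come from the same speaker, e.g. \(p(y_{ij}=1 \mid \mathbf{z}_i, \mathbf{z}_j) = \sigma(-\|\mathbf{z}_i - \mathbf{z}_j\|^2)\). Maximizing the last term attracts same-speaker latents and repels different-speaker latents.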
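The thesis itself provides no code in this record; the following is a minimal PyTorch sketch of the architecture the abstract describes (neural-network encoder, linear decoder, pairwise attraction/repulsion driven by a Bernoulli same-speaker indicator). All names (ManifoldVAE, loss_fn), the layer sizes, and the sigmoid link are illustrative assumptions, not the author's implementation.

# Hypothetical sketch, not the thesis code: VAE with an NN encoder,
# a PLDA-like linear decoder, and a pairwise neighbor-embedding term.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ManifoldVAE(nn.Module):
    def __init__(self, x_dim=600, h_dim=256, z_dim=100):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)        # mean of q(z|x)
        self.logvar = nn.Linear(h_dim, z_dim)    # log variance of q(z|x)
        self.dec = nn.Linear(z_dim, x_dim, bias=False)  # linear decoder, mimics PLDA

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar, z

def loss_fn(model, x, spk, lam=1.0):
    x_hat, mu, logvar, z = model(x)
    recon = F.mse_loss(x_hat, x, reduction='sum')  # -log p(x|z) up to a constant
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Pairwise Bernoulli term: y_ij = 1 for same-speaker pairs in the mini-batch.
    d2 = torch.cdist(z, z).pow(2)                  # squared latent distances
    y = (spk[:, None] == spk[None, :]).float()
    p_same = torch.sigmoid(-d2).clamp(1e-6, 1 - 1e-6)  # assumed link function
    mask = 1 - torch.eye(len(x), device=x.device)  # drop self-pairs
    pair = F.binary_cross_entropy(p_same, y, reduction='none')
    return recon + kl + lam * (pair * mask).sum()

A usage example under the same assumptions: with x = torch.randn(32, 600) as a mini-batch of i-vectors and spk = torch.randint(0, 8, (32,)) as speaker labels, loss_fn(ManifoldVAE(), x, spk).backward() computes gradients for one stochastic-gradient step, matching the mini-batch training regime described above.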
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070350724
http://hdl.handle.net/11536/139674
Appears in Collections: Thesis