Title: Deep Adversarial Learning for Speaker Recognition (深度對抗式學習於語者辨識之研究)
Authors: Peng, Kang-Ting (彭康庭)
Chien, Jen-Tzung (簡仁宗)
Department: Electrical Engineering
Keywords: deep neural network; adversarial learning; probabilistic linear discriminant analysis; i-vector; data augmentation; manifold learning; discriminative learning; speaker recognition
Issue Date: 2017
Abstract:
In recent years, the i-vector representation of speaker utterances has been developed with great success for speaker recognition. This representation provides a way to map a speaker utterance of arbitrary duration into a fixed-length, low-dimensional vector that preserves the speaker identity. Probabilistic linear discriminant analysis (PLDA) based on i-vectors currently dominates the research field due to its state-of-the-art performance in speaker recognition. However, two weaknesses still constrain the performance of combining i-vectors with the PLDA model. First, the number of training utterances is unbalanced across speakers; in particular, the development data of some speakers are insufficient. Data augmentation is an important vehicle for balancing the distribution of the data collection and compensating for the small-sample problem so that a reliable model can be trained. Second, the latent variable model in PLDA assumes that the latent variables are linear and that their dimensionality is unchanged, so the complex and nonlinear relations among the utterances of a speaker are not characterized. Moreover, PLDA is constructed as a generative model without discriminative learning. This study presents deep adversarial learning to compensate for these weaknesses and relax the underlying assumptions so as to improve system performance for speaker recognition. Deep adversarial learning aims to learn a generative adversarial network (GAN) through a two-player game formulated as a minimax optimization problem. We develop new GANs for data augmentation as well as nonlinear subspace learning, and systematically apply them to speaker recognition.

First, we present a GAN solution that realizes a generative model and artificially generates i-vectors to tackle the small-sample-size problem in the presence of an unbalanced distribution of training data over speakers. A GAN consists of two neural networks, a generator and a discriminator. The discriminator is a classifier that determines whether a given sample is a real sample from the dataset or an artificially created one. The generator attempts to generate plausible samples that the discriminator cannot distinguish from real ones. We adopt two realizations of the GAN for data augmentation. The first is an auxiliary classifier GAN (AC-GAN), which incorporates the class labels to produce class-conditional i-vectors. The second is a new GAN based on an auto-encoder: the encoder maps an existing i-vector into a latent space, a discriminator classifies whether the latent features come from a real or a generated i-vector, and a decoder reconstructs the i-vector from its latent representation. On the other hand, data augmentation is realized and strengthened by tightly merging the AC-GAN with the speaker recognition system. In this realization, we directly treat the speaker recognition system as the discriminator and calculate the cosine similarity between the latent features of real and generated data. The new GAN is implemented according to a minimax optimization in which this cosine similarity measure is minimized to estimate the discriminator and maximized to construct the generator. We can therefore sample unseen i-vectors to tackle the unbalanced data condition in the speaker recognition system.
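As a concrete illustration of the AC-GAN realization described above, the following is a minimal PyTorch sketch of class-conditional i-vector generation under the standard GAN minimax, min_G max_D E_x[log D(x)] + E_z[log(1 - D(G(z)))]. The dimensions (600-dimensional i-vectors, 100-dimensional noise, 1000 speakers), the network sizes, and all names such as train_step are assumptions for illustration, not the configuration used in the thesis.

```python
# Minimal AC-GAN-style sketch for i-vector augmentation (illustrative only;
# dimensions, architectures, and names are assumed, not taken from the thesis).
import torch
import torch.nn as nn

IVEC_DIM, NOISE_DIM, N_SPEAKERS = 600, 100, 1000

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_SPEAKERS, NOISE_DIM)   # speaker-label conditioning
        self.net = nn.Sequential(nn.Linear(NOISE_DIM, 512), nn.ReLU(),
                                 nn.Linear(512, IVEC_DIM))

    def forward(self, z, y):
        return self.net(z * self.embed(y))                 # class-conditional i-vector

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(IVEC_DIM, 512), nn.LeakyReLU(0.2))
        self.adv = nn.Linear(512, 1)                       # real vs. generated head
        self.cls = nn.Linear(512, N_SPEAKERS)              # auxiliary speaker classifier

    def forward(self, x):
        h = self.trunk(x)
        return self.adv(h), self.cls(h)

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce, ce = nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss()

def train_step(real_x, real_y):
    """One alternating minimax step: D is updated to separate real from
    generated i-vectors, then G is updated to fool D."""
    b = real_x.size(0)
    z = torch.randn(b, NOISE_DIM)
    fake_y = torch.randint(0, N_SPEAKERS, (b,))
    fake_x = G(z, fake_y)

    # Discriminator update: real -> 1, generated -> 0, plus speaker classification.
    adv_r, cls_r = D(real_x)
    adv_f, _ = D(fake_x.detach())
    d_loss = (bce(adv_r, torch.ones(b, 1)) + bce(adv_f, torch.zeros(b, 1))
              + ce(cls_r, real_y))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: make generated i-vectors look real and match the label.
    adv_f, cls_f = D(fake_x)
    g_loss = bce(adv_f, torch.ones(b, 1)) + ce(cls_f, fake_y)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Calling train_step on mini-batches of real i-vectors and speaker labels performs one alternating update; after training, G(z, y) can synthesize additional i-vectors for an under-represented speaker y.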
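The auto-encoder realization and the cosine-similarity minimax can be sketched in the same spirit. Again, every dimension, architecture, and function name below is a hypothetical choice: the abstract only fixes the roles of the encoder, decoder, and discriminator, and the direction in which each player pushes the cosine similarity.

```python
# Sketch of the auto-encoder GAN with the cosine-similarity minimax
# (all sizes, architectures, and names are illustrative assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

IVEC_DIM, LATENT_DIM, NOISE_DIM = 600, 200, 100

enc = nn.Sequential(nn.Linear(IVEC_DIM, 400), nn.Tanh(), nn.Linear(400, LATENT_DIM))
dec = nn.Sequential(nn.Linear(LATENT_DIM, 400), nn.Tanh(), nn.Linear(400, IVEC_DIM))
gen = nn.Sequential(nn.Linear(NOISE_DIM, 400), nn.Tanh(), nn.Linear(400, IVEC_DIM))
disc = nn.Sequential(nn.Linear(LATENT_DIM, 200), nn.LeakyReLU(0.2), nn.Linear(200, 1))
bce = nn.BCEWithLogitsLoss()

def discriminator_loss(real_ivec):
    """Discriminator side: classify latent features as real vs. generated
    and minimize the cosine similarity between them."""
    b = real_ivec.size(0)
    fake_ivec = gen(torch.randn(b, NOISE_DIM)).detach()
    h_real, h_fake = enc(real_ivec), enc(fake_ivec)
    adv = bce(disc(h_real), torch.ones(b, 1)) + bce(disc(h_fake), torch.zeros(b, 1))
    cos = F.cosine_similarity(h_real, h_fake, dim=-1).mean()
    recon = F.mse_loss(dec(h_real), real_ivec)   # decoder rebuilds the i-vector
    return adv + cos + recon

def generator_loss(real_ivec):
    """Generator side: fool the discriminator and maximize the same cosine
    similarity (hence the minus sign), per the minimax in the abstract."""
    b = real_ivec.size(0)
    fake_ivec = gen(torch.randn(b, NOISE_DIM))
    h_real, h_fake = enc(real_ivec).detach(), enc(fake_ivec)
    adv = bce(disc(h_fake), torch.ones(b, 1))
    cos = F.cosine_similarity(h_real, h_fake, dim=-1).mean()
    return adv - cos
```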
Next, deep adversarial learning is performed to conduct nonlinear and discriminative subspace learning of i-vectors for PLDA-based speaker recognition. In general, subspace learning aims to learn a low-dimensional manifold of i-vectors that embeds and preserves neighboring information under dimensionality reduction. The learning objective is constructed from the speaker labels, so it enforces observations in the low-dimensional space to be close for the same speaker and apart across different speakers. In particular, this study presents adversarial manifold learning (AML) for speaker recognition based on PLDA using i-vectors. A nonlinear mapping between the high-dimensional observations and the low-dimensional latent variables is learned to reflect intra- and inter-speaker characteristics. AML-PLDA basically consists of an encoder for finding the latent variables and a decoder for reconstructing the i-vectors. AML incorporates a latent variable model into deep learning, so the low-dimensional latent space is constructed through adversarial learning with neighbor embedding. AML-PLDA is formulated to jointly optimize three learning objectives: a reconstruction error based on PLDA, a neighbor-embedding objective for subspace learning, and an adversarial loss induced by a discriminator and a generator. Here the encoder serves as the generator of the GAN: it is trained to fool the discriminator with its generated samples in the latent space. The parameters of the encoder, decoder, and discriminator are jointly estimated by the stochastic gradient descent algorithm over mini-batches of speaker utterances. The proposed methods are evaluated in speaker recognition experiments on the NIST i-vector Machine Learning Challenge. The experimental results show the merits of data augmentation and subspace learning based on the various realizations of adversarial learning.
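A rough sketch of the three-term AML-PLDA objective is given below. The neighbor-embedding term here is a generic pull/push formulation with an assumed unit margin; the weights lam1 and lam2, the PLDA reconstruction surrogate (plain mean-squared error), and the discriminator's reference distribution are likewise assumptions, since the abstract specifies only that reconstruction, neighbor-embedding, and adversarial losses are jointly optimized.

```python
# Sketch of the joint AML-PLDA objective (illustrative assumptions throughout).
import torch
import torch.nn as nn
import torch.nn.functional as F

IVEC_DIM, LATENT_DIM = 600, 200
enc = nn.Sequential(nn.Linear(IVEC_DIM, 400), nn.Tanh(), nn.Linear(400, LATENT_DIM))
dec = nn.Sequential(nn.Linear(LATENT_DIM, 400), nn.Tanh(), nn.Linear(400, IVEC_DIM))
disc = nn.Sequential(nn.Linear(LATENT_DIM, 200), nn.LeakyReLU(0.2), nn.Linear(200, 1))

def neighbor_embedding(h, labels, margin=1.0):
    """Pull same-speaker latents together, push different speakers apart.
    Assumes each mini-batch contains several utterances per speaker."""
    d = torch.cdist(h, h)                                  # pairwise latent distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    off_diag = ~torch.eye(len(labels), dtype=torch.bool)
    pull = d[same & off_diag].mean()                       # within-speaker compactness
    push = F.relu(margin - d[~same]).mean()                # between-speaker margin
    return pull + push

def encoder_loss(x, labels, lam1=1.0, lam2=0.1):
    """Encoder/decoder side of the minimax: the encoder doubles as the GAN
    generator, so it also tries to fool the discriminator in latent space."""
    h = enc(x)
    recon = F.mse_loss(dec(h), x)                          # PLDA-style reconstruction error
    ne = neighbor_embedding(h, labels)                     # subspace / manifold term
    adv = nn.BCEWithLogitsLoss()(disc(h), torch.ones(x.size(0), 1))
    return recon + lam1 * ne + lam2 * adv
```

In training, this loss and the discriminator's opposing objective would be alternated with stochastic gradient descent over mini-batches of speaker utterances, as the abstract describes.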
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070450746
http://hdl.handle.net/11536/142552
Appears in Collections: Thesis