Title: 深層分解及變異學習於語音辨識之研究
Deep Factorized and Variational Learning for Speech Recognition
Author: 沈辰
簡仁宗
Shen, Chen
Chien, Jen-Tzung
Institute of Communications Engineering (電信工程研究所)
Keywords: speech recognition; deep learning; matrix factorization; variational auto-encoder; neural network
Issue Date: 2016
Abstract: Deep neural networks (DNNs) are now widely used in speech recognition. Replacing the traditional Gaussian mixture model (GMM) with a DNN in the acoustic model substantially improves the capability of a speech recognition system. However, conventional neural network architectures have inherent limitations in speech recognition systems, so designing new network structures with specialized capabilities is an important research topic. This thesis aims to propose models suited to multi-way data as well as models with stochastic behavior. Building on the recurrent neural network (RNN) and the deep neural network (DNN), we propose the factorized neural network and the variational neural network. The RNN can further be extended to the long short-term memory (LSTM) network; both have a recurrent structure over the time sequence. On top of the RNN, the LSTM introduces a memory cell, an input gate, a forget gate, and an output gate, which to some extent alleviates the gradient vanishing problem.

A traditional DNN flattens its input into a vector, so for context-dependent data the time-sequence information is lost. To address this, we propose the factorized neural network. The matrix factorized neural network (MFNN) is a generalization of the vector-based neural network: by combining tensor or matrix factorization with neural network classification, it can extract and classify multi-way features. The factorized neural network effectively integrates Tucker decomposition with the traditional one-way neural network classifier, so the affine transformations of a traditional neural network are replaced by Tucker decomposition. This network can also be combined with the LSTM, allowing the model to better capture information at different time scales in the data. Higher-dimensional speech data can also be trained with the tensor factorized neural network (TFNN).

On the other hand, we propose a stochastic network based on the RNN, the variational recurrent neural network (VRNN). This network introduces stochasticity into the traditional neural network and can improve speech recognition performance. The proposed VRNN is developed from the variational auto-encoder (VAE) and the RNN. During training we use a sampling-based back-propagation method, the stochastic back-propagation algorithm. This model brings the hidden state at each time step into variational Bayesian inference. During inference, however, we encounter intractable posteriors and expectations, so we employ an inference network to handle this problem. For the supervised learning case, the lower bound we derive contains two parts: the first is the Kullback-Leibler (KL) divergence between the posterior distribution and the variational distribution, and the second is the cross entropy between the network outputs and the class labels. Compared with the traditional RNN, the proposed stochastic model uses random variables to model the uncertainty in the hidden states of the network, which helps us analyze the variability of neural networks.

In the experimental evaluation, we use the open-source toolkits Kaldi and Theano, and evaluate the speech recognition performance of the different models on the TIMIT and Aurora-4 corpora.
Deep neural network (DNN) has been recognized as a new trend for modern speech recognition. Many extensions and realizations have been developed to further improve the system performance and discover meaningful insights from different perspectives. This study aims to explore the structural information in multi-way speech observations and to incorporate a stochastic point of view into representation learning. We present factorized and variational learning for speech recognition based on the recurrent neural network (RNN), where the hidden state from neighboring time steps is merged as a memory for cyclic and temporal modeling. Such an RNN is also extended to the long short-term memory (LSTM) network, where a number of gates and cells are introduced to capture long-term dependency. The LSTM also mitigates the gradient exploding and vanishing problems in the training procedure of the RNN. Two new types of models, the matrix factorized neural network (MFNN) and the variational recurrent neural network (VRNN), are proposed.

First, we deal with the limited capability of the conventional DNN caused by the loss of contextual correlation along the temporal and spatial horizons when the time-frequency observation matrices are unfolded into vector-based inputs. The MFNN is a generalization of the vector-based neural network (NN) which performs matrix factorization and nonlinear activation on input matrices in the layer-wise forward computation. The affine transformation in the NN is replaced by Tucker decomposition in the MFNN. Such a calculation not only preserves the spatial information in the frequency domain but also extracts the temporal pattern in the time domain. In this study, a deep model based on the MFNN is built by cascading a number of factorization layers with fully-connected layers before connecting to the softmax outputs for speech recognition. This model is further extended to the matrix factorized LSTM, where the multiplications in the input gate and output gate are replaced by Tucker decomposition. Multiple acoustic features are also considered as a tensor input to carry out the tensor-factorized NN (TFNN).

On the other hand, we propose the VRNN for acoustic modeling, which is seen as a stochastic realization of the RNN. By reflecting the stochastic nature of the RNN, we can improve the representation capability as well as the speech recognition performance. To do so, we conduct variational inference for a latent variable model based on the RNN. Motivated by the variational auto-encoder (VAE), we carry out a new type of stochastic back-propagation algorithm in which a sampling method is used for efficient implementation and approximation in the training procedure. In this recurrent VAE, we introduce the class targets and optimize the variational lower bound for the supervised RNN, which is composed of two parts. One is the Kullback-Leibler divergence between the posterior distribution and the variational distribution of the latent variables. The other is the cross entropy between the network outputs and the class targets. Beyond the traditional RNN, the proposed VRNN characterizes the dependencies between latent variables across subsequent time steps.

In the experiments, we carry out the proposed methods by using the open-source Kaldi and Theano toolkits. The multi-way feature extraction ability of the MFNN is illustrated by examining the behavior of hidden neurons under the factor weights in the individual ways. Word error rates (WERs) for speech recognition on the TIMIT and Aurora-4 corpora are reported.
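For reference, the LSTM gating mechanism summarized in both abstracts is conventionally written as follows; the notation is the standard one from the literature, not necessarily the thesis's own:

\[
\begin{aligned}
\mathbf{i}_t &= \sigma(\mathbf{W}_i\mathbf{x}_t + \mathbf{U}_i\mathbf{h}_{t-1} + \mathbf{b}_i) &&\text{(input gate)} \\
\mathbf{f}_t &= \sigma(\mathbf{W}_f\mathbf{x}_t + \mathbf{U}_f\mathbf{h}_{t-1} + \mathbf{b}_f) &&\text{(forget gate)} \\
\mathbf{o}_t &= \sigma(\mathbf{W}_o\mathbf{x}_t + \mathbf{U}_o\mathbf{h}_{t-1} + \mathbf{b}_o) &&\text{(output gate)} \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tanh(\mathbf{W}_c\mathbf{x}_t + \mathbf{U}_c\mathbf{h}_{t-1} + \mathbf{b}_c) &&\text{(memory cell)} \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t) &&\text{(hidden state)}
\end{aligned}
\]

The multiplicative path through the forget gate and memory cell is what alleviates the vanishing-gradient problem mentioned above.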
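To make the "Tucker decomposition replaces the affine transformation" idea concrete, the following numpy sketch shows a single factorization layer acting on a time-frequency input matrix. The two-mode (Tucker-2) projection, the sigmoid activation, and all names and shapes are illustrative assumptions, not the thesis's exact formulation:

    import numpy as np

    def factorized_layer(X, U, V, B):
        """One MFNN-style layer (a sketch, not the thesis's exact model).

        Replaces the affine map W @ vec(X) + b of a vector-based NN with a
        two-mode (Tucker-2) projection that keeps the matrix structure:
            H = sigmoid(U.T @ X @ V + B)

        X : (T, F)   time-frequency input matrix
        U : (T, T')  factor matrix acting on the time mode
        V : (F, F')  factor matrix acting on the frequency mode
        B : (T', F') bias matrix
        """
        Z = U.T @ X @ V + B              # multi-way projection, no vectorization
        return 1.0 / (1.0 + np.exp(-Z))  # elementwise sigmoid activation

    # Hypothetical usage: 11 frames of 40-dimensional filterbank features
    rng = np.random.default_rng(0)
    X = rng.standard_normal((11, 40))
    U = rng.standard_normal((11, 4)) * 0.1
    V = rng.standard_normal((40, 16)) * 0.1
    B = np.zeros((4, 16))
    H = factorized_layer(X, U, V, B)     # (4, 16) hidden matrix
    print(H.shape)

Because the input is never unfolded into a vector, the time and frequency modes each keep their own factor weights, which is the property the abstract attributes to multi-way feature extraction.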
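The two-part supervised variational lower bound described above typically takes the standard VAE-style form below (our notation; p(z) denotes a prior over latent variables, and the first term is the negative cross entropy between network outputs and class targets):

\[
\mathcal{L}(\theta,\phi)
= \mathbb{E}_{q_\phi(\mathbf{z}\mid\mathbf{x})}\!\left[\log p_\theta(\mathbf{y}\mid\mathbf{x},\mathbf{z})\right]
- \mathrm{KL}\!\left(q_\phi(\mathbf{z}\mid\mathbf{x}) \,\middle\|\, p(\mathbf{z})\right)
\]

Maximizing this bound therefore trades off classification accuracy against keeping the variational distribution close to the prior.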
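The sampling-based stochastic back-propagation mentioned above is commonly implemented with the reparameterization trick from the VAE literature. A minimal numpy sketch under Gaussian assumptions follows; the names and shapes are illustrative, not taken from the thesis:

    import numpy as np

    rng = np.random.default_rng(1)

    def reparameterized_sample(mu, log_var):
        """Draw z ~ N(mu, diag(exp(log_var))) as a differentiable function.

        Writing z = mu + sigma * eps with eps ~ N(0, I) moves the randomness
        into eps, so gradients w.r.t. the inference-network outputs (mu,
        log_var) can flow through the sample during back-propagation.
        """
        eps = rng.standard_normal(mu.shape)   # noise independent of parameters
        return mu + np.exp(0.5 * log_var) * eps

    def gaussian_kl(mu, log_var):
        """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
        return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

    # Hypothetical inference-network outputs for one hidden state
    mu, log_var = rng.standard_normal(8), rng.standard_normal(8) * 0.1
    z = reparameterized_sample(mu, log_var)   # stochastic hidden representation
    print(z.shape, gaussian_kl(mu, log_var))

In a VRNN-style model, such a sample would stand in for the deterministic hidden state at each time step, with the KL term entering the lower bound above.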
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070360309
http://hdl.handle.net/11536/139702
Appears in Collections: Thesis