Title: Deep Long Short-Term Memory Networks for Speech Recognition
Authors: Alim Misbullah (馬俊力)
Chien, Jen-Tzung (簡仁宗)
EECS International Graduate Program
Keywords: acoustic modeling; speech recognition; feedforward neural network; long short-term memory
Issue Date: 2015
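Abstract: Speech recognition systems based on deep learning have been shown to significantly improve recognition accuracy. In recent years, the feedforward neural network (FNN) and the recurrent neural network (RNN) have been the common ways to realize deep learning and build acoustic models. An FNN extracts deep, abstract, and invariant features through multiple layers of nonlinear transformations, whereas an RNN uses recurrence to capture the latent information in temporal (time-series) data. The long short-term memory (LSTM) model stores history information effectively and has been shown to handle important information separated by long intervals in time-series data more effectively than a conventional RNN. This thesis combines the strengths of feedforward and recurrent networks and proposes a novel deep LSTM neural network architecture. It implements different cascade modules, including FNN-LSTM, LSTM-FNN, LSTM-FNN-FNN, and LSTM-FNN-LSTM, and stacks these modules into deeper network architectures. For the experimental evaluation, the proposed deep architectures are implemented with the Kaldi speech recognition toolkit. Experimental results on the 3rd CHiME Challenge and the Aurora-4 speech corpus show that deep architectures mixing FNNs and LSTMs effectively improve speech recognition accuracy in noisy environments.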
Speech recognition has been significantly improved by acoustic models based on the deep neural network (DNN), which can be realized as a feedforward neural network (FNN) or a recurrent neural network (RNN). An FNN projects the observations onto a deep invariant feature space, while an RNN captures the temporal information in sequence data. An RNN based on long short-term memory (LSTM) can memorize inputs over a long time period and thus exploit a self-learnt amount of long-range temporal context. Because the modeling capabilities of the FNN and the RNN are complementary, we present a new DNN architecture that is constructed by cascading LSTM and FNN in different ways and stacking the cascades of (1) FNN-LSTM, (2) LSTM-FNN, (3) LSTM-FNN-FNN, and (4) LSTM-FNN-LSTM in a deep model structure. Through the cascade of LSTM cells and fully connected feedforward units, we build a deep long short-term memory network that explores temporal patterns and summarizes the long history of previous inputs in a deep learning machine. In the experiments, different architectures and topologies are investigated using the open-source Kaldi toolkit. The experiments on the 3rd CHiME Challenge and Aurora-4 show that stacks of hybrid LSTM and FNN outperform the stand-alone FNN and LSTM and the other hybrid systems for noisy speech recognition.
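The cascade-and-stack idea in the abstract can be illustrated with a minimal pure-Python sketch. This is not the thesis implementation, which was built with Kaldi. The class names `FNNLayer` and `LSTMLayer`, the helper `build_stack`, the tanh activation in the feedforward layer, the uniform weight initialization, the layer widths, and the omission of an output softmax and of any training code are all assumptions made for illustration only.

```python
import math
import random

def _matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def _rand_matrix(rows, cols, rng):
    # Small uniform random weights (an assumed initialization).
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class FNNLayer:
    """Fully connected layer applied to every frame independently."""
    def __init__(self, n_in, n_out, rng):
        self.W = _rand_matrix(n_out, n_in, rng)
        self.b = [0.0] * n_out

    def forward(self, frames):
        return [[math.tanh(a + b) for a, b in zip(_matvec(self.W, x), self.b)]
                for x in frames]

class LSTMLayer:
    """Basic LSTM layer (no peepholes, no projection), run left to right."""
    def __init__(self, n_in, n_out, rng):
        self.n_out = n_out
        # One weight matrix per gate, acting on the concatenation [x_t, h_{t-1}].
        self.W = {g: _rand_matrix(n_out, n_in + n_out, rng) for g in "ifoc"}
        self.b = {g: [0.0] * n_out for g in "ifoc"}

    def forward(self, frames):
        h = [0.0] * self.n_out  # hidden state
        c = [0.0] * self.n_out  # cell state
        outputs = []
        for x in frames:
            xh = x + h  # list concatenation of input and previous hidden state
            pre = {g: [a + b for a, b in zip(_matvec(self.W[g], xh), self.b[g])]
                   for g in "ifoc"}
            i = [_sigmoid(z) for z in pre["i"]]      # input gate
            f = [_sigmoid(z) for z in pre["f"]]      # forget gate
            o = [_sigmoid(z) for z in pre["o"]]      # output gate
            cand = [math.tanh(z) for z in pre["c"]]  # candidate cell values
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, cand)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
            outputs.append(h)
        return outputs

def build_stack(cascade, n_stacks, n_in, n_hidden, seed=0):
    """Repeat a cascade such as 'LSTM-FNN-LSTM' n_stacks times."""
    rng = random.Random(seed)
    kinds = {"FNN": FNNLayer, "LSTM": LSTMLayer}
    layers, width = [], n_in
    for _ in range(n_stacks):
        for name in cascade.split("-"):
            layers.append(kinds[name](width, n_hidden, rng))
            width = n_hidden
    return layers

def forward(layers, frames):
    # Pass the whole frame sequence through each layer in turn.
    for layer in layers:
        frames = layer.forward(frames)
    return frames
```

For example, `build_stack("LSTM-FNN-LSTM", 2, 4, 3)` yields six layers, and `forward` maps a sequence of 4-dimensional frames to a sequence of 3-dimensional frames of the same length. A real acoustic model would add an output layer over HMM states and be trained on labeled speech, neither of which is shown here.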
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT070256152
http://hdl.handle.net/11536/127299
Appears in Collections: Thesis