Modular Recurrent Neural Networks-based Mandarin Speech Recognition

Title:	以模組化遞迴類神經網路為基礎之中文語音辨認 (Modular Recurrent Neural Networks-based Mandarin Speech Recognition)
Authors:	廖元甫 (Yuan-Fu Liao); 陳信宏 (Sin-Horn Chen); Institute of Communications Engineering
Keywords:	Mandarin speech recognition; modular recurrent neural network
Issue Date:	1998
Abstract:	This dissertation proposes three recurrent neural network (RNN)-based schemes for Mandarin speech recognition. The first is an RNN-based pre-classifier that speeds up hidden Markov model (HMM) recognition. It pre-classifies the input speech into three stable states (initial, final, and silence) and one transient state, then applies more restrictive constraints in the recognition search for frames in the three stable states in order to prune unlikely paths. Experiments on speaker-dependent continuous speech recognition confirmed that it can be used in conjunction with the beam search algorithm: it dropped an additional 38.7% of the search states and eliminated the likelihood calculations for an additional 35.1% of the Gaussian components, at the cost of a 0.1% degradation in recognition rate. This confirms the efficiency of the proposed fast recognition method. The second is a modular RNN (MRNN)-based method for isolated Mandarin syllable recognition.
It applies the divide-and-conquer principle to split the complicated task of recognizing 1280 syllables into five subtasks: three discrimination subtasks, for 100 initials, 39 finals, and 5 tones respectively, and two broad-class classification subtasks, one for the three broad speech classes (initial, final, and silence) and one for 9 initial sub-classes. A dedicated RNN handles each subtask, and the outputs of the five RNNs are combined directly to form the discriminant functions for all 1280 syllables. The recognizer is further extended to include two MRNNs, one operating forward in time and one backward in time. Experimental results on a multi-speaker syllable recognition task confirmed that it outperformed the MCE/GPD-trained HMM method in both recognition complexity and accuracy: the base-syllable and syllable recognition rates improved from 76.8% and 70.8% for the MCE/GPD-trained HMM to 82.8% and 76.3% for the MRNN system. The third is an MRNN-based method for continuous Mandarin base-syllable recognition. It extends the isolated-syllable MRNN method with a syllable boundary detection module and a multi-level pruning recognition search algorithm. Experimental results on a speaker-dependent speech recognition task showed that the proposed method also outperformed the MCE/GPD-trained HMM method: the base-syllable recognition rate rose from 80.9% for the ML-trained HMM and 84.3% for the MCE/GPD-trained HMM to 85.8% for the MRNN system. In addition, only 53.5% of the surviving base-syllable states and 25.3% of the surviving base-syllable transitions needed to be considered in the multi-level pruning search, with no degradation in recognition accuracy. We therefore conclude that the RNN-based approach is very promising for both isolated and continuous Mandarin speech recognition.
URI:	http://140.113.39.130/cdrfb3/record/nctu/#NT870435001 http://hdl.handle.net/11536/64458
Appears in Collections:	Thesis
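
Illustrative note: the abstract states that the outputs of the module RNNs are "directly combined" into discriminant functions for every syllable, but it does not give the combination rule. The Python sketch below shows one plausible reading, assuming each syllable is an (initial, final, tone) triple and that its score is the sum of the log outputs of the three discrimination modules. The summation rule, the toy lexicon, the output values, and the names `combine_scores`, `recognize`, and `LEXICON` are all assumptions made for illustration, and the two broad-class weighting modules are left out. It is not the dissertation's actual implementation.

```python
import math

# Hypothetical toy lexicon: syllable -> (initial, final, tone).
# The real task covers 1280 syllables; these four are for illustration only.
LEXICON = {
    "ma1": ("m", "a", 1),
    "ma3": ("m", "a", 3),
    "ba1": ("b", "a", 1),
    "mi3": ("m", "i", 3),
}

FLOOR = 1e-12  # guards against log(0) when a module outputs zero


def combine_scores(initial_out, final_out, tone_out, lexicon):
    """Score each syllable as the sum of the log outputs of its three units.

    Assumption: summing log outputs (i.e. multiplying the module outputs)
    stands in for the unspecified "direct combination" in the abstract.
    """
    scores = {}
    for syllable, (initial, final, tone) in lexicon.items():
        scores[syllable] = (
            math.log(max(initial_out[initial], FLOOR))
            + math.log(max(final_out[final], FLOOR))
            + math.log(max(tone_out[tone], FLOOR))
        )
    return scores


def recognize(initial_out, final_out, tone_out, lexicon):
    """Return the syllable with the highest combined score."""
    scores = combine_scores(initial_out, final_out, tone_out, lexicon)
    return max(scores, key=scores.get)


# Made-up module outputs for a single utterance.
initial_out = {"m": 0.7, "b": 0.3}
final_out = {"a": 0.6, "i": 0.4}
tone_out = {1: 0.2, 3: 0.8}

best = recognize(initial_out, final_out, tone_out, LEXICON)
print(best)
```

Scoring by unit rather than by whole syllable means the modules only need to discriminate among initials, finals, and tones, while the lexicon determines which combinations count as valid syllables.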