Feature-Based Visual Speech Recognition Using Time-Delay Neural Network

Title:	Feature-Based Visual Speech Recognition Using Time-Delay Neural Network
Authors:	梁欣蕙 (Liang, Shin-Hwei); 林進燈 (Chin-Teng Lin); Institute of Electrical and Control Engineering
Keywords:	Neural Network; TDNN; MDA; MFSA; Visual Speech Recognition
Issue Date:	1996
Abstract:	This thesis proposes a technique for automatic mouth feature detection and mouth motion recognition for visual speech recognition (lip reading). The technique has three stages: mouth detection and extraction, mouth feature extraction, and neural network learning. In the mouth detection stage, the first step is to locate the human face. For practical use, no constraints are placed on the user, so a Hough transform is used to find candidate face locations against complex backgrounds; the symmetry of the human face is used to reduce the Hough search to a three-dimensional one and to redefine the search region. A Mouth Detection Algorithm (MDA) then verifies the candidate regions, and three further procedures (normalization, adjustment, and template matching) are applied to the candidate mouth images, after which exactly one mouth image is selected as the winner. In the feature extraction stage, one procedure locates the mouth corners, and a refined Mouth Feature Searching Algorithm (MFSA) locates four feature points on the two lips. These four points are central to the system: together with the mouth corners they determine two parabolas, which define a precise mouth model from which eleven features are extracted as the input vector for the classifier. In the last stage, a time-delay neural network (TDNN) is used as the classifier because it tolerates shifts of the signal along the time axis. Many experiments were carried out to determine which mouth features are crucial and sufficient for lip reading. In off-line, speaker-dependent recognition of the ten digits spoken in Chinese, the system reaches a 90% recognition rate. Compared with two other methods, the proposed method performs better while using less memory and less training time. Finally, the system is extended to six speakers to test its robustness, and the experimental results again show the stability and practicality of the proposed approach.
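The shift tolerance that motivates the choice of a TDNN can be illustrated with a minimal sketch. It is not the network from the thesis: the single hidden unit, the two-frame delay window, the tanh activation, and all weight values below are assumptions chosen for illustration; only the eleven-element frame vector comes from the abstract. The same weights are applied at every time position, and the per-position activations are then summed over time, so moving an utterance earlier or later changes where the activations occur but not their sum. The equality is exact here only because the padding frames are all zero and the bias is zero.

```python
import math


def time_delay_unit(frames, weights, bias=0.0):
    """Apply one time-delay unit to a sequence of feature frames.

    frames  -- list of T feature vectors (one per video frame)
    weights -- weights[d][k] is the weight for feature k at delay d;
               the same weights are reused at every time position
    Returns one tanh activation per window position (T - delays + 1 values).
    """
    delays = len(weights)
    activations = []
    for t in range(len(frames) - delays + 1):
        total = bias
        for d in range(delays):
            for w, x in zip(weights[d], frames[t + d]):
                total += w * x
        activations.append(math.tanh(total))
    return activations


def integrate_over_time(activations):
    """Sum activations over time so the score ignores where they occurred."""
    return sum(activations)


if __name__ == "__main__":
    n_features = 11  # eleven mouth features per frame, as in the abstract
    silence = [0.0] * n_features
    # Hypothetical feature values for a two-frame mouth movement.
    frame_a = [0.1 * k for k in range(n_features)]
    frame_b = [0.05 * (10 - k) for k in range(n_features)]
    # Hypothetical, untrained weights: one row per delay.
    weights = [[0.2] * n_features, [-0.1] * n_features]

    early = [silence, frame_a, frame_b, silence, silence]
    late = [silence, silence, frame_a, frame_b, silence]

    print(integrate_over_time(time_delay_unit(early, weights)))
    print(integrate_over_time(time_delay_unit(late, weights)))
```

Both printed scores are the same even though the movement starts one frame later in the second sequence. With real lip sequences the padding is not exactly zero, so a trained network would be tolerant of shifts rather than exactly invariant to them.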
URI:	http://140.113.39.130/cdrfb3/record/nctu/#NT850327018 http://hdl.handle.net/11536/61672
Appears in Collections:	Thesis