Title: | A Text-independent Speaker Recognition System based on the Modeling of the Glottal Flow Derivation Waveform (基於模擬聲門來源波型之語者辨識系統與確認技術) |
Authors: | 游家昇 林進燈 Institute of Electrical and Control Engineering |
Keywords: | glottal flow; speaker recognition; speaker verification; vocal tract |
Issue Date: | 2005 |
Abstract: | This thesis presents a text-independent, automatic technique for estimating and modeling the glottal flow derivative source waveform from speech signals and for passing the model parameters to a speaker recognition and verification system. Because a speech signal is produced by the interaction between the glottal flow derivative and the human vocal tract, we assume that the glottal flow derivative waveform carries most of the speaker identity information, and we set up experiments to verify this assumption. The glottal flow derivative is estimated by inverse filtering: an inverse filter is derived from a vocal tract model that is built from a large database of X-ray images and simulated with digital signal processing, and this inverse filter is multiplied with the frequency-domain representation of the original speech signal. The resulting model parameters are fed to an ML-based Gaussian Mixture Model (GMM) classifier that uses 26-dimensional features: 12 Mel-Frequency Cepstral Coefficients (MFCC), 8 delta-cepstral, 4 delta-delta-cepstral, 1 delta-energy, and 1 delta-delta-energy parameters. The classifier uses the conventional ML-based GMM and the Expectation Maximization (EM) algorithm to compute the difference between the GMM scores of the background model and the hypothesized model. The aim of this thesis is to verify whether the glottal flow derivative waveform contains speaker-specific biometric features, not to optimize the recognition rate. On a large set from the TIMIT database, the average correct rate over male and female speakers combined is about 60%. With the same speaker data, the proposed structure achieves a better recognition rate than the conventional ML-based GMM with MFCC features. This supports our assumption that the glottal flow derivative waveform does convey speaker identity information. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT009312547 http://hdl.handle.net/11536/78229 |
Appears in Collections: | Thesis |
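The abstract describes a 26-dimensional feature vector and a score computed as the difference between the GMM scores of a hypothesized speaker model and a background model. The following is a minimal sketch of that scoring step only, not the thesis implementation. It assumes diagonal covariances, per-frame averaging, and a hypothetical `(weight, mean, var)` tuple representation for each mixture component; none of these details are stated in the record.

```python
import math

# Feature layout as stated in the abstract: 12 + 8 + 4 + 1 + 1 = 26 dimensions.
FEATURE_LAYOUT = {
    "mfcc": 12,
    "delta_cepstral": 8,
    "delta_delta_cepstral": 4,
    "delta_energy": 1,
    "delta_delta_energy": 1,
}
FEATURE_DIM = sum(FEATURE_LAYOUT.values())


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x (assumed form)."""
    total = 0.0
    for xi, mi, vi in zip(x, mean, var):
        total += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return total


def gmm_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood of frames under a GMM.

    gmm is a list of (weight, mean, var) tuples; this representation is
    hypothetical and chosen only for illustration.
    """
    total = 0.0
    for x in frames:
        comp = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        peak = max(comp)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(c - peak) for c in comp))
    return total / len(frames)


def llr_score(frames, speaker_gmm, background_gmm):
    """Hypothesized-speaker score minus background-model score."""
    return (gmm_log_likelihood(frames, speaker_gmm)
            - gmm_log_likelihood(frames, background_gmm))
```

A positive `llr_score` means the frames fit the hypothesized speaker model better than the background model. How the score is thresholded for verification is not given in the record and is left open here.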