Title: Speed Improvement for Continuous Speech Recognition (連續語音辨認的速度改進研究)
Authors: Wei-Yun Ma (馬偉雲)
Chi-Min Liu (劉啟民)
Institute of Computer Science and Engineering
Keywords: continuous speech recognition; Viterbi; beam search; dynamic beam search
Issue Date: 1998
Abstract: For Mandarin continuous speech recognition to be practical as a computer input method, it needs both a high recognition rate and a short recognition time. This thesis studies how to cut recognition time substantially while maintaining the recognition rate. We use the Chinese word as the search unit and integrate a word bigram language model into a one-pass Viterbi search, so that acoustic processing and linguistic processing are combined to obtain a global optimum. On a Pentium 450 MHz test machine with 20 continuous Mandarin test sentences, this baseline reaches a character recognition rate of 47.87% with an average recognition time of 13.4 sec per sentence. Although the accuracy of this approach is good, its search space is very large: in acoustic processing the search space is proportional to the vocabulary size, and in linguistic processing it is proportional to the square of the vocabulary size. A search space this large severely increases recognition time. We therefore present two methods to address the problem. The first reduces the search space in acoustic processing: it replaces the fixed beam width of conventional beam search with a dynamic beam search that adjusts the beam width over time. It achieves a character recognition rate of 48.94% and an average recognition time of 9.93 sec. The second reduces the search space in linguistic processing: before the word-based recognition, a very fast method detects the frames that could be boundaries between words, and the bigram computation is applied only at those frames. It achieves a character recognition rate of 47.87% and an average recognition time of 8.54 sec. Combining the two methods gives the best result: a character recognition rate of 48.94% and an average recognition time of 7.13 sec.
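To make the difference between a fixed and a time-varying beam concrete, here is a minimal Python sketch of beam-pruned Viterbi search over a toy state space. It is not the thesis's implementation: the abstract does not say how the beam width is adjusted, so the `linear_beam` schedule, its `start` and `end` values, and the plain state-level model (no word lexicon, no bigram, uniform initial scores) are all assumptions made for illustration.

```python
import math


def prune(scores, beam_width):
    """Keep only hypotheses whose log score is within beam_width of the best."""
    best = max(scores.values())
    return {s: v for s, v in scores.items() if v >= best - beam_width}


def linear_beam(t, num_frames, start=8.0, end=3.0):
    """Hypothetical schedule: beam narrows linearly from start to end.

    The thesis abstract only says the width changes over time; this
    particular rule is an assumption, not the author's method.
    """
    if num_frames <= 1:
        return start
    return start + (end - start) * t / (num_frames - 1)


def beam_viterbi(obs_logp, trans_logp, beam_at):
    """Viterbi search with beam pruning.

    obs_logp[t][s]   : log-likelihood of frame t under state s
    trans_logp[p][s] : log transition score from state p to state s
    beam_at(t)       : beam width to use at frame t (constant => fixed beam)

    Returns (best_path, best_score, active_counts), where active_counts[t]
    is the number of hypotheses that survived pruning at frame t.
    """
    num_states = len(obs_logp[0])
    # Assumption: every state may start a path with the same prior score.
    active = prune({s: obs_logp[0][s] for s in range(num_states)}, beam_at(0))
    active_counts = [len(active)]
    backptrs = []

    for t in range(1, len(obs_logp)):
        new_scores, ptr = {}, {}
        # Only surviving hypotheses are extended; this is where time is saved.
        for p, p_score in active.items():
            for s in range(num_states):
                cand = p_score + trans_logp[p][s] + obs_logp[t][s]
                if s not in new_scores or cand > new_scores[s]:
                    new_scores[s], ptr[s] = cand, p
        active = prune(new_scores, beam_at(t))
        active_counts.append(len(active))
        backptrs.append(ptr)

    # Trace back from the best surviving final state.
    state = max(active, key=active.get)
    best_score = active[state]
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, best_score, active_counts
```

Passing `lambda t: 1.5` as `beam_at` gives a conventional fixed beam, while `lambda t: linear_beam(t, num_frames)` gives one time-dependent variant. Comparing the returned `active_counts` shows how many hypotheses each setting keeps alive per frame, which is the quantity that drives the acoustic search cost.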
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT870392083
http://hdl.handle.net/11536/64108
Appears in Collections: Thesis