Title: Computational Auditory Scene Analysis for Speech Separation
Author: CHI TAI-SHIH (冀泰石)
Department (Institute) of Communication Engineering, National Chiao Tung University
Keywords: Computational Auditory Scene Analysis; Auditory Grouping; Auditory Streaming; Cocktail Party Effect; Auditory Synthesizer
Date of Issue: 2009
Abstract: Although automatic speech recognizers (ASRs) usually achieve high recognition rates on clean speech, their performance drops dramatically as interference increases. People with normal hearing, on the other hand, perform far more robustly than ASRs in noisy environments. How do people do this so easily? The key is Auditory Scene Analysis (ASA), which refers to the human auditory system's remarkable ability to analyze the instantaneous properties of sounds and their sequential variations over time, i.e., the spectral and temporal properties of sounds. This ability lets us segregate sound mixtures into individual sound streams, so that we can communicate freely with others at a noisy cocktail party. In the proposed research, we will develop a computational ASA (CASA) algorithm, based on human perception of the spectro-temporal properties of sounds, that mimics this human hearing ability. We will use a spectro-temporal auditory analysis model to extract the crucial cues that humans rely on in the ASA process. The cues most commonly believed to be used by humans are pitch (harmonicity), onset/offset, amplitude modulation, and frequency modulation. Our CASA algorithm will therefore consist of three functional modules: (1) extraction of the four perceptual cues; (2) simultaneous grouping and sequential grouping; (3) inversion of the grouped spectro-temporal patterns back into acoustic waveforms. The outputs of the CASA algorithm are separated sound streams, which will be fed to a speech recognition simulation or a speech quality evaluation to validate the algorithm.
Official Document #: NSC98-2221-E009-092
URI: http://hdl.handle.net/11536/101368
https://www.grb.gov.tw/search/planDetail?id=1898350&docId=314358
Appears in Collections: Research Plans