Full metadata record
DC Field | Value | Language
dc.contributor.author | 冀泰石 | en_US
dc.contributor.author | CHI TAI-SHIH | en_US
dc.date.accessioned | 2014-12-13T10:48:33Z | -
dc.date.available | 2014-12-13T10:48:33Z | -
dc.date.issued | 2009 | en_US
dc.identifier.govdoc | NSC98-2221-E009-092 | zh_TW
dc.identifier.uri | http://hdl.handle.net/11536/101368 | -
dc.identifier.uri | https://www.grb.gov.tw/search/planDetail?id=1898350&docId=314358 | en_US
dc.description.abstract | Automatic speech recognition systems often achieve high recognition rates for clean speech, but their performance degrades sharply as interference increases. On the other hand, people with normal hearing perform more robustly than automatic speech recognizers in noisy environments. How do people manage this so easily? The key is auditory scene analysis: the human auditory system's remarkable ability to analyze the instantaneous properties of sounds and the continuous variation of those properties over time. This ability lets us segregate a speech mixture into individual sound streams, so that we can converse freely with others in surroundings as noisy as a cocktail party.
In the proposed research, we will mimic human auditory function by developing a computational auditory scene analysis algorithm based on human perception of the spectral and temporal properties of sounds. We will use a spectro-temporal auditory analysis model to extract the cues in speech that humans rely on during auditory scene analysis. The cues most widely agreed to be used by humans are pitch (harmonicity), onset/offset, amplitude modulation, and frequency modulation. Our target algorithm will therefore comprise three functional modules: (1) extraction of the four perceptual cues, (2) simultaneous and sequential grouping of the speech, and (3) inversion of the grouped spectrograms back to sound. The outputs of the algorithm are separated sound streams, and we will validate its performance with speech recognition simulations or speech quality measurements. | zh_TW
dc.description.abstract | Although Automatic Speech Recognizers (ASR) usually achieve good recognition rates for clean speech samples, their performance drops dramatically with increasing interference. On the other hand, people with normal hearing consistently perform more robustly than ASRs in noisy environments. How do people do that so easily? The key phrase is “Auditory Scene Analysis” (ASA), which refers to the human auditory system’s splendid ability to analyze instantaneous properties of sounds and their sequential variations over time, i.e., the spectral and temporal properties of sounds. This ability enables us to segregate sound mixtures into individual sound streams so that we can communicate freely with others at a cocktail party.
In this proposed research, we will develop a computational ASA (CASA) algorithm based on human perception of the spectro-temporal properties of sounds, mimicking the human ability to hear. We will utilize a spectro-temporal auditory analysis model to extract crucial cues of sounds, which are used by humans in the ASA process. The cues most commonly believed to be used by humans are pitch (harmonicity), onset/offset, amplitude modulation, and frequency modulation. Our desired CASA algorithm will therefore consist of three functional modules: (1) extraction of the four perceptual cues; (2) simultaneous grouping and sequential grouping; (3) inversion of the grouped spectro-temporal patterns back to acoustic waveforms. The outputs of the CASA algorithm are separated sound streams, which will be used in a speech recognition simulation or a speech quality evaluation to validate the algorithm. | en_US
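As a concrete illustration of the three-module pipeline described in the abstract, here is a minimal, hypothetical Python sketch. It is not the proposal's implementation: it substitutes a plain STFT for the spectro-temporal auditory model, uses crude stand-ins for the perceptual cues (frequency modulation is omitted for brevity), and reduces grouping to a binary time-frequency mask. Every function and parameter name is illustrative.

```python
import numpy as np

def stft(x, win=512, hop=256):
    """Short-time Fourier transform with a Hann window."""
    w = np.hanning(win)
    frames = [np.fft.rfft(w * x[i:i + win])
              for i in range(0, len(x) - win, hop)]
    return np.array(frames).T  # shape: (freq bins, time frames)

def extract_cues(spec):
    """Module 1: crude per-frame stand-ins for the perceptual cues.
    A real system would derive these from an auditory filterbank."""
    mag = np.abs(spec)
    energy = mag.sum(axis=0)
    onset = np.maximum(np.diff(energy, prepend=energy[0]), 0)      # onset/offset cue
    am = np.abs(np.diff(mag, axis=1, prepend=mag[:, :1])).mean(0)  # amplitude modulation
    harmonicity = mag[:40].sum(axis=0) / (energy + 1e-9)           # pitch/harmonicity proxy
    return {"onset": onset, "am": am, "harmonicity": harmonicity}

def group(spec, cues, thresh=0.5):
    """Module 2: simultaneous grouping as a binary time-frequency mask;
    sequential grouping across frames is trivially 'keep all' here."""
    mask = np.zeros(spec.shape, dtype=bool)
    mask[:, cues["harmonicity"] > thresh] = True
    return mask

def invert(spec, mask, win=512, hop=256):
    """Module 3: overlap-add resynthesis of the masked spectrogram."""
    frames = (spec * mask).T
    out = np.zeros(hop * (len(frames) - 1) + win)
    w = np.hanning(win)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + win] += w * np.fft.irfft(f, n=win)
    return out

# Toy usage: pull the harmonic "target" stream out of a tone-plus-noise mixture.
fs = 16000
t = np.arange(fs) / fs
mixture = np.sin(2 * np.pi * 220 * t) + 0.3 * np.random.randn(fs)
spec = stft(mixture)
target = invert(spec, group(spec, extract_cues(spec)))
```

The point of the sketch is only the data flow the abstract names: extracted cues drive a grouping mask, and the masked spectro-temporal pattern is inverted by overlap-add back to a waveform.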
dc.description.sponsorship | 行政院國家科學委員會 (National Science Council, Executive Yuan) | zh_TW
dc.language.iso | zh_TW | en_US
dc.subject | Computational Auditory Scene Analysis | zh_TW
dc.subject | Auditory Grouping | zh_TW
dc.subject | Auditory Streaming | zh_TW
dc.subject | Cocktail Party Effect | zh_TW
dc.subject | Auditory Synthesizer | zh_TW
dc.subject | Computational Auditory Scene Analysis | en_US
dc.subject | Auditory Grouping | en_US
dc.subject | Auditory Streaming | en_US
dc.subject | Cocktail Party Effect | en_US
dc.subject | Auditory Synthesizer | en_US
dc.title | A Study of Computational Auditory Scene Analysis for Speech Separation | zh_TW
dc.title | Computational Auditory Scene Analysis for Speech Separation | en_US
dc.type | Plan | en_US
dc.contributor.department | Department of Communication Engineering, National Chiao Tung University | zh_TW
Appears in Collections: Research Plans