Title: 中文多義詞標記及其在語言模型的應用
Chinese Multiple Word Sense Labeling and Its Application to Language Modeling
Authors: 陳威遠
王逸如
Chen, Wei-Yuan
電機工程學系
Keywords: 語音辨認;語言模型;遞迴式類神經網路語言模型;詞向量;一詞多義向量;詞義標記;speech recognition;language model;Recurrent Neural Network Language Model;word vector;multiple word sense vector;word sense label
Issue Date: 2017
Abstract: 本論文主要研究分為語言模型的改善和中文詞向量的研究與應用。在語言模型的改善,我們使用加權有限狀態轉換機於語音辨識上,透過事先給定正確的音素序列取代聲學模型,使得辨識結果完全由語言模型決定。我們藉由改善斷詞後處理和發音字典建立不同的語言模型使辨識率提升。
另外一個研究是有關中文詞向量的研究與應用。我們研究一詞多義對中文詞向量的影響,使用非監督式的學習方法利用詞向量標記一詞多義,透過上下文環境和詞性資訊進行詞義標記來解決一詞多義的問題,並將改善後的結果進行多種定性分析,最後將詞義資訊加入於語言模型中,訓練出一個具有詞義資訊的語言模型。
This thesis consists of two parts: improving the language model, and studying Chinese word embeddings and their applications. In the first part, we use a weighted finite-state transducer for speech recognition. We replace the acoustic model with the correct phoneme sequence, given in advance, so that the recognition result depends only on the language model. We then build different language models by improving the post-processing of word segmentation and the pronunciation dictionary, which raises recognition accuracy.
In the second part, we study the effect of polysemy on Chinese word vectors. To address polysemy, we use unsupervised learning to label word senses with multiple word sense vectors, which are learned from context and part-of-speech information. We evaluate the improved vectors with several qualitative analyses. Finally, we train a language model that incorporates word sense information, using a corpus whose polysemous words have been labeled with the multiple word sense vectors.
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070350741
http://hdl.handle.net/11536/140329
Appears in Collections: Thesis