Title: Chinese Multiple Word Sense Labeling and Its Application to Language Modeling (中文多义词标记及其在语言模型的应用)
Authors: 陈威远
王逸如
Chen, Wei-Yuan
Department of Electrical Engineering
Keywords: speech recognition; language model; recurrent neural network language model; word vector; multiple word sense vector; word sense label
Issue Date: 2017
Abstract: This thesis has two parts: improving language models, and studying and applying Chinese word embeddings. In the first part, we use a weighted finite-state transducer for speech recognition. The acoustic model is replaced by the correct phoneme sequence, supplied in advance, so that the recognition result is determined entirely by the language model. We then build different language models by improving the post-processing of word segmentation and the pronunciation dictionary, which raises recognition accuracy.
The second part concerns Chinese word embeddings. We study how polysemy affects Chinese word vectors. To address polysemy, we use an unsupervised learning method that labels polysemous words with word vectors, assigning each occurrence a word sense from its context and part-of-speech information. We assess the improved result with several qualitative analyses. Finally, we add the word sense information to the language model, training a language model that carries sense information on the sense-labeled corpus.
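Illustration (not taken from the thesis): the abstract does not say which unsupervised method is used, so the sketch below substitutes plain k-means over averaged context vectors as a generic stand-in and leaves out the part-of-speech information. The toy vectors, the example sentences, and the names `context_vector`, `kmeans`, and `label_senses` are all hypothetical.

```python
from math import dist

# Toy 2-D word vectors; the values are invented for illustration only.
WORD_VECS = {
    "吃": (1.0, 0.1), "水果": (0.9, 0.0), "甜": (0.8, 0.2),
    "手机": (0.0, 1.0), "公司": (0.1, 0.9), "发布": (0.2, 0.8),
}

def context_vector(tokens, index, window=2):
    """Average the vectors of known words within `window` of tokens[index]."""
    lo, hi = max(0, index - window), index + window + 1
    neighbours = [WORD_VECS[t] for i, t in enumerate(tokens[lo:hi], lo)
                  if i != index and t in WORD_VECS]
    if not neighbours:
        return None
    return tuple(sum(c) / len(neighbours) for c in zip(*neighbours))

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: dist(point, centroids[j]))

def kmeans(points, k, iterations=10):
    """Plain k-means with deterministic farthest-point initialisation."""
    if not points:
        return []
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return [nearest(p, centroids) for p in points]

def label_senses(sentences, target, k=2):
    """Rewrite each occurrence of `target` as `target_<cluster id>`."""
    positions, points = [], []
    for s, tokens in enumerate(sentences):
        for i, tok in enumerate(tokens):
            if tok == target:
                vec = context_vector(tokens, i)
                if vec is not None:  # occurrences without context stay unlabeled
                    positions.append((s, i))
                    points.append(vec)
    labels = kmeans(points, k)
    labeled = [list(tokens) for tokens in sentences]
    for (s, i), lab in zip(positions, labels):
        labeled[s][i] = f"{target}_{lab}"
    return labeled

# Pre-segmented toy sentences: "苹果" as a fruit and as a company.
SENTENCES = [
    ["吃", "苹果", "甜"],
    ["苹果", "公司", "发布", "手机"],
    ["水果", "苹果", "甜"],
    ["苹果", "手机", "发布"],
]
labeled = label_senses(SENTENCES, "苹果")
for sentence in labeled:
    print(" ".join(sentence))
```

A corpus relabeled in this way treats each sense as its own vocabulary item, which is one way a language model could be trained with sense information, as the abstract describes.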
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070350741
http://hdl.handle.net/11536/140329
Appears in Collections: Thesis