Title: 中文語言模型及語意概念自動標示
Language Model and Word Semantic Labeling in Chinese
Authors: 蔡易儒
Cai, Yih-Ru
王逸如
Wang, Yih-Ru
Institute of Communications Engineering (電信工程研究所)
Keywords: Language Model; K-Nearest Neighbors (KNN); E-HowNet
Issue Date: 2015
Abstract: This thesis addresses two topics. The first is improving the language model, covering the word segmenter, text normalization, word selection, and so on. To reduce the complexity (perplexity) of the language model, we select commonly used words for training and discard some words that are frequent but unevenly distributed across the corpus. The trained language model is converted into a weighted finite-state transducer (WFST) and applied to syllable-sequence recognition; compared with a conventional recognition system, the WFST has a smaller model size and a shorter recognition time. Finally, we evaluate language models by recognition rate and perplexity. The second topic is extracting deeper word meanings from the text corpus: taking the words already defined in E-HowNet as labeled sample points, we train word vectors on the text corpus so that each word (sample point) is assigned a vector, compute the cosine similarity between words, and apply the k-nearest-neighbors (KNN) algorithm to automatically label each new word with one or more senses. Beyond the labeling itself, we also examine in more depth the word information in E-HowNet, the accuracy of the semantic labeling, and related issues.
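The labeling step described above (cosine similarity between word vectors, then a KNN majority vote over E-HowNet-defined words) can be sketched as follows. This is a minimal illustration, not the thesis implementation: the tiny two-dimensional vectors and the `animal`/`furniture` labels below are invented placeholders standing in for the trained word vectors and E-HowNet sense definitions.

```python
from collections import Counter
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_label(query_vec, samples, k=3):
    """Label a new word's vector by majority vote among the k labeled
    sample words with the highest cosine similarity to it."""
    nearest = sorted(samples, key=lambda s: cosine(query_vec, s[1]),
                     reverse=True)[:k]
    votes = Counter(label for _word, _vec, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D stand-ins for trained word vectors; the labels play the role
# of the sense information of words already defined in E-HowNet.
samples = [
    ("狗",   (0.9, 0.1), "animal"),
    ("貓",   (0.8, 0.2), "animal"),
    ("桌子", (0.1, 0.9), "furniture"),
]
```

With these toy samples, `knn_label((0.85, 0.15), samples, k=2)` returns `"animal"`, since both nearest neighbors by cosine similarity are animal words; labeling a word with several senses would amount to keeping more than one entry from the vote.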
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT070260260
http://hdl.handle.net/11536/127287
Appears in Collections: Thesis