Semi-supervised Chinese Word Sense Disambiguation Based on E-HowNet

Title:	Semi-supervised Chinese Word Sense Disambiguation Based on E-HowNet (基於廣義知網之半監督式中文詞義消歧)
Authors:	Lin, Chin-Kui (林晉逵); Chen, Sin-Horng (陳信宏); Department of Electrical Engineering
Keywords:	word sense disambiguation; word vector; sense vector; sense tagging; polysemy; polysemous word; semi-supervised
Issue Date:	2017
Abstract:	Language use changes continually, so a single word can carry different senses in different contexts, and this polysemy has a considerable effect on natural language processing applications. Word vectors are one example: training assigns exactly one vector to each word, so without word sense disambiguation the quality of the resulting vectors is limited. Because no Chinese sense-tagged corpus is available, this thesis proposes a semi-supervised word sense disambiguation method based on E-HowNet that requires no sense-tagged data. Sense information in E-HowNet is used to determine how many senses each polysemous word has and how each is defined, and for every sense of a target polysemous word, near-synonyms with matching definitions are retrieved. The sense of each occurrence is then judged from the probability that those near-synonyms appear in the target word's context window, and the corpus is sense-tagged on that basis. The sense-tagged corpus is used to train sense vectors with Word2vec; the sense vectors are in turn used to compute the probability of each sense of the target word in its context, and the corpus is relabeled. Better labels yield more accurate sense vectors, and more accurate sense vectors yield more precise labels, so labeling and training are repeated to improve both. The result is a Chinese sense-tagged corpus and a set of sense vectors.
URI:	http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070450744 http://hdl.handle.net/11536/142290
Appears in Collections:	Thesis
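
The label-train-relabel loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: co-occurrence counts stand in for Word2vec training, cosine similarity stands in for the probability estimate, and the synonym lists are supplied by hand rather than retrieved from E-HowNet. The function names (`context`, `cooccurrence`, `label`, `disambiguate`), the `word#sense` tag format, and the toy English corpus are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def context(tokens, i, window):
    """Words within `window` positions of index i, excluding position i."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]


def cooccurrence(corpus, window):
    """Count-based context vectors; a simple stand-in for Word2vec training."""
    vecs = defaultdict(Counter)
    for sent in corpus:
        for i, word in enumerate(sent):
            vecs[word].update(context(sent, i, window))
    return vecs


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def label(corpus, target, sense_vecs, window):
    """Tag each occurrence of `target` with the best-matching sense."""
    tagged = []
    for sent in corpus:
        new = list(sent)
        for i, word in enumerate(sent):
            if word == target:
                ctx = Counter(context(sent, i, window))
                # Ties (including all-zero scores) go to the first sense listed.
                best = max(sense_vecs, key=lambda s: cosine(ctx, sense_vecs[s]))
                new[i] = f"{target}#{best}"
        tagged.append(new)
    return tagged


def disambiguate(corpus, target, synonyms, rounds=2, window=2):
    """Initial labeling from synonyms, then repeated retraining and relabeling."""
    vecs = cooccurrence(corpus, window)
    # Initial sense vectors: summed context vectors of each sense's synonyms.
    sense_vecs = {s: sum((vecs[w] for w in syns), Counter())
                  for s, syns in synonyms.items()}
    tagged = label(corpus, target, sense_vecs, window)
    for _ in range(rounds):
        # "Retrain" on the tagged corpus, then relabel the raw corpus.
        vecs = cooccurrence(tagged, window)
        sense_vecs = {s: vecs[f"{target}#{s}"] or sense_vecs[s] for s in synonyms}
        tagged = label(corpus, target, sense_vecs, window)
    return tagged


# Toy data, invented for illustration only.
corpus = [
    ["deposit", "money", "bank", "account"],
    ["lender", "money", "account", "loan"],
    ["river", "water", "bank", "fishing"],
    ["shore", "river", "water", "sand"],
]
synonyms = {"finance": ["lender"], "river": ["shore"]}
tagged = disambiguate(corpus, "bank", synonyms)
print(tagged[0][2], tagged[2][2])  # prints: bank#finance bank#river
```

In a fuller version, the `cooccurrence` step would presumably be replaced by actual Word2vec training on the tagged corpus, which is what the abstract describes; the count vectors here only keep the sketch dependency-free.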