A Conditional Random Field-Based Approach to Chinese Tree Structure Labeling (基於條件隨機場之中文樹狀結構標記)

Title:	基於條件隨機場之中文樹狀結構標記 A Conditional Random Field-based Chinese Tree Structure Labeling
Authors:	康富傑 Kang, Fu-Jie; 王逸如 Wang, Yih-Ru; Institute of Communications Engineering (電信工程研究所)
Keywords:	條件隨機場;中文句法結構樹;停頓標記;Conditional Random Field;Chinese syntax tree;Break prediction
Issue date:	2015
Abstract:	Syntactic tree parsing is a central task in Chinese natural language processing (NLP). In Chinese, the word is the smallest meaningful linguistic unit and a sentence is composed of multiple words; parsing determines how words attach to one another and which words should be joined first. Recent work tends to apply machine learning to Chinese word segmentation and parsing, and traditional Chinese parsers reach an F-measure of roughly 70% on tree structure labeling. The first part of this study uses conditional random fields (CRFs) to train and label Chinese syntactic structure. The training corpus consists of parses produced by the parser of the Academia Sinica CKIP group (中研院詞庫小組); because these parses are not fully correct, some of the erroneous ones were manually corrected before CRF training. Evaluated on the test corpus, the model achieves an F-measure above 80%. Prior literature indicates that Chinese syntactic structure and Mandarin prosodic structure are related to some degree. The second part of this study therefore defines a prosodic break tree based on the pause duration between words and labels it with the same machine-learning approach used in the first part. The labeling accuracy is 57.80% and 81.25% for the long breaks B3 and B4, which are easier to identify, but only 35.54% for the short break B2-2, which is harder to identify.
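The abstract reports tree structure labeling quality as an F-measure but does not say how it was computed. As a rough illustration only, the sketch below assumes a PARSEVAL-style unlabeled bracket score, where each tree is reduced to a set of word spans and precision and recall are taken over matching spans. The function name `bracket_f_measure`, the span representation, and the example trees are hypothetical and are not taken from the thesis.

```python
from typing import Set, Tuple

# A constituent is represented by the word span it covers:
# (start, end) word indices, end exclusive. This is an assumed representation.
Span = Tuple[int, int]


def bracket_f_measure(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) over unlabeled bracket spans."""
    matched = len(gold & pred)  # spans present in both trees
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical four-word sentence: the two trees agree on two of three spans.
gold_spans = {(0, 4), (0, 2), (2, 4)}
pred_spans = {(0, 4), (0, 2), (1, 4)}
precision, recall, f_measure = bracket_f_measure(gold_spans, pred_spans)
print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")
```

If the thesis scored labeled brackets or per-word tags instead, the span tuples would need to carry a label and the figures would not be directly comparable to this sketch.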
URI:	http://140.113.39.130/cdrfb3/record/nctu/#GT070260326 http://hdl.handle.net/11536/127286
Appears in collections:	Theses (畢業論文)