Title: The Relation of Chinese Word Clustering and Word Chunking (中文常用詞分群與詞組的關係)
Authors: YANG, WAN-CHUN (楊婉君)
Sin-Horng Chen (陳信宏)
Institute of Communications Engineering (電信工程研究所)
Keywords: Word Chunk (詞組); Chinese Word Clustering (中文詞分群); Word Clustering
Issue Date: 2012
Abstract: Word chunking is a key task in natural language processing (NLP). Variation in chunk structure, changes in part of speech (POS), and ambiguity in both structure and individual characters all strongly affect chunk recognition. A good word chunker benefits many NLP applications, such as web mining, search engines, speech recognition, and text-to-speech synthesis. Building a good Chinese word chunker is challenging because Chinese is, in several respects, more complex than other languages: its grammatical structure is complicated, large annotated corpora are scarce, there are no explicit boundaries between words, and its lexicon is very large.

This thesis studies how clustering common Chinese words can help form word chunks. We first use the TF-IDF technique to build a lexicon of 180k common words whose coverage exceeds 99%. We then collect, for each word, statistics on its preceding and following words to form feature vectors, and cluster the words into classes using the symmetric Kullback-Leibler (KL) divergence. The clustering results are applied in two chunking studies. The first uses a conditional random field (CRF) to build a base-phrase chunk model from the Sinica Treebank corpus, with POS, word-cluster information, and semantic information from E-HowNet as features. The second uses boundary entropy to find adjacent word combinations with a high boundary probability and the M-value to find strongly associated combinations of word classes, and derives rules from them for chunking words and word classes.
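As an illustration of the clustering criterion named in the abstract, the following is a minimal sketch of a symmetric KL divergence between the following-word distributions of two words. The toy corpus, the add-one smoothing, the restriction to following words only, and the function names `following_distribution` and `symmetric_kl` are all assumptions made for this example; the thesis itself may build and compare its feature vectors differently.

```python
import math
from collections import Counter

def following_distribution(bigrams, word, vocab, alpha=1.0):
    """Smoothed distribution over the words that follow `word`.

    Add-alpha smoothing (an assumption of this sketch) keeps every
    probability positive so the KL terms stay finite.
    """
    counts = Counter(nxt for cur, nxt in bigrams if cur == word)
    total = sum(counts.values()) + alpha * len(vocab)
    return {v: (counts[v] + alpha) / total for v in vocab}

def symmetric_kl(p, q):
    """Return D(p||q) + D(q||p) for two distributions over the same keys.

    The two directed divergences combine into sum((p - q) * log(p / q)).
    """
    return sum((p[k] - q[k]) * math.log(p[k] / q[k]) for k in p)

# Hypothetical, already segmented toy corpus.
corpus = [
    ["我", "喝", "茶"],
    ["我", "喝", "水"],
    ["他", "喝", "水"],
    ["你", "吃", "飯"],
    ["你", "吃", "麵"],
]
bigrams = [(s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1)]
vocab = sorted({w for s in corpus for w in s})

p_wo = following_distribution(bigrams, "我", vocab)
p_ta = following_distribution(bigrams, "他", vocab)
p_ni = following_distribution(bigrams, "你", vocab)

# Words with similar right contexts get a smaller divergence.
print(symmetric_kl(p_wo, p_ta) < symmetric_kl(p_wo, p_ni))
```

In this toy data 我 and 他 are both followed by 喝, so their divergence is smaller than that between 我 and 你; a clustering procedure could use such pairwise values as its distance measure.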
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT070060222
http://hdl.handle.net/11536/72352
Appears in Collections: Thesis