Title: Chinese Word Clustering for Compounding (original title: 使用於複合詞組建構之中文詞分群)
Author: 鄭茂隆
王逸如
Institute of Communications Engineering
Keywords: Chinese word; clustering; compound word
Issue Date: 2014
Abstract: This thesis proposes using word clustering to assist Chinese compound-word construction. Using co-occurrence statistics of a word with its immediately preceding and following words, it defines the symmetric Kullback-Leibler divergence (KLD) as the distance between two words. Words in a 120,000-word lexicon that share the same part of speech (POS) are first clustered by this distance and then clustered further by semantic distance. The Forward and Backward Bigrams algorithm is then applied to the clusters to find highly correlated word-cluster pairs, from which compound words are constructed. Word-cluster pairs with unreasonable POS combinations are excluded, and pseudo compound words are detected and filtered out with a statistics-based method. The method was evaluated on Academia Sinica Treebank sentences containing two-word base phrases. It achieved an F-measure of 0.58 for compound-word construction, 0.14 higher than compounding based on KLD-only clustering. The compounding results were then combined with POS, word length, and the first and last characters of each word as features for base phrase detection, using a conditional random field (CRF) and a support vector machine (SVM). On the Sinica Treebank, the SVM-based method achieved an F-measure of 0.857, 0.011 higher than the CRF-based method.
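Illustration (not from the thesis): the abstract's word distance is a symmetric Kullback-Leibler divergence between context distributions. The minimal Python sketch below shows one way such a distance could be computed. It is an assumption-laden toy: it uses only following-word counts, add-alpha smoothing, and made-up tokens, and the function names are hypothetical. The thesis's actual estimation of neighbor probabilities may differ.

```python
import math
from collections import Counter

def neighbor_distribution(word, bigrams, vocab, alpha=1.0):
    """Add-alpha smoothed distribution over words that follow `word`.

    Smoothing is an assumption made here so that every probability is
    nonzero and the KL divergence stays finite.
    """
    counts = Counter(nxt for cur, nxt in bigrams if cur == word)
    total = sum(counts.values()) + alpha * len(vocab)
    return {v: (counts[v] + alpha) / total for v in vocab}

def kl(p, q):
    """KL divergence D(p || q) for distributions over the same keys."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p)

def symmetric_kld(p, q):
    """Symmetric KL divergence: D(p || q) + D(q || p)."""
    return kl(p, q) + kl(q, p)

# Toy bigram list (hypothetical data, for illustration only).
bigrams = [("red", "car"), ("red", "apple"),
           ("blue", "car"), ("blue", "sky"),
           ("eat", "apple")]
vocab = sorted({w for pair in bigrams for w in pair})

p_red = neighbor_distribution("red", bigrams, vocab)
p_blue = neighbor_distribution("blue", bigrams, vocab)
distance = symmetric_kld(p_red, p_blue)
```

The symmetric form is used because plain KL divergence depends on argument order, while a clustering distance should give the same value for (a, b) and (b, a).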
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT070160254
http://hdl.handle.net/11536/75580
Appears in Collections: Thesis