Chinese Word Clustering for Compounding

Full metadata record

DC Field	Value	Language
dc.contributor.author	鄭茂隆	en_US
dc.contributor.author	王逸如	en_US
dc.date.accessioned	2014-12-12T02:43:36Z	-
dc.date.available	2014-12-12T02:43:36Z	-
dc.date.issued	2014	en_US
dc.identifier.uri	http://140.113.39.130/cdrfb3/record/nctu/#GT070160254	en_US
dc.identifier.uri	http://hdl.handle.net/11536/75580	-
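dc.description.abstract	This thesis investigates how clustering Chinese words can assist compound-word construction. First, using the probabilities with which a word is adjacent to its preceding and following words, the symmetric Kullback-Leibler divergence (KLD) is defined as the distance between two words, and words sharing the same part of speech in a 120,000-word lexicon are clustered; the clusters are then refined using semantic information. The clustering results are then used in the Forward and Backward Bigrams algorithm to find highly correlated cluster pairs for compound-word construction. After removing unreasonable part-of-speech combinations and using a statistical method to detect and filter out pseudo compound words, the final compounding results are obtained. Finally, sentences containing two-word base phrases from the Academia Sinica Treebank Corpus are used to evaluate the method. The experimental results show that the F-measure of compound-word construction reaches 0.58, which is 0.14 higher than that of compounding based on KLD-only word clustering. The compounding results are also combined with features such as part of speech, word length, and the first and last characters of a word to detect base phrases using conditional random fields (CRF) and support vector machines (SVM). Experiments on the Sinica Treebank, the Chinese sentence-structure treebank annotated by Academia Sinica, show that the SVM method achieves an F-measure of 0.857, which is 0.011 higher than that of the CRF method.	zh_TW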
dc.description.abstract	This thesis proposes using word clustering to assist in Chinese word compounding. It first uses co-occurrence frequencies of the current word and its two nearest neighboring words to define a symmetric Kullback-Leibler divergence (KLD) as the distance measure between two words. Using this KL distance, words of the same part of speech (POS) are divided into clusters. Further clustering is then performed based on the semantic distance between words. The forward and backward bigrams algorithm is then applied to find highly correlated word-cluster pairs for compound-word construction. Word-cluster pairs with unreasonable POS combinations are excluded. In addition, pseudo compound words are detected and filtered out using a statistics-based method. The proposed method was evaluated on Sinica Treebank sentences containing two-word basic phrases. Experimental results showed that an F-measure of 0.58 was achieved, which is 0.14 higher than that of compounding based on KLD-only clustering. The word compounding results are then used together with POS, word length, and some word-level features to detect Chinese basic phrases. Two methods, based on conditional random fields (CRF) and support vector machines (SVM), are studied. Experimental results on the Sinica Treebank Corpus showed that the F-measure of the SVM-based method is 0.857, which is 0.011 better than that of the CRF-based method.	en_US
dc.language.iso	zh_TW	en_US
dc.subject	中文詞分群復合詞	zh_TW
dc.subject	chinese word clustering compounding	en_US
dc.title	使用於複合詞組建構之中文詞分群	zh_TW
dc.title	Chinese Word Clustering for compounding	en_US
dc.type	Thesis	en_US
dc.contributor.department	Institute of Communications Engineering	zh_TW
Appears in Collections:	Thesis
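
The abstract defines the distance between two words as a symmetric Kullback-Leibler divergence over their neighboring-word statistics. The following is a minimal sketch of that idea, not the thesis's implementation. It makes several assumptions that the record does not state: only following-word counts are used, the distributions are add-alpha smoothed so that no probability is zero, and the symmetric form is taken as the sum of the two directed divergences. The toy vocabulary, the counts, and all function names are hypothetical.

```python
import math


def smoothed_distribution(counts, vocab, alpha=1.0):
    """Convert raw neighbor counts into an add-alpha smoothed distribution.

    Smoothing is an assumption of this sketch; it keeps every probability
    strictly positive so that the KL divergence stays finite.
    """
    total = sum(counts.get(w, 0) for w in vocab) + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """Directed KL divergence D(p || q) over a shared vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def symmetric_kld(p, q):
    """Symmetric KL divergence, taken here as D(p || q) + D(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical toy data: how often each context word follows words x, y, z.
vocab = ["a", "b", "c"]
p_x = smoothed_distribution({"a": 8, "b": 2}, vocab)
p_y = smoothed_distribution({"a": 7, "b": 3}, vocab)
p_z = smoothed_distribution({"c": 10}, vocab)

d_xy = symmetric_kld(p_x, p_y)  # x and y share similar contexts
d_xz = symmetric_kld(p_x, p_z)  # x and z have very different contexts
```

Under these assumptions, words with similar following-word distributions (x and y) end up closer than words with dissimilar ones (x and z), which is the property a distance-based clustering step would rely on.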