Full metadata record
DC Field | Value | Language
dc.contributor.author | 鄭茂隆 | en_US
dc.contributor.author | 王逸如 | en_US
dc.date.accessioned | 2014-12-12T02:43:36Z | -
dc.date.available | 2014-12-12T02:43:36Z | -
dc.date.issued | 2014 | en_US
dc.identifier.uri | http://140.113.39.130/cdrfb3/record/nctu/#GT070160254 | en_US
dc.identifier.uri | http://hdl.handle.net/11536/75580 | -
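dc.description.abstract | This thesis investigates how clustering Chinese words can help compound-word construction. First, using the probabilities of adjacent preceding and following words, the symmetric Kullback-Leibler divergence (KLD) is defined as the distance between two words, and words with the same part of speech in a 120,000-word lexicon are clustered. The clusters are then further divided using semantics. The clustering results are next used in the Forward and Backward Bigrams algorithm to find highly correlated cluster pairs for compound-word construction. After unreasonable part-of-speech combinations are removed and pseudo compound words are detected and filtered out by a statistical method, the final compounding result is obtained. Finally, sentences containing two-word base phrases from the Academia Sinica Treebank Corpus are used in experiments to verify the method. The results show that the F-measure of compound-word construction reaches 0.58, which is 0.14 higher than that of compounding based on KLD-only word clustering. We also use the compounding results together with features such as part of speech, word length, and the first and last characters of a word to perform base-phrase detection with conditional random fields (CRF) and support vector machines (SVM). Experiments on the Chinese sentence-structure treebank annotated by Academia Sinica (Sinica Treebank) show that the SVM method achieves an F-measure of 0.857, which is 0.011 higher than that of the CRF method. | zh_TW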
dc.description.abstract | This thesis proposes using word clustering to assist in Chinese word compounding. It first uses co-occurrence frequencies of the current word and its two nearest neighboring words to define a symmetric Kullback-Leibler divergence (KLD) as the distance measure between two words. Using this KL distance, words of the same part of speech (POS) are divided into clusters. Further clustering is then performed based on the semantic distance between words. The forward and backward bigrams algorithm is then applied to find highly correlated word-cluster pairs for compound-word construction. Word-cluster pairs with unreasonable POS combinations are excluded, and pseudo compound words are detected using a statistics-based method. The proposed method was evaluated using Sinica Treebank sentences comprising two-word basic phrases. Experimental results showed that an F-measure of 0.58 was achieved, 0.14 higher than that of the method using only KLD-based clustering. The word compounding results are then used together with POS, word length, and some word-level features to detect Chinese basic phrases. Two methods, using conditional random fields (CRF) and support vector machines (SVM), are studied. Experimental results on the Sinica Treebank Corpus showed that the F-measure of the SVM-based method is 0.857, which is 0.011 higher than that of the CRF-based method. | en_US
dc.language.iso | zh_TW | en_US
dc.subject | Chinese words, clustering, compound words | zh_TW
dc.subject | chinese word clustering compounding | en_US
dc.title | 使用於複合詞組建構之中文詞分群 | zh_TW
dc.title | Chinese Word Clustering for compounding | en_US
dc.type | Thesis | en_US
dc.contributor.department | Institute of Communications Engineering | zh_TW
Appears in Collections: Thesis
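
The distance measure named in the abstract can be illustrated with a minimal sketch. The record does not give the thesis's exact formulation, so everything here is an assumption made for illustration: the symmetric KLD is taken as the sum of the two directed divergences, only following-word bigrams are used, add-one smoothing keeps the logarithms finite, and the toy bigram list and the names `neighbor_distribution`, `kl`, and `symmetric_kld` are hypothetical.

```python
import math
from collections import Counter

def neighbor_distribution(bigrams, word, vocab, alpha=1.0):
    """Smoothed probability of each vocabulary word following `word`.

    Additive smoothing (alpha) is an assumption of this sketch; it keeps
    every probability non-zero so the KL divergence stays finite.
    """
    counts = Counter(nxt for cur, nxt in bigrams if cur == word)
    total = sum(counts.values()) + alpha * len(vocab)
    return {v: (counts[v] + alpha) / total for v in vocab}

def kl(p, q):
    """Directed Kullback-Leibler divergence D(p || q) in nats."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p)

def symmetric_kld(p, q):
    """Symmetric KLD, taken here as D(p || q) + D(q || p)."""
    return kl(p, q) + kl(q, p)

# Toy (current word, next word) pairs, invented for this example.
bigrams = [
    ("red", "car"), ("red", "house"),
    ("blue", "car"), ("blue", "house"),
    ("eat", "rice"),
]
vocab = sorted({w for pair in bigrams for w in pair})

p_red = neighbor_distribution(bigrams, "red", vocab)
p_blue = neighbor_distribution(bigrams, "blue", vocab)
p_eat = neighbor_distribution(bigrams, "eat", vocab)

d_red_blue = symmetric_kld(p_red, p_blue)
d_red_eat = symmetric_kld(p_red, p_eat)
```

In this toy data "red" and "blue" are followed by the same words, so their distance is zero and a clustering step would group them, while "red" and "eat" end up a positive distance apart.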