A Study of Chinese Parser (中文斷詞器之研究)

Title:	中文斷詞器之研究 A Study of Chinese Parser
Authors:	唐大任 Da-Ren Tang; 王逸如 Yih-Ru Wang; Institute of Communications Engineering (電信工程研究所)
Keywords:	word segmenter (斷詞器); Parser
Issue Date:	2001
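Abstract:	In this thesis, we examine several problems that arise in building a Chinese word segmenter. We first use segmentation rules and word-formation rules together with a lexicon to guide segmentation, and we build a part-of-speech (POS) bigram model to tag the POS of each word. For compounds, determinative-measure (DM) compounds and four-character reduplications are regular, so we combine them with word-formation rules and then use the segmentation rules to choose between the words in the lexicon and the compound. In addition, if the output word sequence contains a prefix or suffix that can be combined, rules merge it with the following or preceding word to form a derived word. The balanced corpus provided by Academia Sinica is used as test data to assess the segmenter's performance. Inspecting the segmentation results shows that the long words we combine are mostly longer than those in the balanced corpus, and we consider these combined long words reasonable. Counting them together with the results that agree with the balanced corpus, the segmenter's accuracy reaches about 96%; the remaining errors are mostly caused by proper nouns and incomplete lexicon coverage. As for POS tagging, preliminary observation suggests the accuracy is good, but suitable test data are still needed to measure it more precisely. In this thesis, a parser for Chinese was studied. A parser identifies the words in a Chinese sentence and their associated part of speech (POS). Our parser uses the word matching rules proposed by the Chinese Knowledge Information Processing group (CKIP), Academia Sinica, together with word combination rules for compounds. First, in the word matching unit, the first word of the word chunk that has the maximal length and is the most plausible is selected. Then the word combination rules, namely the determinative-measure (DM) compound rules and the reduplication rules, group words into compounds. In this thesis, these rules are applied before word matching in order to resolve some ambiguities in the word matching unit. Prefix/suffix word construction rules are also used for post-processing, which can further combine words into a derived word. Finally, the POS bigram model is used to determine the POS of the words output by the parser. The Sinica Corpus published by CKIP was used to evaluate the performance of our system; the average word length produced by our system was larger than that produced by the CKIP parser. The output of our parser is therefore more suitable for a speech synthesis system.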
URI:	http://140.113.39.130/cdrfb3/record/nctu/#NT900435069 http://hdl.handle.net/11536/68947
Appears in Collections:	Thesis
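
The word matching and POS tagging steps named in the abstract can be illustrated with a small sketch. It is not the thesis's implementation: it substitutes plain greedy forward maximum matching for the CKIP chunk-based matching rules, omits the DM, reduplication, and prefix/suffix rules, and uses tag-to-tag bigram probabilities only. The lexicon entries, the tag set (`N`, `V`), the probabilities, and the function names `segment` and `tag` are all hypothetical, chosen only for illustration.

```python
import math

# Toy lexicon: word -> allowed POS tags (hypothetical entries, not from the thesis).
LEXICON = {
    "我們": ["N"],
    "研究": ["N", "V"],
    "中文": ["N"],
    "斷詞": ["V"],
    "斷詞器": ["N"],
    "器": ["N"],
}
MAX_WORD_LEN = max(len(w) for w in LEXICON)

# Hypothetical bigram probabilities P(tag | previous tag); "<s>" marks sentence start.
BIGRAM = {
    ("<s>", "N"): 0.7, ("<s>", "V"): 0.3,
    ("N", "N"): 0.4, ("N", "V"): 0.6,
    ("V", "N"): 0.8, ("V", "V"): 0.2,
}


def segment(text):
    """Greedy forward maximum matching: take the longest lexicon word at each position."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            cand = text[i:i + n]
            # Fall back to a single character when no lexicon word matches.
            if n == 1 or cand in LEXICON:
                words.append(cand)
                i += n
                break
    return words


def tag(words):
    """Viterbi search for the tag sequence with the highest bigram probability."""
    # best[t] = (log-probability, tag path) of the best path ending in tag t.
    best = {"<s>": (0.0, [])}
    for w in words:
        nxt = {}
        # Unknown words default to N (an assumption made for this sketch).
        for t in LEXICON.get(w, ["N"]):
            nxt[t] = max(
                (lp + math.log(BIGRAM[(prev, t)]), path + [t])
                for prev, (lp, path) in best.items()
            )
        best = nxt
    return max(best.values())[1]


words = segment("我們研究中文斷詞器")
print(words)       # ['我們', '研究', '中文', '斷詞器']
print(tag(words))  # ['N', 'V', 'N', 'N']
```

In this toy run the longest match keeps 斷詞器 as one word instead of splitting it into 斷詞 and 器, and the bigram model resolves the ambiguous word 研究 to a verb from its neighbouring tags.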