Title: 中文文句自動斷詞標詞類之研究與應用
A Study on Automatic Segmentation and Tagging of Chinese Sentence
Authors: 蘇育新
Yuh-Shin Su
陳信宏
Sin-Horng Chen
電信工程研究所 (Institute of Communications Engineering)
Keywords: 斷詞; 詞類標示; Word Segmentation; POS Tagging
Issue Date: 1993
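Abstract: This thesis studies a language model for automatically
segmenting Chinese sentences into words and tagging each word with
its part of speech (POS), together with a basic phoneme-to-text
language model. We first train several sets of language model
parameters, using a statistical method and several neural network
methods, and build an automatic POS tagging system to evaluate each
set. We then take the parameters from the statistical method and
from the better-performing neural network method and combine them
with a few simple word-formation rules to complete the automatic
segmentation and tagging system. We also use these parameters to
design a preliminary phoneme-to-text language model. In our
experiments, the training corpus contains 19303 words and the test
corpus contains 4836 words. In the outside test, the statistical
method achieves a segmentation rate of 97.1% and a POS tagging rate
of 94.4%, while the neural network method achieves a segmentation
rate of 97.3% and a POS tagging rate of 94.2%. The phoneme-to-text
accuracy is 91.0% for the statistical method and 90.9% for the
neural network method.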
Two approaches to automatic segmentation and tagging of
Chinese sentences are studied in this thesis. One is a
statistical approach that uses an explicit bigram language
model; the other is a neural net approach that uses a
multilayer perceptron (MLP) to predict the POS tags of words.
The performance of the two methods was examined by simulations
using a database with 19303 training words and 4836 testing
words. Segmentation and tagging rates of 97.1% and 94.4% were
achieved for the statistical method, and of 97.3% and 94.2%
for the neural net method. Extension of these two methods to
phoneme-to-text conversion is also studied using the same
database. Character accuracy rates of 91.0% and 90.9% were
obtained by the statistical and neural net methods, respectively.
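
The record gives no implementation details, so the following is only a minimal sketch of how joint segmentation and tagging with a bigram model might look. It assumes a POS-tag bigram P(tag | previous tag) combined with a word emission probability P(word | tag), searched with Viterbi-style dynamic programming over character positions. The lexicon, the tag set, every probability, and the names `LEXICON`, `TRANS`, and `segment_and_tag` are hypothetical and are not taken from the thesis.

```python
import math

# Hypothetical toy lexicon: word -> {tag: P(word | tag)}. Values are made up.
LEXICON = {
    "我": {"N": 0.2},
    "們": {"N": 0.01},
    "我們": {"N": 0.3},
    "研究": {"V": 0.4, "N": 0.2},
    "語言": {"N": 0.3},
    "模型": {"N": 0.3},
}

# Hypothetical tag bigram: (previous tag, tag) -> P(tag | previous tag).
TRANS = {
    ("<s>", "N"): 0.7, ("<s>", "V"): 0.3,
    ("N", "N"): 0.4, ("N", "V"): 0.6,
    ("V", "N"): 0.8, ("V", "V"): 0.2,
}


def segment_and_tag(sentence, max_len=4):
    """Return the best-scoring [(word, tag), ...] or None if no parse exists."""
    n = len(sentence)
    # best[i][tag] = (log score, backpointer) for the best path that ends
    # at character position i with a word carrying this tag.
    best = [dict() for _ in range(n + 1)]
    best[0]["<s>"] = (0.0, None)

    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = sentence[j:i]
            if word not in LEXICON:
                continue
            for tag, p_emit in LEXICON[word].items():
                for prev_tag, (prev_score, _) in best[j].items():
                    p_trans = TRANS.get((prev_tag, tag))
                    if p_trans is None:
                        continue
                    score = prev_score + math.log(p_trans) + math.log(p_emit)
                    if tag not in best[i] or score > best[i][tag][0]:
                        best[i][tag] = (score, (j, prev_tag))

    if not best[n]:
        return None  # no segmentation covers the whole sentence

    # Follow backpointers from the best final tag to the sentence start.
    tag = max(best[n], key=lambda t: best[n][t][0])
    i, result = n, []
    while i > 0:
        _, (j, prev_tag) = best[i][tag]
        result.append((sentence[j:i], tag))
        i, tag = j, prev_tag
    result.reverse()
    return result


if __name__ == "__main__":
    print(segment_and_tag("我們研究語言模型"))
```

With these toy numbers, the two-character word 我們 outscores the split 我 / 們, and 研究 is tagged as a verb because the assumed N-to-V and V-to-N transitions are more probable than N-to-N. A real system would estimate both tables from a tagged corpus and would need smoothing and a policy for unknown words, neither of which is shown here.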
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT820436031
http://hdl.handle.net/11536/58160
Appears in Collections: Thesis