Classical Chinese Sentence Division by Sequence Labeling Approaches

Full metadata record

DC Field	Value	Language
dc.contributor.author	黃瀚萱	en_US
dc.contributor.author	Hen-Hsen Huang	en_US
dc.contributor.author	孫春在	en_US
dc.contributor.author	Chuen-Tsai Sun	en_US
dc.date.accessioned	2014-12-12T01:19:15Z	-
dc.date.available	2014-12-12T01:19:15Z	-
dc.date.issued	2007	en_US
dc.identifier.uri	http://140.113.39.130/cdrfb3/record/nctu/#GT009555586	en_US
dc.identifier.uri	http://hdl.handle.net/11536/39538	-
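dc.description.abstract	Sentence division is an issue particular to Classical Chinese processing. Before the 20th century, the Chinese writing system did not customarily use punctuation marks. When reading ancient texts, readers had to identify from the wording where to pause or separate before they could understand the meaning. Because sentence division has no explicit rules or methods and relies entirely on the reader's linguistic intuition and experience, different readers often divide the same sentence differently, and different divisions lead to different interpretations. Sentence division is therefore an important and difficult first step in processing ancient texts. In the past there was no satisfactory automatic method, so the work was mostly done by hand by experts in literature and history. Although the common classics and histories are now available in divided and punctuated editions, historical documents continue to be unearthed, and countless ancient documents still await sentence division. In this study, I use two sequence labeling models, hidden Markov models (HMMs) and conditional random fields (CRFs), to design a Classical Chinese sentence division system, which obtains good results in experiments. The experiments also show that, as long as the training data are sufficient in quality and quantity, the system is applicable across texts, authors, and genres. For example, using the Shiji (《史記》) as training data gives quite good sentence division performance on other Old Chinese texts. The results of this study demonstrate the feasibility of automatic Classical Chinese sentence division, which can be put to practical use in digital archiving, text mining, and information extraction, assisting human effort in processing large volumes of historical documents more quickly.	zh_TW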
dc.description.abstract	Sentence segmentation is a special issue in Classical Chinese language processing. To facilitate reading and processing of raw Classical Chinese data, I propose a statistical method that splits unstructured Classical Chinese text into smaller pieces such as sentences and clauses. To build this segmenter, I transform the sentence segmentation task into a character labeling task and use two sequence labeling models, hidden Markov models (HMMs) and conditional random fields (CRFs), to perform the labeling. The methods are evaluated on nine datasets from several eras (from the 5th century BCE to the 19th century). The CRF segmenter achieves acceptable performance and can be applied to a variety of data from different eras.	en_US
dc.language.iso	zh_TW	en_US
dc.subject	古漢語斷句	zh_TW
dc.subject	自然語言處理	zh_TW
dc.subject	文本分割	zh_TW
dc.subject	序列標記	zh_TW
dc.subject	條件隨機域	zh_TW
dc.subject	Classical Chinese sentence division	en_US
dc.subject	natural language processing (NLP)	en_US
dc.subject	text segmentation	en_US
dc.subject	sequence labeling	en_US
dc.subject	conditional random fields (CRFs)	en_US
dc.title	以序列標記方法解決古漢語斷句問題	zh_TW
dc.title	Classical Chinese Sentence Division by Sequence Labeling Approaches	en_US
dc.type	Thesis	en_US
dc.contributor.department	Institute of Computer Science and Engineering	zh_TW
Appears in Collections:	Thesis
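
The abstract describes recasting sentence division as a character labeling task. The sketch below illustrates only that recasting, not the thesis's actual models or tag set: the two-tag scheme ("E" when a break follows a character, "O" otherwise), the punctuation set, and the function names `to_labels` and `segment` are all assumptions made for illustration. In a real system a trained HMM or CRF would predict the tags; here they are derived from an already punctuated sentence.

```python
# Hypothetical two-tag scheme (an assumption, not taken from the thesis):
#   "E" = a break follows this character, "O" = no break follows.
PUNCTUATION = set("，。、；：？！")


def to_labels(punctuated: str) -> tuple[str, list[str]]:
    """Strip punctuation and emit one tag per remaining character."""
    chars: list[str] = []
    labels: list[str] = []
    for ch in punctuated:
        if ch in PUNCTUATION:
            if labels:  # ignore punctuation before the first character
                labels[-1] = "E"
        else:
            chars.append(ch)
            labels.append("O")
    return "".join(chars), labels


def segment(raw: str, labels: list[str]) -> list[str]:
    """Rebuild clauses from unpunctuated text and per-character tags."""
    pieces: list[str] = []
    current: list[str] = []
    for ch, tag in zip(raw, labels):
        current.append(ch)
        if tag == "E":
            pieces.append("".join(current))
            current = []
    if current:  # trailing characters with no closing break
        pieces.append("".join(current))
    return pieces


raw, labels = to_labels("學而時習之，不亦說乎。")
print(raw)
print(labels)
print(segment(raw, labels))
```

Training pairs of this shape (an unpunctuated character sequence plus one tag per character) are what a sequence labeler would learn from; at prediction time only `segment` is needed, with the tags supplied by the model.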

Files in This Item:

558601.pdf
