Title: 漢語間統計式機器翻譯語料處理-用臺灣閩南語示範
Corpus Preprocessing for Statistical Machine Translation between the Chinese Languages - Using Taiwan Southern Min as Examples
Authors: 薛丞宏
Sih, Sing-Hong
張智星
易志偉
Jang, Jyh-Shing
Yi, Chih-Wei
Institute of Computer Science and Engineering
Keywords: Southern Min;Taiwanese;Mandarin;Translation;Corpus;Segmentation;Language Identification
Publication Date: 2014
Abstract: Taiwan is a multiethnic, multilingual country. Speaking and using one's mother tongue is a basic right, yet there are very few computer applications for these languages, so more natural language processing research and corpus collection are needed. Taiwan has many local languages; this thesis focuses on Taiwanese Southern Min, a major local language in Taiwan, and studies the characteristics of its translation corpora, with the aim of achieving good performance in statistical machine translation. We hope the results will also benefit computational linguistic research on Taiwan's other local languages. The thesis proposes a method for automatically preprocessing Chinese-language corpora that fills in the missing information in corpora with incomplete information; after this refinement, the BLEU score rises from 9.30 to 13.82. Experiments further show that when the parallel corpus contains fewer than 100,000 sentences, translation performance is highly sensitive to corpus size: increasing the corpus from 64,121 to 99,147 sentences raises the BLEU score from 13.82 to 19.33.
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT070156016
http://hdl.handle.net/11536/125657
Appears in Collections: Thesis