Title: A Quantitative Study on Mandarin Question Detection (漢語問句偵測之量化研究)
Authors: Ping-Jer Yeh (葉秉哲)
Shyan-Ming Yuan
Keywords: Mandarin question detection; natural language processing (NLP); statistical inference; language models
Publication date: 2003
Abstract: Questions are among the most fundamental and frequent speech acts in everyday life. They also play an important role in sub-areas of computer science such as human-computer dialogue, computer-computer dialogue, and punctuation processing. A natural language processing (NLP) system is not complete without proper detection and processing of questions. Detecting questions is more difficult in Mandarin than in English, owing both to the nature of the language itself and to the traditional research focus of Mandarin linguistics and NLP. This study therefore undertakes a quantitative investigation of this comparatively fundamental problem. Because electronic corpus resources are limited, the study is confined for now to the lexico-syntactic level. To tackle this new topic, our strategy is to maximize recall first and then to increase precision. To achieve higher recall, we first perform univariate analysis on the datasets with a variety of statistical inference and pattern matching techniques; at this stage we discover lexical features that are more comprehensive and precise than those listed in the traditional linguistic literature. Next, to increase precision, we perform white-box bivariate analysis to filter out some of the false positives from the previous stage. Finally, we perform black-box multivariate analysis with several language modeling techniques, which successfully discriminates more cases. In this preliminary study we achieve good recall and precision, and we open a new quantitative approach to Mandarin question detection.


