Title: A Multi-Layered Framework for Chinese Named Entity Recognition (多層架構之中文具名實體辨識)
Authors: 陳大任
李錫堅
Institute of Computer Science and Engineering
Keywords: named entity; lexical analysis; word segmentation; unknown word; maximal matching
Issue Date: 2003
Abstract: Handling out-of-vocabulary (OOV) words is a key requirement for high-quality lexical analysis in natural language processing. Among OOV words, named entities (NEs) are the most productive class and follow almost no generation rules, yet they usually carry the most meaningful parts of a sentence (persons, events, times, places, and objects). In this paper, we propose a classification scheme for Chinese NEs and a multi-layered "generation, filtering, and recovery" framework for Chinese named entity recognition (NER). The system first uses a set of statistical models and heuristic rules to generate as many NE candidates as possible, so as to obtain a high recall rate. It then treats the filtering of false candidates as an ambiguity resolution problem, which is solved with a lexical analyzer driven mainly by maximal matching. Finally, rule-driven pattern matching detects and recovers from abnormal errors produced by the first two phases. The system uses only lexical (surface-form) information. It achieves a recall of 96% on personal names (PER), and recalls of 88%, 89%, and 80% on transliterated names (TRA), location names (LOC), and organization names (ORG), respectively. Overall precision is over 90% and the excluding rate is over 99%, so the system obtains good performance from relatively little information. The framework leaves considerable flexibility and room for improvement in model design: more precise language models, more complete heuristic rules, and additional knowledge and models could be added to improve performance within this framework.
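The filtering phase described in the abstract relies on maximal matching. The following is a minimal sketch of plain forward maximal matching, not the thesis's actual analyzer, whose rules are not given in this record. The function name `forward_max_match`, the toy lexicon, and the candidate set are all hypothetical and chosen only for illustration. It shows how merging generated NE candidates into the lexicon lets the segmenter keep a candidate that wins the longest match and implicitly drop one that does not.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Greedy forward maximal matching.

    At each position, take the longest substring found in `lexicon`;
    fall back to a single character when nothing matches.
    """
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a lexicon word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical base lexicon and hypothetical NE candidates from a generation phase.
base_lexicon = {"提出", "架構"}
ne_candidates = {"陳大任", "任提"}  # the second candidate is a deliberate false one

# Candidates compete with ordinary words; the longest match at each position wins.
tokens = forward_max_match("陳大任提出架構", base_lexicon | ne_candidates)
surviving = [t for t in tokens if t in ne_candidates]
```

In this toy run the false candidate "任提" never surfaces, because "陳大任" is matched first and "提出" then covers the following characters. A real system would need further disambiguation rules, since greedy forward matching alone can pick the wrong reading (for example, "研究生命起源" with "研究生" in the lexicon).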
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009117517
http://hdl.handle.net/11536/49580
Appears in Collections: Thesis


Files in This Item:

  1. 751701.pdf
