Title: 中華現代人名與稱謂之結構分析
An Analysis on the Structure of Modern Chinese Name and Title
Author: 吳孟哲
Wu, Meng-Che
陳信宏
Chen, Sin-Horng
Institute of Communications Engineering
Keywords: 命名實體 (Named Entity); 中華現代人名 (Modern Chinese Name); 稱謂 (Title); 有限狀態機 (Finite-State Machine)
Issue Date: 2015
Abstract: This thesis examines two kinds of named entity in natural language processing (NLP), personal names and titles (addressing terms), and studies their types and how they relate to each other in modern Chinese. Titles are collected from CILIN (同義詞詞林) and E-HowNet and annotated with the part-of-speech (POS) tags of the Chinese Knowledge and Information Processing Group, Institute of Information Science, Academia Sinica. Sentence patterns that contain personal names and titles are then extracted from the Linguistic Data Consortium (LDC) Chinese Gigaword corpus and analyzed, and the types of names and titles are each examined further. Within a sentence, a title may combine with an adjacent word and change its meaning; we call this a "compound title". We use Mutual Information and the T-score to measure how strongly the parts of a compound title are associated, and then classify the compounds by meaning. We also divide titles, from the standpoint of social evaluation, into commendatory and derogatory titles and analyze the compound categories of each. For personal names, we follow the population statistics of the Department of Household Registration, Ministry of the Interior, R.O.C., and analyze each name as two parts, the surname and the given name. Counting the population under each surname shows that the 100 most common surnames cover 96.56% of the national population. In given names, the characters chosen differ by gender. Finally, we use finite-state machines (FSMs) to describe the categories and sentence patterns of personal names and titles, build title inventories for the medical and educational domains, and tag all in-domain name and title patterns in a test corpus. The recall rates are 90.6% and 82.5%, respectively, which indicates that the FSMs capture most of the name and title type combinations and sentence patterns in each domain.
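The abstract names Mutual Information and the T-score as the measures of how tightly a title binds to a neighboring word. The sketch below shows the standard corpus-linguistics forms of both measures computed from raw counts. It is an illustration only, not the thesis's implementation: the function names, the use of a single corpus size `n` for both unigram and bigram probabilities, and the counts in the usage note are all assumptions.

```python
import math


def mutual_information(c_xy, c_x, c_y, n):
    """Pointwise mutual information (in bits) of an adjacent word pair.

    c_xy: count of the pair "x y"; c_x, c_y: counts of each word;
    n: corpus size in tokens (assumed to normalize both unigrams and pairs).
    """
    # log2( P(x, y) / (P(x) * P(y)) ) rewritten in terms of counts
    return math.log2((c_xy * n) / (c_x * c_y))


def t_score(c_xy, c_x, c_y, n):
    """T-score: observed minus expected pair count, scaled by sqrt(observed)."""
    expected = c_x * c_y / n  # pair count expected if x and y were independent
    return (c_xy - expected) / math.sqrt(c_xy)
```

With made-up counts such as `mutual_information(500, 1000, 2000, 1_000_000)`, a pair that occurs far more often than chance predicts gets a large positive value on both measures, while a pair that occurs exactly as often as chance predicts scores 0 on both. Mutual Information tends to favor rare but exclusive pairs and the T-score tends to favor frequent ones, which is a common reason for reporting the two together.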
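The abstract also describes using finite-state machines to tag name and title patterns. As a minimal sketch of that idea, the recognizer below accepts one hypothetical pattern: a surname, then zero to two given-name characters, then a title. The mini-lexicons `SURNAMES` and `TITLES`, the state names, and the pattern itself are assumptions made for illustration; the thesis's actual states, lexicons, and patterns are not given in this record.

```python
SURNAMES = {"陳", "吳", "林", "王"}          # hypothetical mini-lexicon
TITLES = {"醫師", "教授", "老師", "護理師"}   # hypothetical medical/educational titles

# After a surname, up to two given-name characters may be consumed.
NEXT_NAME_STATE = {"SURNAME": "GIVEN1", "GIVEN1": "GIVEN2"}


def title_at(text, i):
    """Return the longest lexicon title starting at position i, or None."""
    for length in sorted({len(t) for t in TITLES}, reverse=True):
        if text[i:i + length] in TITLES:
            return text[i:i + length]
    return None


def match_at(text, start):
    """Run the FSM from `start`; return the accepted span or None."""
    if start >= len(text) or text[start] not in SURNAMES:
        return None                      # START state has no valid transition
    state, i = "SURNAME", start + 1
    while True:
        title = title_at(text, i)
        if title is not None:            # any name state -> ACCEPT on a title
            return text[start:i + len(title)]
        is_cjk = i < len(text) and "\u4e00" <= text[i] <= "\u9fff"
        if state in NEXT_NAME_STATE and is_cjk:
            state, i = NEXT_NAME_STATE[state], i + 1
        else:
            return None                  # dead end: no title within reach


def find_mentions(text):
    """Scan left to right and collect non-overlapping accepted spans."""
    found, pos = [], 0
    while pos < len(text):
        span = match_at(text, pos)
        if span is None:
            pos += 1
        else:
            found.append(span)
            pos += len(span)
    return found
```

For example, `find_mentions("陳信宏教授與吳醫師開會")` picks out the full-name form `陳信宏教授` and the surname-only form `吳醫師`. A string such as `王先生來了` yields nothing here, because `先生` is not in the toy title lexicon. Checking for a title before consuming a given-name character keeps a form like `吳醫師` from being misread as a surname followed by given-name characters.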
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT070160258
http://hdl.handle.net/11536/127267
Appears in Collections: Thesis