Title: 以鬆弛法做中文斷詞及其應用
CHINESE WORD IDENTIFICATION BY THE RELAXATION TECHNIQUE AND ITS APPLICATION
Authors: 范長康
FAN, CHANG-KANG
蔡文祥
CAI, WEN-XIANG
Institute of Computer Science and Engineering
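Keywords: 中文斷詞 (Chinese word identification); 鬆弛法 (relaxation technique); 語詞指派 (word assignment); 音節對詞的指派 (syllable-to-word assignment); 縮減碼對詞的指派 (partial-code-to-word assignment); CHINESE-WORD-IDENTIFICATION; RELAXATION-TECHNIQUE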
Issue date: 1990
Abstract: A new approach to automatic word identification and three applications of the approach are proposed. Word identification is treated as the problem of finding consistent assignments of the characters in a sentence to the words that compose the sentence, and this problem is solved with the relaxation technique. The formation relationships among words in sentences are first analyzed and used to define the compatibility coefficients among character-to-word assignments. The initial probability of each assignment is estimated from the usage frequencies of the words. During the relaxation iterations, the probability of each assignment is updated according to its degree of compatibility with its neighboring assignments. This gradually eliminates assignments that are incompatible with the others and makes compatible assignments more prominent. When certain termination conditions are satisfied, the iterations stop and the words selected by the remaining compatible assignments are taken as the result of the identification process.
Based on the proposed relaxation-based word identification technique, three applications have been developed: sentence decomposition, phonetic input ambiguity removal, and keystroke number reduction. First, the technique can be used directly for sentence decomposition. The test results indicate an identification rate of over 95%.
Next, the technique is employed to resolve the ambiguity caused by homonyms in the phonetic Chinese input method. To accommodate the large number of homonyms, syllable-to-word assignments are defined and additional word formation relationships are included. In a modified version of the relaxation process described above, incompatible syllable-to-word assignments are gradually discarded and compatible ones are retained. Once consistent assignments are obtained, the words composing the sentence are identified and the characters corresponding to the syllables are determined at the same time. The experiments show a hit rate of about 96%.
Finally, the technique is used to solve the keystroke number reduction problem, whose objective is to let users enter only partial input codes for Chinese characters. This helps users who occasionally forget one or two code elements of a character during Chinese input. A two-stage process is proposed. The first stage is a preprocessing step that finds all the homologues corresponding to the partial codes, as well as all the words that can be formed from these homologues. The second stage is a modified version of the word identification process described above, which handles the numerous homologues and selects the correct characters. A hit rate of about 90% is achieved when at most two code elements of the input code of each Chinese character may be randomly omitted. Some conclusions and suggestions for future study are also included.
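The relaxation process described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract gives neither the compatibility coefficients nor the update formula, so the sketch assumes the classic probabilistic relaxation-labelling update p <- p(1 + q) / sum p(1 + q), a simple +1/-1 compatibility based only on whether two adjacent assignments agree on a word boundary, and a fixed iteration count as the termination condition. The names `candidate_spans`, `compatibility`, `segment`, and the toy lexicon are hypothetical.

```python
def candidate_spans(sentence, lexicon):
    """For each character position, list every (start, end) word span covering it."""
    n = len(sentence)
    max_len = max((len(w) for w in lexicon), default=1)
    spans = [[] for _ in range(n)]
    for s in range(n):
        for e in range(s + 1, min(n, s + max_len) + 1):
            # Single characters are always allowed so every position has a label.
            if e == s + 1 or sentence[s:e] in lexicon:
                for i in range(s, e):
                    spans[i].append((s, e))
    return spans


def compatibility(i, a, j, b):
    """Assumed coefficient: +1 if span a at position i and span b at the
    adjacent position j can coexist in one segmentation, otherwise -1."""
    if i > j:
        i, a, j, b = j, b, i, a  # order the pair so that j == i + 1
    if a == b:  # both characters belong to the same word
        return 1.0
    if a[1] == j and b[0] == j:  # a word boundary lies between i and j
        return 1.0
    return -1.0


def segment(sentence, lexicon, iterations=20):
    """lexicon maps word -> positive usage frequency (assumed input format)."""
    n = len(sentence)
    spans = candidate_spans(sentence, lexicon)
    # Initial probabilities are proportional to word usage frequency.
    probs = []
    for labels in spans:
        weights = [lexicon.get(sentence[s:e], 1) for s, e in labels]
        total = sum(weights)
        probs.append([w / total for w in weights])

    for _ in range(iterations):
        new_probs = []
        for i in range(n):
            neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n]
            scores = []
            for a, p in zip(spans[i], probs[i]):
                support = 0.0
                for j in neighbours:
                    support += sum(compatibility(i, a, j, b) * pb
                                   for b, pb in zip(spans[j], probs[j]))
                q = support / len(neighbours) if neighbours else 0.0
                scores.append(p * max(0.0, 1.0 + q))
            total = sum(scores)
            new_probs.append([x / total for x in scores] if total > 0 else probs[i])
        probs = new_probs

    # Read out: at each word start, take the most probable span starting there.
    words, i = [], 0
    while i < n:
        starting = [(p, span) for span, p in zip(spans[i], probs[i]) if span[0] == i]
        _, (s, e) = max(starting)
        words.append(sentence[s:e])
        i = e
    return words


# Toy lexicon with made-up frequencies; Latin letters stand in for characters.
toy_lexicon = {"a": 10, "b": 10, "c": 10, "ab": 30, "bc": 5}
print(segment("abc", toy_lexicon))
```

On this toy input the two readings `ab | c` and `a | bc` compete for the middle character. The higher frequency of `ab`, reinforced by the boundary constraints over the iterations, should favour the first reading.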
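Word identification is an important step in Chinese information processing. Written Chinese places no delimiters between characters; most characters can be used as single-character words; many characters in a sentence can form a word with either the preceding or the following character; and many long words contain several shorter words. All of these properties cause ambiguity in word identification. This thesis proposes a Chinese word identification method based on the relaxation principle. Word identification is treated as a process of making a "character-to-word assignment" for each character in the sentence. The formation relationships among the characters and words of a sentence are analyzed and used as constraints on the assignments, and a probabilistic relaxation iteration is used to build the model for updating the assignment probabilities. As the relaxation procedure runs, these constraints eliminate incompatible assignments, and the correct identification result is found at the end.
Three applications are developed on this basis. The first applies the method directly to decompose a sentence into its constituent words; the experiments give an accuracy of 95%.
The procedure is then modified and applied to two other problems in Chinese processing. One concerns the many characters that share one pronunciation in phonetic input; the other concerns reducing the number of code elements typed during Chinese input. Because one syllable corresponds to many characters in Chinese, users of phonetic (注音) input often spend considerable effort picking the desired character from the set of homophones shown on the screen. This study recasts the problem as "syllable-to-word assignment": a modified version of the word identification procedure automatically converts the input syllable string into the corresponding Chinese characters, and the experiments give an accuracy of 96%. In addition, the encoding rules of most Chinese input codes are complicated, and ordinary users often find them hard to memorize in full. This study proposes a method that lets users type a partial code instead of the full code. The method first finds all the homologues (同碼字), that is, the characters corresponding to the same partial code, together with the words they can form. It then modifies the original relaxation procedure using the notion of "partial-code-to-word assignment" to find the correct characters. When two code elements may be randomly omitted from each character, the experiments give an accuracy of 90%.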
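The preprocessing stage of the keystroke-reduction application (finding the homologues of each partial code and the words they can form) might look like the sketch below. The record does not specify the input code or the matching rule, so the sketch assumes that a partial code is the full code with at most two elements removed and their order preserved, which makes matching a subsequence test. The code table, the lexicon, and the function names are all hypothetical placeholders; Latin letters stand in for both code elements and characters.

```python
def is_subsequence(partial, full):
    """True if partial can be obtained from full by deleting elements in place."""
    it = iter(full)
    return all(element in it for element in partial)


def homologues(partial, code_table, max_omitted=2):
    """Characters whose full code is consistent with the partial code."""
    return sorted(
        char for char, full in code_table.items()
        if 0 <= len(full) - len(partial) <= max_omitted
        and is_subsequence(partial, full)
    )


def candidate_words(partial_codes, code_table, lexicon, max_omitted=2):
    """All (start, word) pairs whose characters are homologues of the
    partial codes at the corresponding positions."""
    per_position = [set(homologues(p, code_table, max_omitted))
                    for p in partial_codes]
    found = []
    for start in range(len(per_position)):
        for word in lexicon:
            end = start + len(word)
            if end <= len(per_position) and all(
                char in per_position[start + k] for k, char in enumerate(word)
            ):
                found.append((start, word))
    return sorted(found)


# Hypothetical table: character -> full input code.
toy_codes = {"X": "abcd", "Y": "abed", "Z": "xyz"}
toy_words = {"XZ", "Y", "Z"}

print(homologues("abd", toy_codes))
print(candidate_words(["abd", "xz"], toy_codes, toy_words))
```

The `(start, word)` pairs produced here would then serve as the candidate assignments for the second stage, where the relaxation process selects among them.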
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT792394049
http://hdl.handle.net/11536/55295
Appears in collections: Theses