Title: Automatic Classification of Chinese Documents (中文文件自動分類)
Author: Huang, Sen-Yang (黃森原)
Ja-Chen Lin (林志青)
Institute of Computer Science and Engineering (資訊科學與工程研究所)
Keywords: Document Classification; Document Filing; Keyword (文件分類;文件歸檔;詞彙)
Issue Date: 1995
Abstract: As information accumulates, more and more Chinese documents, including newspapers and magazines, must be stored and classified by computer. This thesis therefore proposes a classification system for Chinese documents. First, we introduce an automatic keyword-extraction algorithm for Chinese documents. The algorithm selects the more important keywords in a document and assigns each one a so-called Local Weight that depends on the position of the keyword in the document. Next, documents whose classes are already known serve as a training set from which the machine learns and builds a special-purpose dictionary for the system. The importance and dispersion of the keywords in each class (their Energy and Entropy values) are then used to learn a weight matrix that describes the relationship between keywords and classes. We also propose a method for enhancing this weight matrix so that classification results have higher certainty. Finally, so that the system can be extended to cover new document classes, we use dynamic clustering and document vectorization to test whether the current system may contain new classes, and we report the documents that might cluster into a new class to the system manager, who decides whether to create that class. Experimental results show that the proposed techniques can be applied to the classification of Chinese documents such as newspapers and magazines.
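Illustration: the abstract does not give the formulas for Energy, Entropy, or the Local Weight, so the following is only a minimal sketch of how a keyword-class weight matrix of this general kind might be built and used. It assumes that Energy is the summed Local Weight of a keyword within a class and that Entropy is the Shannon entropy of that keyword's distribution across classes; the function names `build_weight_matrix` and `classify`, and the input layout, are hypothetical and not taken from the thesis.

```python
import math
from collections import defaultdict


def build_weight_matrix(training_docs):
    """Build {keyword: {class: weight}} from labelled training documents.

    training_docs: list of (class_label, {keyword: local_weight}) pairs.
    The local weights are assumed to be already computed (for example,
    from keyword position); the thesis's own weighting is not shown here.
    """
    # Assumed "energy": total local weight of each keyword within each class.
    energy = defaultdict(lambda: defaultdict(float))
    classes = set()
    for label, keywords in training_docs:
        classes.add(label)
        for kw, w in keywords.items():
            energy[kw][label] += w

    # Largest possible entropy for this number of classes (used to normalise).
    max_entropy = math.log(len(classes)) if len(classes) > 1 else 1.0

    weights = {}
    for kw, per_class in energy.items():
        total = sum(per_class.values())
        if total <= 0:
            continue  # keyword carries no weight; skip it
        probs = {c: e / total for c, e in per_class.items()}
        # Assumed "entropy": how evenly the keyword is spread over classes.
        entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
        # Keywords concentrated in few classes get a factor near 1,
        # keywords spread evenly over all classes get a factor near 0.
        concentration = 1.0 - entropy / max_entropy
        weights[kw] = {c: p * concentration for c, p in probs.items()}
    return weights


def classify(weights, keywords):
    """Return the class with the highest summed score, or None if no match."""
    scores = defaultdict(float)
    for kw, local_weight in keywords.items():
        for c, m in weights.get(kw, {}).items():
            scores[c] += local_weight * m
    return max(scores, key=scores.get) if scores else None


training = [
    ("sports", {"ball": 2.0, "team": 1.0}),
    ("finance", {"stock": 2.0, "team": 1.0}),
]
matrix = build_weight_matrix(training)
predicted = classify(matrix, {"ball": 1.0, "team": 1.0})
```

In this sketch a keyword that appears equally in every class (here "team") contributes almost nothing to the score, while a keyword seen in only one class ("ball") dominates it. The enhancement step and the dynamic clustering for new classes mentioned in the abstract are not covered.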
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT840394009
http://hdl.handle.net/11536/60450
Appears in Collections: Thesis