Title: 中文文件自動分類
Automatic Classification of Chinese Documents
Authors: 黃森原
Huang, Sen-Yang
林志青
Ja-Chen Lin
資訊科學與工程研究所 (Institute of Computer Science and Engineering)
Keywords: 文件分類;文件歸檔;詞彙;Document Classification;Document Filing;Keyword
Issue Date: 1995
Abstract: 隨著資訊的累積,越來越多的中文文件、報紙和雜誌必須以電腦來儲
存和分類。 因此,本論文提出了一個中文文件分類系統。首先,我們提
出了一個「中文文件自動化取詞」的演算法,此方法可以將中文文件中較
為重要的詞彙篩選出來,並且依照詞彙的位置給予不同的加權值。然後,
我們利用一些事先已知類別的文件,由機器去學習以建立系統的專有詞典
,並且利用所有的詞彙在各個類別的重要性和離散度,來學習並建立起描
述詞彙和類別關係的加權矩陣。同時,我們也提出如何強化此一加權矩陣
的方法,以便讓文件分類結果的肯定度更高。最後,為了讓分類系統能擴
張以包含新的文件類別,我們利用動態分群和文件向量化的方法,來測試
目前系統是否可能含有新的類別,並且將可能群聚成新類的文件告知系統
管理者,以便利由系統管理者決定是否另設新的類別。實驗結果顯示所提
的技巧確實可以使用在報章雜誌等中文文件的分類上。
More and more Chinese documents, including newspapers and
magazines, are now stored and classified by computer. This
thesis proposes a classification system for Chinese documents.
First, we introduce a keyword extraction algorithm for Chinese
documents. The algorithm extracts the main keywords of a
document and assigns each one a so-called Local Weight, which
varies with the position of the keyword in the document. Next,
we use documents whose classes are already known as a training
set, so that the machine can learn and build a special-purpose
dictionary. The Energy and Entropy values of each keyword in
each class are then used to build the weight matrix that
describes the relationship between keywords and classes. We
also propose a method for enhancing the weight matrix so that
classification results have higher certainty. Finally, so that
the system can be extended to cover new document classes, we
use dynamic clustering and document vectorization to detect
whether the current document collection contains new classes,
and we report the documents that might form a new class to the
system manager, who decides whether to create one. Experimental
results show that the proposed techniques can be applied to the
classification of Chinese documents such as newspapers and
magazines.
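The abstract does not give the thesis's formulas, so the following
is only a minimal sketch of how position-based Local Weights and an
entropy-scaled keyword/class weight matrix could fit together. The
position weights, the definition of Energy as summed local weight
per class, the normalized-entropy scaling, and all function names
are assumptions made for illustration, not the authors' method.

```python
import math
from collections import defaultdict

# Assumed position weights: a keyword in the title counts more than one in the body.
POSITION_WEIGHT = {"title": 3.0, "body": 1.0}

def local_weights(doc):
    """doc is a list of (keyword, position) pairs; returns {keyword: local weight}."""
    weights = defaultdict(float)
    for keyword, position in doc:
        weights[keyword] += POSITION_WEIGHT[position]
    return dict(weights)

def train_weight_matrix(training_set, classes):
    """training_set is a list of (doc, class_label); returns {keyword: {class: weight}}."""
    # Assumed Energy: total local weight of a keyword within each class.
    energy = defaultdict(lambda: {c: 0.0 for c in classes})
    for doc, label in training_set:
        for keyword, w in local_weights(doc).items():
            energy[keyword][label] += w

    max_entropy = math.log(len(classes)) if len(classes) > 1 else 1.0
    matrix = {}
    for keyword, per_class in energy.items():
        total = sum(per_class.values())
        probs = [e / total for e in per_class.values() if e > 0]
        # Entropy of the keyword's distribution over classes: high when it is
        # spread evenly, zero when it occurs in a single class.
        entropy = -sum(p * math.log(p) for p in probs)
        concentration = 1.0 - entropy / max_entropy
        matrix[keyword] = {c: concentration * e / total for c, e in per_class.items()}
    return matrix

def classify(doc, matrix, classes):
    """Scores each class by summing local weight times matrix weight; returns (best, scores)."""
    scores = {c: 0.0 for c in classes}
    for keyword, w in local_weights(doc).items():
        for c, m in matrix.get(keyword, {}).items():
            scores[c] += w * m
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    classes = ["sports", "finance"]
    training = [
        ([("match", "title"), ("score", "body"), ("company", "body")], "sports"),
        ([("stock", "title"), ("company", "body")], "finance"),
    ]
    matrix = train_weight_matrix(training, classes)
    print(classify([("stock", "body"), ("company", "body")], matrix, classes))
```

In this sketch a keyword that is spread evenly over all classes gets
a weight near zero, while a keyword confined to one class keeps its
full weight, which is one plausible reading of combining importance
with dispersion.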
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT840394009
http://hdl.handle.net/11536/60450
Appears in Collections: Thesis