Automatic Classification of Text Documents by Using Latent Semantic Indexing

Title:	運用潛在語意索引的自動化文件分類 Automatic Classification of Text Documents by Using Latent Semantic Indexing
Authors:	汪若文 Juo-Wen Wang; 劉敦仁 Duen-Ren Liu; 管理學院資訊管理學程 (College of Management, Information Management Program)
Keywords:	automatic classification; information retrieval; latent semantic indexing; vector space model
Issue Date:	2003
Abstract:	Searching and browsing are both important tasks in information retrieval. Search gives users a fast way to find the material they need, but retrieval that relies too heavily on word matching cannot handle synonymy and polysemy effectively. Users also cannot always formulate a good query, so they may fail to find the information they actually need. A good information service therefore needs, in addition to search, a browsing service built on a sound classification scheme; the two functions complement each other. Good document classification is the basic groundwork for such a browsing service. Classifying documents takes two steps: first, represent each document in a suitable mathematical form; second, apply a suitable classification algorithm to classify the documents automatically. Classification is a task of conceptualization, yet the conventional vector space model represents documents in a way that cannot escape direct dependence on the words themselves. Latent semantic indexing (LSI) aims to uncover the semantic concepts latent in documents, and semantic concepts are exactly what classification turns on, so LSI should be well suited to document classification. This thesis uses LSI to represent documents and pairs it with two classification algorithms, the centroid vector method and k-NN, to study the feasibility and effectiveness of automatic document classification. As a control, the vector space model is paired with the same two algorithms and the classification results are compared.
This study deals with single-category classification only. The results show that representing documents with LSI and applying a suitable classification algorithm yields stable classification results, so using LSI for automatic classification of text documents is feasible. In this study, however, the classification accuracy with LSI was lower than with the vector space model, whether paired with the centroid vector method or with k-NN. Whether LSI is better suited to multi-category classification, and whether combining LSI with other classification algorithms gives better results, needs further study.
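Illustrative sketch (not taken from the thesis): the two-step pipeline described in the abstract can be outlined in Python with NumPy. This is a minimal sketch under stated assumptions, not the author's implementation. The toy term-by-document matrix, the function names (`lsi_fit`, `lsi_project`, `centroid_classify`, `knn_classify`), the use of raw term counts, the choice of k = 2 latent dimensions, and the similarity-weighted k-NN vote are all hypothetical choices made for illustration. The same two classifiers are run on the raw vector space representation as a baseline.

```python
import numpy as np

def lsi_fit(term_doc, k):
    """Truncated SVD of a term-by-document matrix; returns (U_k, s_k)."""
    U, s, _ = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k]

def lsi_project(x, U_k, s_k):
    """Fold term-space vector(s) into the k-dim LSI space: inv(S_k) U_k^T x.
    Accepts one vector (terms,) or a matrix of column vectors (terms, docs)."""
    return ((U_k.T @ x).T / s_k).T

def cosine(a, B):
    """Cosine similarity between vector a and each column of B."""
    return (a @ B) / (np.linalg.norm(a) * np.linalg.norm(B, axis=0) + 1e-12)

def centroid_classify(x, train, labels):
    """Assign x to the class whose mean training vector is most similar."""
    classes = sorted(set(labels))
    centroids = np.stack(
        [train[:, [i for i, l in enumerate(labels) if l == c]].mean(axis=1)
         for c in classes],
        axis=1,
    )
    return classes[int(np.argmax(cosine(x, centroids)))]

def knn_classify(x, train, labels, k=3):
    """k-NN with a similarity-weighted vote (an assumed voting scheme)."""
    sims = cosine(x, train)
    votes = {}
    for i in np.argsort(sims)[::-1][:k]:
        votes[labels[i]] = votes.get(labels[i], 0.0) + sims[i]
    return max(votes, key=votes.get)

# Hypothetical toy data. Rows are terms: stock, market, bank, goal, team, match.
# Columns are six training documents holding raw term counts.
train = np.array([
    [2, 1, 0, 0, 0, 0],
    [1, 0, 2, 0, 0, 0],
    [0, 2, 1, 0, 0, 0],
    [0, 0, 0, 2, 1, 0],
    [0, 0, 0, 1, 0, 2],
    [0, 0, 0, 0, 2, 1],
], dtype=float)
labels = ["finance", "finance", "finance", "sports", "sports", "sports"]

# Step 1: represent documents (LSI with k = 2 latent dimensions).
U_k, s_k = lsi_fit(train, k=2)
train_lsi = lsi_project(train, U_k, s_k)

# Step 2: classify an unseen document containing "stock" and "market".
query = np.array([1, 1, 0, 0, 0, 0], dtype=float)
query_lsi = lsi_project(query, U_k, s_k)

print("LSI + centroid:", centroid_classify(query_lsi, train_lsi, labels))
print("LSI + k-NN:    ", knn_classify(query_lsi, train_lsi, labels, k=3))
print("VSM + centroid:", centroid_classify(query, train, labels))
print("VSM + k-NN:    ", knn_classify(query, train, labels, k=3))
```

On a real corpus the term weighting (for example tf-idf), the number of latent dimensions, and the value of k for k-NN would all need tuning; the abstract does not state the settings used in the thesis.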
URI:	http://140.113.39.130/cdrfb3/record/nctu/#GT008864515 http://hdl.handle.net/11536/75112
Appears in Collections:	Thesis

Files in This Item:

451501.pdf
