Title: | Co-cluster with dynamic weighting |
Authors: | Chang, Chih-Kai (張智愷); Lee, Chia-Hoang (李嘉晃); Institute of Computer Science and Engineering |
Keywords: | Document Clustering; Text Analysis; Data Mining; Information Retrieval; Co-Clustering |
Date issued: | 2010 |
Abstract: | Advances in technology and the growth of the Internet have caused the volume of information to rise rapidly, and as a result users must spend more time browsing to find the documents they need. Given the widespread use of search engines, people want to obtain information more efficiently and effectively, and clustering techniques play an important role in making that possible. If documents are properly clustered before a search, the search system can present more structured results to the user. This reduces the time spent searching and helps users find the documents they want faster.
This study takes Co-Clustering as its base method and improves on it, focusing on clustering performance and on increasing or decreasing feature weights. Using the Reuters, 20newsgroup, and classic3 data sets, it extracts core keywords and assigns them appropriate weights, which filters out unnecessary noise and strengthens the keywords. Keyword weights are adjusted using coordinate information, based on the distance between each core keyword and its cluster center. A logistic function then maps the keyword weights into the range between 0 and 1. The adjusted weights are assigned to the keywords and Co-Clustering is run again; these steps are repeated until convergence, which yields better clustering results.
This paper proposes a weighted co-clustering algorithm and applies it to the document clustering problem. Weighted co-clustering is an extension of co-clustering that uses two properties of co-clustering to design a dynamic weighting algorithm for terms. First, co-clustering places both documents and words in the same coordinate system using a spectral embedding technique. Second, co-clustering clusters documents and words simultaneously, so documents within the same cluster should be grouped together with their corresponding words. Based on these two properties, weighted co-clustering changes term weights iteratively. In addition, an outlier detection mechanism is proposed to exclude outlier documents from the clustering process. When the clustering process is complete, these outlier documents are assigned to appropriate clusters. Experiments on three data sets show that weighted co-clustering can effectively improve clustering performance. |
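The abstract states that term weights are derived from each keyword's distance to its cluster center and then mapped into the range between 0 and 1 with a logistic function, but it does not give the formula. The following is only a minimal sketch of that one step under stated assumptions: the function names, the `steepness` parameter, the use of the mean distance as the logistic midpoint, and the toy 2-D coordinates are all hypothetical and are not taken from the thesis.

```python
import math


def logistic(x):
    """Standard logistic function, mapping any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def reweight_terms(term_coords, centers, steepness=1.0):
    """Return one weight in (0, 1) per term.

    Assumption (not from the thesis): a term that lies closer to its
    nearest cluster center than the average term gets a weight above
    0.5, and a term that lies farther away gets a weight below 0.5.
    """
    # Distance from each term to its nearest cluster center.
    dists = [min(math.dist(t, c) for c in centers) for t in term_coords]
    mean = sum(dists) / len(dists)
    # Closer than average -> positive argument -> weight above 0.5.
    return [logistic(steepness * (mean - d)) for d in dists]


# Hypothetical 2-D embedding: three terms and two cluster centers.
terms = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
centers = [(0.0, 0.0), (5.0, 5.0)]
weights = reweight_terms(terms, centers)
```

In an iterative scheme like the one the abstract describes, weights of this kind would be applied to the term features before the next co-clustering pass, and the loop would stop once the cluster assignments no longer change.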
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT079855595 http://hdl.handle.net/11536/48330 |
Appears in Collections: | Thesis |