Title: Clustering Aggregation for Data Mixing Numerical and Categorical Types (混合型資料分群法之整合)
Authors: Lin, Jing-Wen (林靚雯)
Lin, Ja-Chen (林志青)
Institute of Computer Science and Engineering
Keywords: clustering; clustering aggregation; clustering ensemble; categorical data clustering; numerical data clustering; mixed-type data clustering; categorical data; numerical data; link-based
Issue Date: 2016
Abstract: When data are distributed across multiple systems, or when different attributes call for different clustering methods (e.g., a dataset with both categorical and numerical attributes can be clustered separately by methods suited to each type), several different clustering results may be produced. Most existing clustering aggregation methods work well on clusterings of numerical data but are more limited on categorical or mixed-type data. This thesis proposes algorithms that merge clustering results for mixed-type data and address a blind spot of traditional methods, which usually assume that the clusters within the same base clustering are independent. The work extends LCE [1], a link-based method that computes the similarity between each data point and each cluster and then partitions the data with Spectral Clustering. The proposed approach handles numerical and categorical data together and combines both results at the end. It comes in three versions. In Method 1, the relations between clusters are obtained indirectly through the data points. In Method 2, the neighborhood relationship between data points is also considered, and the similarity between two points is defined as the proportion of neighbors they share. In Method 3, each base clustering is assigned a weight, so that edges associated with a lower-weight base clustering have less influence on the aggregation.
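The abstract does not give formulas, so the following is only a minimal sketch of the two ideas it names: a shared-neighbor similarity between points (Method 2) and per-clustering weights (Method 3). Everything concrete here is an assumption made for illustration, not the thesis's definition: the function names, the rule that two points are neighbors when their weighted co-association reaches a threshold, the Jaccard-style ratio, and the toy labels. The final partitioning step (Spectral Clustering in the thesis) is left out.

```python
from itertools import combinations


def co_association(base_clusterings, weights=None):
    """Weighted fraction of base clusterings that place each pair of points together."""
    n = len(base_clusterings[0])
    if weights is None:
        weights = [1.0] * len(base_clusterings)
    total = sum(weights)
    co = [[0.0] * n for _ in range(n)]
    for labels, w in zip(base_clusterings, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    co[i][j] += w / total
    return co


def neighbor_sets(co, threshold=0.5):
    """Assumed neighbor rule: j is a neighbor of i if co[i][j] >= threshold."""
    n = len(co)
    return [{j for j in range(n) if co[i][j] >= threshold} for i in range(n)]


def shared_neighbor_similarity(neighbors):
    """Similarity of two points = shared neighbors / all neighbors of either point."""
    n = len(neighbors)
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        union = neighbors[i] | neighbors[j]  # never empty: each point neighbors itself
        sim[i][j] = sim[j][i] = len(neighbors[i] & neighbors[j]) / len(union)
    return sim


# Toy data: one base clustering from numerical attributes, one from categorical ones.
numeric_labels = [0, 0, 0, 1, 1, 1]
categorical_labels = [0, 0, 1, 1, 2, 2]
base = [numeric_labels, categorical_labels]

# Equal weights: both base clusterings count the same.
sim_equal = shared_neighbor_similarity(neighbor_sets(co_association(base)))

# Hypothetical weights that trust the numerical clustering three times as much.
sim_weighted = shared_neighbor_similarity(
    neighbor_sets(co_association(base, weights=[3.0, 1.0]))
)
```

With equal weights, points 2 and 3 keep some similarity because the categorical clustering groups them. Once that clustering is down-weighted, the link between them falls below the threshold and their similarity drops to zero, which is the kind of effect the abstract describes for Method 3.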
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070356108
http://hdl.handle.net/11536/138797
Appears in Collections: Thesis