Title: Peak-Climbing Clustering using Density, Distance and Similarity
Authors: Gao, Yu-Han (高于涵)
Lin, Ja-Chen (林志青)
Institute of Computer Science and Engineering
Keywords: Density; Distance and Similarity; Noise Elimination; High-Dimensional Data; Dimensionality Reduction
Publication date: 2017
Abstract: Computers are now routinely applied to tasks that are hard to carry out by hand, such as data analysis, image processing, and computer graphics, and research in these areas has had considerable influence in computer science. Data analysis serves as a fundamental, auxiliary tool in many applications, and clustering is one of its most commonly used techniques. Clustering divides a data set into groups according to the attributes or features of its points: the more similar the points within a group and the less similar the groups are to one another, the better the result. Clustering methods fall into several families, including distance-based and density-based methods. K-means is distance-based; it is efficient and easy to implement, but it does not give ideal results on ring-shaped or elongated data distributions. DBSCAN is a density-based method that overcomes this weakness of K-means and places no restriction on the shape of the data, but it is harder to implement and time-consuming. Peak-climbing clustering is also density-based; its procedure is much simpler than that of DBSCAN and it needs less computation time, but its processing time rises sharply on high-dimensional data. This thesis therefore extends the traditional peak-climbing clustering method with three improvements. First, the concept of subspaces is used, through dimensionality reduction and through splitting the dimensions and then merging the results, to reduce the time complexity on high-dimensional data. Second, the concept of density is used to identify noise points automatically. Third, distance-based and similarity-based methods are each combined with the clustering for fine-tuning, which improves the clustering result. Experiments show that the proposed methods achieve these goals.
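The abstract does not spell out the peak-climbing procedure, so the following is only a minimal sketch of the general idea of density-based hill climbing with a density threshold for noise, not the method proposed in the thesis. The function names, the fixed neighborhood radius, the neighbor-count density, and the min_density threshold are all assumptions made for illustration.

```python
from math import dist


def local_density(points, radius):
    """Count, for each point, how many other points lie within `radius`."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]


def climb_to_peak(points, radius, min_density=1):
    """Label each point with the index of the density peak it climbs to.

    Points whose density is below `min_density` are treated as noise (-1).
    Hypothetical illustration only; parameters are not taken from the thesis.
    """
    density = local_density(points, radius)
    n = len(points)

    def uphill(i):
        # Move to the densest neighbor within `radius`; ties go to the lower
        # index, so the (density, -index) key strictly increases at each step
        # and the climb always terminates.
        best = i
        for j in range(n):
            if j != i and dist(points[i], points[j]) <= radius:
                if (density[j], -j) > (density[best], -best):
                    best = j
        return best

    labels = []
    for i in range(n):
        if density[i] < min_density:
            labels.append(-1)  # too sparse: flag as noise
            continue
        cur = i
        while True:
            nxt = uphill(cur)
            if nxt == cur:  # reached a local density peak
                break
            cur = nxt
        labels.append(cur)
    return labels


# Two small square groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = climb_to_peak(points, radius=1.5)
# labels == [0, 0, 0, 0, 4, 4, 4, 4, -1]
```

Points that reach the same peak share a label, and the isolated point is flagged as noise because it has no neighbors within the radius. The pairwise distance computations make this sketch quadratic in the number of points, and each distance costs time proportional to the dimension, which is the kind of cost the abstract says grows on high-dimensional data.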
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070356028
http://hdl.handle.net/11536/140513
Appears in Collections: Thesis