A Fast Clustering Process for Outliers and Remainder Clusters

Title:	A Fast Clustering Process for Outliers and Remainder Clusters (一個找尋異常群集的快速分群演算法)
Authors:	Chih-Ming Su (蘇志明); Shian-Shyong Tseng (曾憲雄); Institute of Computer Science and Engineering
Keywords:	Data Mining; Clustering; Outliers
Issue Date:	1998
Abstract:	In many application domains, identifying outliers, that is, patterns that differ markedly from the ordinary clusters, is an important and fundamental step. In traditional data analysis and artificial intelligence, such outliers are usually removed or given lower weights so that they do not introduce large errors into the results. Data mining has drawn growing attention in recent years, and much work has been published on extracting implicit, useful information from large volumes of data. That work, however, has concentrated on finding frequently occurring association sets or trends in transaction records. This thesis takes the opposite view and studies how to find, in a large amount of data, the records that deviate from the ordinary groups; current outlier diagnostics are often inadequate at that scale. For example, records in an e-mail log that fall outside the range of normal, reasonable use can be tracked and analyzed, giving network administrators an important reference. We propose a two-phase clustering algorithm for outliers and remainder clusters, the small groups of patterns that differ greatly from the other clusters. In Phase 1 we modify the traditional k-means algorithm with a "jump" heuristic: if a new input pattern is far enough from all cluster centers, it becomes a new cluster center. During the iterative clustering stage this gives outliers a higher chance of being treated as separate clusters, so Phase 1 may produce more clusters than the number originally set for k-means. In Phase 2, a cluster-merging process based on the minimum spanning tree concept regroups the Phase 1 result into the number of clusters originally set by the user. Experiments on three kinds of data all gave very good results, showing that outliers and remainder clusters can be easily identified by the method.
URI:	http://140.113.39.130/cdrfb3/record/nctu/#NT870394013 http://hdl.handle.net/11536/64151
Appears in Collections:	Thesis
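
Illustrative sketch (not taken from the thesis): the abstract describes the two phases only in outline, so the Python below is one possible reading of it, under stated assumptions. The names phase1_jump_kmeans, phase2_mst_merge, two_phase and the jump_dist threshold are hypothetical. Seeding with the first k points, the fixed iteration count, and merging centers along the shortest minimum-spanning-tree edges (Kruskal style) are assumptions made to keep the example small, not details confirmed by the record.

```python
import math
from itertools import combinations


def phase1_jump_kmeans(points, k, jump_dist, iters=10):
    """k-means variant with a 'jump' rule (assumed form): a point farther than
    jump_dist from every center becomes a new center, so more than k clusters
    may come out of this phase."""
    centers = [tuple(p) for p in points[:k]]  # assumption: naive seeding
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            dists = [math.dist(p, c) for c in centers]
            j = min(range(len(centers)), key=dists.__getitem__)
            if dists[j] > jump_dist:  # the jump: open a new cluster at p
                centers.append(tuple(p))
                j = len(centers) - 1
            labels[i] = j
        # Move each center to the mean of its members; empty clusters keep
        # their old center.
        for j in range(len(centers)):
            members = [points[i] for i, lab in enumerate(labels) if lab == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, labels


def phase2_mst_merge(centers, k):
    """Join the closest centers along minimum-spanning-tree edges (Kruskal
    order with union-find) until only k groups remain."""
    parent = list(range(len(centers)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    edges = sorted(combinations(range(len(centers)), 2),
                   key=lambda e: math.dist(centers[e[0]], centers[e[1]]))
    groups = len(centers)
    for a, b in edges:
        if groups <= k:
            break
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            groups -= 1
    return [find(i) for i in range(len(centers))]


def two_phase(points, k, jump_dist):
    """Run both phases and return one group id per input point."""
    centers, labels = phase1_jump_kmeans(points, k, jump_dist)
    used = sorted(set(labels))  # ignore clusters that ended up empty
    remap = {old: new for new, old in enumerate(used)}
    group = phase2_mst_merge([centers[j] for j in used], k)
    return [group[remap[lab]] for lab in labels]


# Two dense blobs plus one distant point; very small final groups are the
# candidate outliers or remainder clusters.
demo_points = [(0, 0), (0, 1), (1, 0), (1, 1),
               (10, 10), (10, 11), (11, 10), (11, 11),
               (50, 50)]
demo_groups = two_phase(demo_points, k=3, jump_dist=5.0)
```

In this reading, a group that ends up with very few members after Phase 2 is a candidate outlier or remainder cluster. The choice of jump_dist controls how readily Phase 1 opens new clusters and would need tuning per data set.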