Feature-Based Ensemble Clustering with Indian Buffet Process

Title:	基於印度餐廳過程的特征抽取整合分群法 (Feature-Based Ensemble Clustering with Indian Buffet Process)
Authors:	Wei, Xinyu (魏新宇); Lee, Chia-Hoang (李嘉晃); Liu, Chien-Liang (劉建良); Chuang, Jen-Hui (莊仁輝); Institute of Computer Science and Engineering
Keywords:	Indian buffet process; ensemble clustering; feature subsets selection
Publication Date:	2016
Abstract:	As technology advances, the amount of data grows exponentially, which makes data clustering increasingly important as a technique for data exploration. Data at this scale and complexity can no longer be grouped by hand, so new algorithms are needed to cluster it automatically, accurately, and efficiently. Because clustering is an unsupervised learning method, improving performance and obtaining robust results are challenging tasks in machine learning. Specifying the number of clusters is a further problem for a certain class of clustering algorithms. Previous studies have shown that ensemble clustering, which combines several clustering methods and aggregates their results, often outperforms a single clustering method and gives more stable results. This thesis therefore proposes a feature-based ensemble clustering model based on the Indian Buffet Process (IBP). The proposed model does not need to know the number of clusters in advance; it learns the most suitable number from the data during clustering. The method uses quality and diversity as the criteria for selecting feature subsets, combining the IBP with a proposed greedy algorithm. Each feature subset is treated as a view of the data, and each subset yields ten clustering results. The final clustering is obtained by aggregating all of these results with the proposed aggregation algorithm, which improves clustering performance. The experimental results indicate that the proposed model generally outperforms other unsupervised learning methods.
URI:	http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070356146 http://hdl.handle.net/11536/138957
Appears in Collections:	Thesis
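
The abstract names the Indian Buffet Process as the basis for feature subset selection but does not say how the thesis applies it. As background only, the following is a minimal sketch of the standard IBP generative process, which draws a binary matrix with an unbounded number of columns. The function names `sample_poisson` and `sample_ibp`, the concentration parameter `alpha`, and the seed handling are illustrative assumptions, not taken from the thesis.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) sample with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_ibp(num_customers, alpha, seed=0):
    """Sample a binary customer-by-dish matrix from the standard IBP.

    Customer i (1-based) takes each existing dish k with probability m_k / i,
    where m_k is how many earlier customers took it, and then tries
    Poisson(alpha / i) new dishes. Hypothetical helper, not the thesis code.
    """
    rng = random.Random(seed)
    dish_counts = []  # m_k for every dish seen so far
    rows = []         # per-customer sets of chosen dish indices
    for i in range(1, num_customers + 1):
        chosen = set()
        # Revisit existing dishes in proportion to their popularity.
        for k, m_k in enumerate(dish_counts):
            if rng.random() < m_k / i:
                chosen.add(k)
        # Open a Poisson-distributed number of brand-new dishes.
        for _ in range(sample_poisson(alpha / i, rng)):
            dish_counts.append(0)
            chosen.add(len(dish_counts) - 1)
        for k in chosen:
            dish_counts[k] += 1
        rows.append(chosen)
    num_dishes = len(dish_counts)
    return [[1 if k in chosen else 0 for k in range(num_dishes)]
            for chosen in rows]


if __name__ == "__main__":
    for row in sample_ibp(num_customers=6, alpha=2.0, seed=1):
        print(row)
```

Each row of the returned matrix is one customer and each column one dish. The number of columns is not fixed in advance, which is the property that makes the IBP attractive when the number of latent components is unknown. How the thesis maps customers and dishes onto data points, features, or feature subsets is not stated in the abstract.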