Clustering with Labeled and Unlabeled Data Based on Constrained-Nonnegative Matrix Factorization

Title:	Clustering with Labeled and Unlabeled Data Based on Constrained-Nonnegative Matrix Factorization
Authors:	Li, Hsuan-Hsun (李炫勳); Lee, Chia-Hoang (李嘉晃); Institute of Computer Science and Engineering
Keywords:	machine learning; semi-supervised learning; non-negative matrix factorization
Issue Date:	2011
Abstract:	The data available on the web is vast and disorganized, and most of it is unlabeled and unstructured. Analyzing it is therefore highly complex and cannot be done by human effort alone, so machines are needed to classify or cluster the data and reorganize it into structured knowledge. Classification and clustering methods fall into two categories: supervised learning and unsupervised learning. Supervised methods need enough labeled data to train a classification model, and labeling data often costs a great deal of labor and time. Unsupervised methods need no labeled data, but users often have some background knowledge before clustering, and in principle this knowledge should be fed into the system so that it can cluster quickly and effectively. This thesis therefore adds a small amount of labeled data and exploits this known information to achieve better clustering while reducing human effort. We propose the Constrained-Nonnegative Matrix Factorization algorithm, a semi-supervised learning algorithm that uses a small amount of labeled data as constraints to improve overall clustering quality. We also design a Constrained-Fuzzy Cmeans algorithm whose performance improves markedly when it is given only a small amount of label information. To keep Constrained-Nonnegative Matrix Factorization within a good region of convergence, we use Constrained-Fuzzy Cmeans to find a better initial point and design constraints from the labeled data to control overall clustering performance, which gives Constrained-Nonnegative Matrix Factorization outstanding performance. With this clustering framework, our experiments compare against other semi-supervised methods and show that Constrained-Nonnegative Matrix Factorization delivers stable and superior results.

Semi-supervised clustering methods, which aim to cluster a data set under the guidance of some supervisory information, have become a topic of significant research. The supervisory information is usually used as constraints to bias clustering toward a good region of the search space. In this paper, we propose a semi-supervised algorithm, Constrained-Nonnegative Matrix Factorization, which uses a small amount of labeled data as constraints to cluster data. The proposed algorithm is a matrix factorization algorithm. Intuitively, a good initial point can speed up clustering convergence and may lead to a better local optimum. As a result, we devise an algorithm called Constrained-Fuzzy Cmeans to obtain the initial point. The evaluation function is a key element in evaluating the solution calculated by Constrained-Nonnegative Matrix Factorization, so we discuss the evaluation of Constrained-Nonnegative Matrix Factorization. Finally, we conduct experiments on several data sets, including CiteULike, Classic3, 20 Newsgroups and Reuters, and compare with other semi-supervised learning algorithms. The experimental results indicate that the proposed method can effectively improve clustering performance.
URI:	http://140.113.39.130/cdrfb3/record/nctu/#GT079955612 http://hdl.handle.net/11536/50520
Appears in Collections:	Thesis
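
The abstract describes the general idea of using a few labeled samples both as constraints on a nonnegative factorization and as a guide to a good starting point. The record does not give the thesis's actual update rules or its Constrained-Fuzzy Cmeans initializer, so the following is only a minimal sketch under stated assumptions: it uses standard multiplicative NMF updates, clamps the membership rows of labeled samples to their known cluster, and seeds the cluster bases from labeled class means as a stand-in for the initializer. The function name `constrained_nmf_sketch` and all of its parameters are hypothetical.

```python
import numpy as np


def constrained_nmf_sketch(X, k, labels, n_iter=100, eps=1e-9):
    """Illustrative label-clamped NMF (NOT the thesis's algorithm).

    X      : (n_samples, n_features) nonnegative data matrix.
    k      : number of clusters.
    labels : int array of length n_samples; -1 marks unlabeled samples.
             Assumption: every cluster 0..k-1 has at least one labeled sample.
    Returns (predicted cluster per sample, memberships U, cluster bases V).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, _ = X.shape
    idx = np.flatnonzero(labels >= 0)  # positions of labeled samples

    # Stand-in initialization: one basis row per cluster, taken from the
    # mean of that cluster's labeled samples (assumed, not from the source).
    V = np.vstack([X[labels == c].mean(axis=0) for c in range(k)]) + eps
    U = np.full((n, k), 1.0 / k)

    def clamp(U):
        # Constraint: a labeled sample belongs fully to its known cluster.
        U[idx] = 0.0
        U[idx, labels[idx]] = 1.0

    clamp(U)
    for _ in range(n_iter):
        # Standard multiplicative updates for min ||X - U V||_F^2.
        U *= (X @ V.T) / (U @ V @ V.T + eps)
        clamp(U)
        V *= (U.T @ X) / (U.T @ U @ V + eps)
    return U.argmax(axis=1), U, V
```

Clamping is the simplest way to encode label constraints in this setting; a penalty term in the objective would be a softer alternative, and which form the thesis uses is not stated in this record.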