Title: | Clustering with Labeled and Unlabeled Data Based on Constrained-Nonnegative Matrix Factorization (基於Constrained-Nonnegative Matrix Factorization之半監督式分群法) |
Author: | Li, Hsuan-Hsun (李炫勳); Lee, Chia-Hoang (李嘉晃); Institute of Computer Science and Engineering |
Keywords: | machine learning; semi-supervised learning; non-negative matrix factorization |
Date of Publication: | 2011 |
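Abstract: | Data on the web is vast and messy, and most of it is unlabeled and unstructured. Analyzing it is therefore highly complex and cannot be done by human effort alone, so machines are needed to classify or cluster the data and reorganize it into structured knowledge. Classification and clustering methods fall into two categories: supervised learning and unsupervised learning. Supervised methods need enough labeled data to train a classification model, and labeling often costs a great deal of labor and time. Unsupervised methods need no labeled data, but users often have some background knowledge before clustering, and in principle that knowledge should be fed into the system so that it can cluster quickly and effectively. This thesis therefore adds a small amount of labeled data and uses this known information to achieve better clustering while reducing the human effort involved.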
This thesis proposes the Constrained-Nonnegative Matrix Factorization algorithm, a semi-supervised learning algorithm that uses a small amount of labeled data as constraints to improve overall clustering. It also designs a Constrained-Fuzzy Cmeans algorithm, which improves performance markedly when given only a small amount of label information. To keep Constrained-Nonnegative Matrix Factorization within a good region of convergence, the thesis uses Constrained-Fuzzy Cmeans to find a better initial point and derives constraints from the labeled data to control overall clustering performance, which gives Constrained-Nonnegative Matrix Factorization its strong results. In experiments comparing this clustering framework with other semi-supervised methods, Constrained-Nonnegative Matrix Factorization shows stable and superior performance.
Semi-supervised clustering methods, which aim to cluster a data set under the guidance of some supervisory information, have become a topic of significant research. The supervisory information is usually used as constraints to bias clustering toward a good region of the search space. In this paper, we propose a semi-supervised algorithm, Constrained-Nonnegative Matrix Factorization, which clusters data using a small amount of labeled data as constraints. The proposed algorithm is a matrix factorization algorithm. Intuitively, a good initial point can speed up clustering convergence and may lead to a better local optimum. We therefore devise an algorithm called Constrained-Fuzzy Cmeans to obtain the initial point. The evaluation function is a key element in assessing the solution computed by Constrained-Nonnegative Matrix Factorization, so we also discuss how that solution is evaluated. Finally, we conduct experiments on several data sets, including CiteULike, Classic3, 20 Newsgroups and Reuters, and compare against other semi-supervised learning algorithms. The experimental results indicate that the proposed method can effectively improve clustering performance. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT079955612 http://hdl.handle.net/11536/50520 |
Appears in Collections: | Thesis |
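The abstract describes clustering by nonnegative matrix factorization with a few labeled samples acting as constraints and a label-informed starting point. The sketch below illustrates that general idea only; it is not the thesis's algorithm, whose exact constraint formulation and Constrained-Fuzzy Cmeans initializer are not given in this record. It assumes plain NMF with standard multiplicative updates for the squared Frobenius error, assumes the constraint is imposed by clamping the membership columns of labeled samples to one-hot vectors, and substitutes a simple class-mean initialization for Constrained-Fuzzy Cmeans. The function name `semi_supervised_nmf` and all of its parameters are hypothetical.

```python
import numpy as np


def semi_supervised_nmf(X, k, labels, n_iter=200, eps=1e-9, seed=0):
    """Cluster the columns of a nonnegative matrix X (features x samples).

    labels: integer array of length n_samples; -1 marks an unlabeled sample.
    Returns (assignments, W, H) with X approximately equal to W @ H.
    Illustrative sketch only, not the method from the thesis.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    d, n = X.shape
    labeled = labels >= 0

    # Assumed constraint: labeled samples get fixed one-hot memberships.
    H_fixed = np.zeros((k, n))
    H_fixed[labels[labeled], np.nonzero(labeled)[0]] = 1.0

    # Assumed initialization (stand-in for Constrained-Fuzzy Cmeans):
    # start each basis vector at the mean of its labeled samples, if any.
    W = rng.random((d, k)) + eps
    for c in range(k):
        members = labels == c
        if members.any():
            W[:, c] = X[:, members].mean(axis=1) + eps

    H = rng.random((k, n)) + eps
    H[:, labeled] = H_fixed[:, labeled]

    for _ in range(n_iter):
        # Multiplicative update for H, then re-impose the label constraint.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        H[:, labeled] = H_fixed[:, labeled]
        # Multiplicative update for W with H held fixed.
        W *= (X @ H.T) / (W @ H @ H.T + eps)

    # Each sample joins the cluster with the largest membership weight.
    return H.argmax(axis=0), W, H
```

Because the update for each column of `H` depends only on that column, clamping the labeled columns is equivalent to updating the unlabeled columns alone, so the labeled samples steer the basis matrix `W` without being reassigned themselves.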