Title: 基於Constrained-pLSA之半監督式判別分群
Semi-Supervised Discriminant Clustering via Constrained-pLSA
Authors: 苟富昇
Gou, Fu-Sheng
李嘉晃
Lee, Chia-Hoang
Institute of Multimedia Engineering
Keywords: machine learning; semi-supervised learning; dimensionality reduction
Issue Date: 2011
Abstract: Advances in technology and the growth of the Internet have rapidly increased the volume of documents available online, so helping users find the information they need quickly and accurately has become an important research topic. Supervised learning is a popular approach to document classification, but it needs sufficient labeled data to train a model, and labeling is typically done manually at a considerable cost in labor and time. Unlabeled data, by contrast, is easy to collect online. Unsupervised learning methods need no labels, yet users often have some background knowledge before clustering, and that knowledge should be incorporated so that the algorithm is steered toward the correct clustering. This thesis therefore adds a small amount of labeled data and uses this known information to improve clustering accuracy without requiring heavy manual labeling effort.
This thesis proposes a semi-supervised learning method that performs dimensionality reduction and clustering together. The method uses Constrained-pLSA to obtain, for each document, its probability of membership in each cluster (soft labels), and then combines these soft labels with linear discriminant analysis (LDA) to find a feature space in which clustering improves. Experiments on the CiteUlike, 20Newsgroups, and Reuters datasets reduce the high-dimensional data to a low-dimensional space before clustering. The results indicate that the proposed method improves clustering performance with only a small amount of labeled data.
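The abstract does not give the formulation used to combine soft labels with linear discriminant analysis. The following is a minimal sketch of one plausible way to do it, assuming probability-weighted scatter matrices. The function name `soft_lda_projection`, the `reg` parameter, and the weighting scheme are illustrative assumptions, not the thesis's actual method. The matrix `P` stands in for the cluster membership probabilities that Constrained-pLSA would produce, which is not implemented here.

```python
import numpy as np


def soft_lda_projection(X, P, n_components, reg=1e-6):
    """Soft-label variant of linear discriminant analysis (illustrative sketch).

    X: (n_docs, n_features) document vectors.
    P: (n_docs, n_clusters) membership probabilities, one row per document.
       Assumed here to come from a soft clustering step; every cluster is
       assumed to have positive total weight.
    Returns (W, Z): the projection matrix and the projected documents X @ W.
    """
    X = np.asarray(X, dtype=float)
    P = np.asarray(P, dtype=float)
    n_features = X.shape[1]

    weights = P.sum(axis=0)                    # soft cluster sizes
    means = (P.T @ X) / weights[:, None]       # probability-weighted cluster means
    overall_mean = (weights @ means) / weights.sum()

    # Between-cluster scatter: spread of cluster means around the overall mean.
    diff_b = means - overall_mean
    Sb = (diff_b * weights[:, None]).T @ diff_b

    # Within-cluster scatter: document i counts toward cluster k with weight P[i, k].
    Sw = np.zeros((n_features, n_features))
    for k in range(P.shape[1]):
        diff_w = X - means[k]
        Sw += (diff_w * P[:, k][:, None]).T @ diff_w
    Sw += reg * np.eye(n_features)             # small ridge keeps Sw positive definite

    # Solve Sb w = lambda Sw w by whitening with the Cholesky factor Sw = L L^T.
    L = np.linalg.cholesky(Sw)
    A = np.linalg.solve(L, np.linalg.solve(L, Sb).T)   # L^{-1} Sb L^{-T}
    A = (A + A.T) / 2.0                                # remove rounding asymmetry
    eigvals, V = np.linalg.eigh(A)                     # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]

    W = np.linalg.solve(L.T, V[:, top])                # map back: w = L^{-T} v
    return W, X @ W
```

Under these assumptions, the projected matrix `Z` can then be passed to any standard clustering algorithm, which corresponds to the reduce-then-cluster procedure the abstract describes.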
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT079957541
http://hdl.handle.net/11536/50607
Appears in Collections: Thesis