Title: Semi-Supervised Discriminant Clustering via Constrained-pLSA
Authors: Gou, Fu-Sheng (苟富升)
Lee, Chia-Hoang (李嘉晃)
Institute of Multimedia Engineering
Keywords: machine learning; semi-supervised learning; dimensionality reduction
Issue Date: 2011
Abstract: Advances in technology and the growth of the Internet have rapidly increased the volume of online documents, and helping users find the information they need quickly and accurately has become an important research problem. Supervised learning is a popular approach to document classification, but it needs sufficient labeled data to train a model, and labeling is typically manual and time-consuming. Unlabeled data, by contrast, is easy to collect online. Unsupervised methods need no labels, yet users often have some background knowledge before clustering, and in principle that knowledge should be fed into the system to steer the clustering in the right direction. This thesis therefore adds a small amount of labeled data to improve clustering without demanding heavy labeling effort.
This thesis proposes a semi-supervised learning method that performs dimensionality reduction and clustering together. Constrained-pLSA first estimates, for each document, its membership probability in each cluster (a soft label); these soft labels are then combined with linear discriminant analysis (LDA) to find a feature space in which clustering improves. Experiments on the CiteUlike, 20Newsgroups, and Reuters data sets reduce the high-dimensional data to a low-dimensional space and then cluster it; the results indicate that the proposed method performs well with only a small amount of labeled data.
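The two-stage idea in the abstract (soft labels from a constrained pLSA, then a discriminant projection weighted by those soft labels) can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation: the record does not say how the constraints enter pLSA or how the soft labels enter LDA. The sketch assumes that labeled documents have P(z|d) clamped to a one-hot vector after every EM step, and that the LDA scatter matrices are weighted by the soft labels. The function names `constrained_plsa` and `soft_lda` and the toy count matrix are hypothetical.

```python
import numpy as np

def constrained_plsa(X, k, labels, n_iter=50, seed=0):
    """pLSA fitted by EM on a document-term count matrix X (n_docs x n_terms).

    labels[i] is a class index for a labeled document and -1 otherwise.
    ASSUMPTION: the constraint is enforced by clamping P(z|d) to a one-hot
    vector for labeled documents; the thesis may do this differently.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    p_z_d = rng.random((n_docs, k))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((k, n_terms))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    labeled = labels >= 0
    onehot = np.eye(k)[labels[labeled]]
    p_z_d[labeled] = onehot
    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]      # (docs, k, terms)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate both distributions from count-weighted posteriors.
        weighted = X[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
        p_z_d[labeled] = onehot                             # re-apply the constraint
    return p_z_d

def soft_lda(X, soft, n_components, reg=1e-6):
    """LDA projection whose scatter matrices are weighted by soft labels.

    ASSUMPTION: soft[i, c] acts as a fractional membership of sample i in class c.
    """
    n, d = X.shape
    mean = X.mean(axis=0)
    mass = soft.sum(axis=0) + 1e-12                         # effective class sizes
    means = (soft.T @ X) / mass[:, None]                    # weighted class means
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for c in range(soft.shape[1]):
        diff = (means[c] - mean)[:, None]
        Sb += mass[c] * (diff @ diff.T)                     # between-class scatter
        centered = X - means[c]
        Sw += (centered * soft[:, c:c + 1]).T @ centered    # within-class scatter
    Sw += reg * np.eye(d)                                   # keep Sw positive definite
    # Solve Sb w = lambda Sw w by whitening with the Cholesky factor of Sw.
    L_inv = np.linalg.inv(np.linalg.cholesky(Sw))
    evals, V = np.linalg.eigh(L_inv @ Sb @ L_inv.T)         # eigenvalues ascending
    W = L_inv.T @ V[:, ::-1][:, :n_components]              # keep the largest ones
    return X @ W, W

# Toy corpus: six documents over six terms, one labeled document per class.
X = np.array([[5, 4, 3, 0, 0, 1],
              [4, 5, 2, 1, 0, 0],
              [3, 4, 5, 0, 1, 0],
              [0, 1, 0, 5, 4, 3],
              [1, 0, 0, 4, 5, 4],
              [0, 0, 1, 3, 4, 5]], dtype=float)
labels = np.array([0, -1, -1, 1, -1, -1])

soft = constrained_plsa(X, k=2, labels=labels)   # soft labels P(z|d)
Z, W = soft_lda(X, soft, n_components=1)         # low-dimensional features
```

Any standard clustering algorithm (k-means, for instance) could then be run on `Z` to obtain the final clusters; which one the thesis uses is not stated in this record.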
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT079957541
http://hdl.handle.net/11536/50607
Appears in Collections: Thesis