Title: | Document Clustering with Labeled and Unlabeled Data Using Constrained-PLSA |
Authors: | Chen, Chun-Hsien (陈俊宪); Lee, Chia-Hoang (李嘉晃); Institute of Computer Science and Engineering |
Keywords: | semi-supervised learning; machine learning; tag analysis; PLSA |
Publication Date: | 2010 |
Abstract: | The volume of text available online is massive, and unlabeled data is easy to collect. Supervised learning, a popular approach to text classification, requires enough labeled data to train a model, and labeling is typically manual and costs considerable labor and time. Unsupervised learning needs no labeled data, but users often have some background knowledge before clustering, and in principle that knowledge should be fed into the algorithm so that it clusters faster and more accurately. This thesis therefore adds a small amount of labeled data to obtain better results without heavy manual involvement. It extends the PLSA clustering model to propose Constrained-PLSA, a semi-supervised learning algorithm. Constrained-PLSA assumes that the data is generated by a mixture model and that the correspondence between each document and class label is one to one. By introducing the seed documents as constraints, the thesis shows that Constrained-PLSA can compute maximum-likelihood estimates in the latent variable model with the Expectation Maximization (EM) algorithm; the labeled information steers the unlabeled documents toward the correct clusters. Experimental results show that a small number of labeled examples is enough for Constrained-PLSA to reach stable, improved clustering performance. The thesis also uses Constrained-PLSA to study tag usage on a data set of academic papers. Each paper comes with an abstract and tags; the tags are keywords that users assigned after reading the paper, so they carry important information. Four ways of combining abstracts and tags are analyzed: “words only”, “tags only”, “words + tags”, and “tags as words”. Experiments with several clustering algorithms examine which combination lets the tags improve performance the most, and the best one is presented. The results also show that Constrained-PLSA outperforms the other clustering algorithms. |
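The abstract's idea of using seed documents as constraints on EM can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes a document-clustering form of PLSA with parameters P(z|d) and P(w|z), clamps the E-step posterior of every seed document to its known class, and initialises P(w|z) from add-one-smoothed seed counts. The function name `constrained_plsa`, its parameters, and the toy corpus are hypothetical.

```python
def constrained_plsa(docs, n_topics, seeds, n_iter=50):
    """Sketch of seed-constrained PLSA (hypothetical API, not the thesis code).

    docs:  list of {word: count} dictionaries, one per document
    seeds: {doc_index: class_index} for the few labeled documents
    """
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)

    # Assumption: initialise P(w|z) from add-one-smoothed counts of the seeds.
    p_w_z = []
    for k in range(n_topics):
        row = {w: 1.0 for w in vocab}
        for d, label in seeds.items():
            if label == k:
                for w, n in docs[d].items():
                    row[w] += n
        total = sum(row.values())
        p_w_z.append({w: v / total for w, v in row.items()})
    p_z_d = [[1.0 / n_topics] * n_topics for _ in range(n_docs)]

    for _ in range(n_iter):
        # Expected counts; the tiny floor keeps every P(w|z) strictly positive.
        acc_w_z = [{w: 1e-12 for w in vocab} for _ in range(n_topics)]
        acc_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        for d, doc in enumerate(docs):
            for w, n in doc.items():
                if d in seeds:
                    # Constraint: a seed document belongs to exactly one class.
                    post = [1.0 if k == seeds[d] else 0.0 for k in range(n_topics)]
                else:
                    # Ordinary PLSA E-step: P(z|d,w) is proportional to P(z|d) P(w|z).
                    joint = [p_z_d[d][k] * p_w_z[k][w] for k in range(n_topics)]
                    total = sum(joint)
                    if total > 0:
                        post = [j / total for j in joint]
                    else:
                        post = [1.0 / n_topics] * n_topics
                for k in range(n_topics):
                    acc_w_z[k][w] += n * post[k]
                    acc_z_d[d][k] += n * post[k]
        # M-step: renormalise the expected counts.
        for k in range(n_topics):
            total = sum(acc_w_z[k].values())
            p_w_z[k] = {w: v / total for w, v in acc_w_z[k].items()}
        for d in range(n_docs):
            total = sum(acc_z_d[d])
            if total > 0:  # an empty document keeps its previous distribution
                p_z_d[d] = [v / total for v in acc_z_d[d]]
    return p_z_d, p_w_z


# Toy corpus (made up for illustration): two seeds, four unlabeled documents.
docs = [
    {"plsa": 3, "topic": 2},               # seed, class 0
    {"kernel": 3, "margin": 2},            # seed, class 1
    {"plsa": 1, "topic": 2, "latent": 2},
    {"kernel": 2, "margin": 1, "svm": 2},
    {"topic": 3, "latent": 1},
    {"margin": 2, "svm": 3},
]
seeds = {0: 0, 1: 1}
p_z_d, p_w_z = constrained_plsa(docs, n_topics=2, seeds=seeds)
clusters = [max(range(2), key=lambda k: row[k]) for row in p_z_d]
print(clusters)
```

In this sketch each unlabeled document is assigned to the class with the largest P(z|d). Words that never occur in a seed, such as "latent" and "svm" here, are pulled toward a class only through the documents they share with seeded vocabulary, which is the sense in which the labeled data guides the unlabeled data.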
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT079855611 http://hdl.handle.net/11536/48348 |
Appears in Collections: | Thesis |
Files in This Item:
If it is a zip file, please download the file and unzip it, then open index.html in a browser to view the full text content.