Title: Text Mining with Semi-Supervised Learning (運用半監督式學習法之文件探勘研究)
Authors: Hsaio, Wen-Hoar (蕭文豪)
Lee, Chia-Hoang (李嘉晃)
Liu, Chien-Liang (劉建良)
Institute of Computer Science and Engineering
Keywords: Machine Learning; Supervised Learning; Unsupervised Learning; Semi-supervised Learning; Universum Learning; Sparse Coding
Issue Date: 2015
Abstract: As the Internet grows, an overwhelming number of information sources, including documents and blog articles, have become available on the web. These sources carry rich semantic information, since people write them for specific purposes. How to organize these documents effectively and automatically has become an attractive research topic in the machine learning community. Semi-supervised learning, which learns from a combination of labeled and unlabeled data, is a machine learning approach that lies between unsupervised and supervised learning. It has become an active research area and has received considerable attention over the last decade. In addition, sparse representations have proven to be an extremely powerful tool for acquiring, representing, and compressing high-dimensional objects in signal processing and computer vision. Moreover, learning with Universum, which uses examples whose distribution differs from that of the target data to estimate prior model information, is a popular research subject in machine learning. This thesis focuses on text mining with semi-supervised learning and proposes four semi-supervised learning algorithms: Constrained-PLSA, SSS-MF, Semi-LDC, and ԱSemi-AdaBoost.MH. To evaluate them, the thesis conducts experiments on four well-known real-world document data sets and compares the proposed algorithms with several state-of-the-art semi-supervised learning algorithms. The experimental results indicate that the proposed methods generally outperform the compared methods on the given data sets.
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT079655808
http://hdl.handle.net/11536/126270
Appears in Collections: Thesis