Title: Fuzzy Frequent Itemset-based Textual Document Clustering (以模糊理论与高频项目集为基础之文件分群研究)
Authors: Chen, Chun-Ling (陈淳龄)
Liang, Tyne (梁婷)
Tseng, Frank S.C. (曾守正)
Institute of Computer Science and Engineering
Keywords: Document Clustering; Text Mining; Association Rule Mining; Frequent Itemsets; Fuzzy Set Theory; WordNet
Publication Date: 2009
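Abstract: As the number of text documents grows rapidly, document clustering techniques can be used to manage these large collections effectively and to ease later retrieval and browsing. To improve clustering quality, researchers have in recent years applied the frequent itemsets produced by association rule mining to document clustering, addressing several common problems: high-dimensional vocabularies, execution efficiency, clustering accuracy, and the automatic generation of meaningful cluster labels. However, association rule mining tends to overlook key terms that are important but occur infrequently. Moreover, when items are highly correlated, it produces an excessive number of frequent itemsets, which makes clustering take too long. This study therefore proposes three document clustering methods based on fuzzy theory and frequent itemsets. They use the fuzzy frequent itemsets produced by fuzzy association rule mining to reduce the term dimensionality effectively, and they classify each term as a high-, mid-, or low-frequency term according to its distribution and frequency of occurrence in the document collection.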
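This study first proposes the Fuzzy Frequent Itemset-based Hierarchical Document Clustering (F2IHC) method, which uses fuzzy association rule mining to find associations among key terms, generates candidate clusters from the fuzzy frequent itemsets, and clusters documents by computing the similarity between each document and the candidate clusters. The clustering result is presented as a hierarchical cluster tree, so that the resulting clusters are easy to browse. Second, to label clusters automatically with conceptual terms, we propose the Fuzzy Frequent Itemset-based Document Clustering (F2IDC) method, which uses WordNet to explore the semantic relations among key terms and adds the hypernyms mapped from WordNet to the documents, from which conceptual cluster labels are extracted to represent cluster topics. Third, we propose the Fuzzy Frequent Itemset-based Soft Clustering (F2ISC) method, which extends F2IDC and adopts the α-cut of fuzzy theory so that a document can be assigned to one or more clusters.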
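In the clustering process of this study, fuzzy frequent itemsets reduce the term dimensionality, and the number of fuzzy frequent itemsets generated does not increase with the number of documents, so the methods can be applied effectively to large document collections. Compared with traditional clustering methods, the experimental results show that the proposed methods effectively improve the accuracy and efficiency of document clustering.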
With the rapid growth of text documents, document clustering techniques are emerging as a means of efficient document retrieval and better document browsing. Recently, several methods have been proposed that cluster documents using frequent itemsets derived from association rule mining, in order to resolve the problems of high dimensionality, scalability, accuracy, and meaningful cluster labels. However, two problems remain when association rule mining is used: (1) important but sparse key terms may be obscured; (2) too many itemsets are produced, especially when items in the dataset are highly correlated. Moreover, frequent itemset-based clustering methods usually need a lot of time to generate the large number of itemsets. Considering these issues, we present three fuzzy frequent itemset-based document clustering approaches, which use fuzzy association rule mining to achieve significant dimensionality reduction through interesting fuzzy frequent itemsets. By applying fuzzy association rule mining, each term in the document dataset is labeled with a linguistic term, such as Low, Mid, or High.
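As an illustration of the linguistic labeling described above, the following Python sketch maps a term's frequency in a document to membership degrees in Low, Mid, and High. The membership shapes, the breakpoints (`low_peak`, `mid_peak`, `high_peak`), and the function names are hypothetical choices made for this example; the abstract does not specify the functions used in the thesis. The `fuzzy_support` helper averages memberships over documents, which is one common way to define support in fuzzy association rule mining, and is likewise an assumption.

```python
def _falling(x, a, b):
    """1.0 for x <= a, 0.0 for x >= b, linear in between."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    return (b - x) / (b - a)


def _rising(x, a, b):
    """0.0 for x <= a, 1.0 for x >= b, linear in between."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


def fuzzy_memberships(freq, low_peak=1.0, mid_peak=3.0, high_peak=5.0):
    """Membership degrees of a term frequency in Low, Mid, and High.

    The breakpoints are illustrative assumptions, not values from the thesis.
    """
    return {
        "Low": _falling(freq, low_peak, mid_peak),
        # Mid is a triangle peaking at mid_peak.
        "Mid": min(_rising(freq, low_peak, mid_peak),
                   _falling(freq, mid_peak, high_peak)),
        "High": _rising(freq, mid_peak, high_peak),
    }


def fuzzy_support(freqs, label):
    """Average membership of one term in `label` across all documents.

    `freqs` holds the term's frequency in each document (assumed definition).
    """
    return sum(fuzzy_memberships(f)[label] for f in freqs) / len(freqs)
```

For example, under these assumed breakpoints a term that occurs twice in a document belongs half to Low and half to Mid, so a (term, label) pair can still reach a support threshold even when the raw term count is small.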
First, we propose the Fuzzy Frequent Itemset-based Hierarchical Document Clustering (F2IHC) approach, which employs fuzzy set theory for document representation to find suitable fuzzy frequent itemsets for clustering documents. In addition, F2IHC constructs a hierarchical cluster tree to provide flexible browsing. Second, in order to label clusters with conceptual terms, we present a Fuzzy Frequent Itemset-based Document Clustering (F2IDC) approach, which uses WordNet as background knowledge to explore better ways of representing documents semantically for clustering. F2IDC dynamically derives a hierarchical organization of hypernymy from WordNet based on the content of each document, without the use of training data or standard clustering techniques. Third, we propose a Fuzzy Frequent Itemset-based Soft Clustering (F2ISC) approach, which extends F2IDC to allow overlapping clusters. F2ISC provides an accurate measure of confidence and adopts the α-cut concept to assign each document to one or more clusters.
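The α-cut assignment mentioned above can be sketched in Python as follows. This is a minimal illustration, not the thesis's implementation: the function name `alpha_cut_assign`, the default threshold, and the fallback to the single best cluster when no membership reaches α are all assumptions introduced here so that every document lands in at least one cluster.

```python
def alpha_cut_assign(memberships, alpha=0.5):
    """Return the clusters whose membership degree is at least `alpha`.

    `memberships` maps a cluster label to the document's membership degree
    in that cluster (values in [0, 1]). If no cluster reaches `alpha`, the
    best-scoring cluster is returned alone (an assumption of this sketch).
    """
    selected = [label for label, degree in memberships.items() if degree >= alpha]
    if not selected:
        selected = [max(memberships, key=memberships.get)]
    return sorted(selected)
```

With hypothetical degrees `{"sports": 0.8, "finance": 0.6, "health": 0.2}` and α = 0.5, the document is placed in both `finance` and `sports`; raising α makes the assignment stricter and reduces overlap between clusters.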
As a result, the proposed clustering approaches use the interesting fuzzy frequent itemsets to reduce the dimensionality of term vectors. In addition, the number of these itemsets does not grow with the number of documents, so our approaches scale well to large document collections. Our experimental results show that the proposed F2IHC, F2IDC, and F2ISC approaches provide more accurate clustering results than prior influential clustering methods presented in recent literature.
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT079323804
http://hdl.handle.net/11536/40583
Appears in Collections: Thesis


Files in This Item:

  1. 380401.pdf
