以模糊理論與高頻項目集為基礎之文件分群研究

Full metadata record

DC Field	Value	Language
dc.contributor.author	陳淳齡	en_US
dc.contributor.author	Chen, Chun-Ling	en_US
dc.contributor.author	梁婷	en_US
dc.contributor.author	曾守正	en_US
dc.contributor.author	Liang, Tyne	en_US
dc.contributor.author	Tseng, Frank S.C.	en_US
dc.date.accessioned	2014-12-12T01:22:57Z	-
dc.date.available	2014-12-12T01:22:57Z	-
dc.date.issued	2009	en_US
dc.identifier.uri	http://140.113.39.130/cdrfb3/record/nctu/#GT079323804	en_US
dc.identifier.uri	http://hdl.handle.net/11536/40583	-
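dc.description.abstract	As the volume of textual documents grows rapidly, document clustering can be used to manage these large collections effectively and to support later retrieval and browsing. To improve clustering quality, researchers in recent years have applied frequent itemsets generated by association rule mining to document clustering, addressing common problems such as high-dimensional vocabulary, execution performance, clustering accuracy, and the automatic generation of meaningful cluster labels. However, association rule mining tends to overlook key terms that are important but occur infrequently; moreover, when items are highly correlated, an excessive number of frequent itemsets is produced, which makes clustering take too long. This study therefore proposes three document clustering methods based on fuzzy theory and frequent itemsets. They use the fuzzy frequent itemsets produced by fuzzy association rule mining to reduce term dimensionality effectively, and they classify each term as a high-frequency, mid-frequency, or low-frequency term according to its distribution and occurrence frequency in the document collection. First, we propose the Fuzzy Frequent Itemset-based Hierarchical Document Clustering (F^2IHC) method, which uses fuzzy association rule mining to find associations among key terms, generates candidate clusters from fuzzy frequent itemsets, and clusters documents by computing the similarity between each document and the candidate clusters. The clustering result is presented as a hierarchical cluster tree, so that the resulting clusters are easy to browse. Second, to label clusters automatically with conceptual terms, we propose the Fuzzy Frequent Itemset-based Document Clustering (F^2IDC) method, which uses WordNet to explore the semantic relations among key terms and adds the corresponding hypernyms from WordNet to the documents, so that conceptual cluster labels can be extracted to represent cluster topics. Third, we propose the Fuzzy Frequent Itemset-based Soft Clustering (F^2ISC) method, which extends F^2IDC and adopts the α-cut method from fuzzy theory so that a document can be assigned to one or more clusters. Because fuzzy frequent itemsets reduce term dimensionality, and because the number of fuzzy frequent itemsets generated does not grow with the number of documents, the methods can be applied effectively to clustering large document collections. Compared with conventional clustering methods, the experimental results show that the proposed methods effectively improve the accuracy and performance of document clustering.	zh_TW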
dc.description.abstract	With the rapid growth of text documents, document clustering techniques have emerged to support efficient document retrieval and better document browsing. Recently, several methods have been proposed that use frequent itemsets derived from association rule mining to cluster documents, addressing the problems of high dimensionality, scalability, accuracy, and meaningful cluster labels. However, two issues remain when association rule mining is used: (1) important but sparse key terms may be obscured; (2) too many itemsets are produced, especially when items in the dataset are highly correlated. Moreover, frequent itemset-based clustering methods usually need a lot of time to generate the large number of itemsets. To address these two issues, we present three fuzzy frequent itemset-based document clustering approaches that use fuzzy association rule mining to provide significant dimensionality reduction over interesting fuzzy frequent itemsets. By applying fuzzy association rule mining, each term in the document dataset is labeled with a linguistic term such as Low, Mid, or High. First, we propose the Fuzzy Frequent Itemset-based Hierarchical Document Clustering (F2IHC) approach, which employs fuzzy set theory for document representation to find suitable fuzzy frequent itemsets for clustering documents. In addition, F2IHC constructs a hierarchical cluster tree to provide flexible browsing. Second, in order to label clusters with conceptual terms, we present the Fuzzy Frequent Itemset-based Document Clustering (F2IDC) approach, which uses WordNet as background knowledge to explore better ways of representing documents semantically for clustering. F2IDC dynamically derives a hierarchical organization of hypernymy from WordNet based on the content of each document, without the use of training data or standard clustering techniques. Third, we propose the Fuzzy Frequent Itemset-based Soft Clustering (F2ISC) approach, which extends F2IDC to allow overlapping clusters. F2ISC provides an accurate measure of confidence and adopts the α-cut concept to assign each document to one or more clusters. In all three approaches, the interesting fuzzy frequent itemsets are used to reduce the dimensionality of term vectors. In addition, the number of these itemsets does not increase as the number of documents grows. Hence, our approaches perform better on large document collections. Our experimental results show that the proposed F2IHC, F2IDC, and F2ISC approaches provide more accurate clustering results than prior influential clustering methods presented in recent literature.	en_US
dc.language.iso	en_US	en_US
dc.subject	文件分群	zh_TW
dc.subject	文字探勘	zh_TW
dc.subject	關聯規則探勘	zh_TW
dc.subject	高頻項目集	zh_TW
dc.subject	模糊集合理論	zh_TW
dc.subject	WordNet	zh_TW
dc.subject	Document Clustering	en_US
dc.subject	Text Mining	en_US
dc.subject	Association Rule Mining	en_US
dc.subject	Frequent Itemsets	en_US
dc.subject	Fuzzy Set Theory	en_US
dc.subject	WordNet	en_US
dc.title	以模糊理論與高頻項目集為基礎之文件分群研究	zh_TW
dc.title	Fuzzy Frequent Itemset-based Textual Document Clustering	en_US
dc.type	Thesis	en_US
dc.contributor.department	資訊科學與工程研究所	zh_TW
Appears in Collections:	Thesis
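
The abstract states that F2ISC uses the α-cut concept to assign each document to one or more clusters. The record does not give the thesis's actual procedure, so the following is only a minimal sketch of a generic α-cut over fuzzy membership degrees. The function name alpha_cut_assign, the cluster labels, the membership values, and the fallback to the single best cluster are all assumptions made for illustration, not details taken from the thesis.

```python
def alpha_cut_assign(memberships, alpha):
    """Return the clusters whose membership degree is at least alpha.

    memberships: dict mapping a cluster label to a degree in [0, 1]
                 (hypothetical values; the thesis computes its own).
    alpha:       threshold of the alpha-cut.

    Assumption of this sketch: if no cluster reaches alpha, fall back to
    the single cluster with the highest degree, so that every document
    still belongs to at least one cluster.
    """
    selected = sorted(c for c, mu in memberships.items() if mu >= alpha)
    if not selected:
        selected = [max(memberships, key=memberships.get)]
    return selected


# Hypothetical membership degrees of one document in three clusters.
doc_memberships = {"finance": 0.72, "trade": 0.55, "sports": 0.10}
print(alpha_cut_assign(doc_memberships, alpha=0.5))  # ['finance', 'trade']
```

Raising alpha makes the assignment stricter (fewer overlapping clusters), while lowering it lets a document appear in more clusters.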

Files in This Item:

380401.pdf
