標題: 在大型資料庫中探勘容錯頻繁樣式
Mining Fault-Tolerant Frequent Patterns in Large Databases
作者: 王聖舜
Sheng-Shun Wang
李素瑛
Suh-Yin Lee
資訊科學與工程研究所
關鍵字: 資料探勘;容錯頻繁樣式;Data mining;Fault tolerant frequent pattern;FT-contain;support
公開日期: 2001
摘要: 有鑑於現實世界中的資料來源可能會因受到雜訊的干擾而導致資料的不正確。此外, 我們可能希望所找出來的資訊是較具普遍性。因此, FT-Aprori 用來解決在大型現實資料庫中容錯頻繁樣式(fault-tolerant frequent pattern)的探勘。然而,FT-Apriori是根據Apriori 性質採用候選樣式產生及測試(candidate generating-and-testing)方式,因此較無效率。 在本篇論文中,我們採用樣式成長(pattern growth)的方式發展較有效率的容錯頻繁樣式探勘演算法FTP-mine。首先我們考慮所有的資料都能儲存在記憶體的FTP-mine,我們設計一個STable用來計算具有相同字首樣式的item support 和 FT-support. 對於無法將資料儲存於記憶體的大型資料庫, FTP-mine 利用切割資料庫的方式來探勘容錯頻繁樣式. 此外,考慮到所找出來的容錯頻繁樣式可能會很多而且會有互相包含的情況,因此我們延伸FTP-mine 找出最大的容錯頻繁樣式。 在本論文中,我們針對不同的support、tolerance和scalability設計各種實驗,從結果可以看出FTP-mine在各種條件下都比FT-Apriori有較好的執行效率和linear scalability。
In view of real world data may be interfered with noise which leads data to contain faults. The data mining methods proposed previously may not be applicable. Besides, we may hope that the knowledge discovered is more general and can be applied to find more interesting information. Hence, FT-Aprori was proposed for fault-tolerant data mining to discover information over large real-world data. However, FT-Apriori which generates and tests candidates based on Apriori property is not so efficient. In this paper, we develop memory-based algorithm FTP-mine which is based on the concept of pattern growth to mine fault-tolerant frequent patterns efficiently. In FTP-mine, the table, STable, is designed to count the item support and FT-support of the k-length patterns which have the same prefix of length k-1 by comparing transaction once. As to mining in a large database which is too large to fit in memory, FTP-mine also can be adopted by means of database partition. In addition, since there might exist a large number of fault tolerant frequent patterns and some may be contained in others, we also focus on the finding of maximal FT-frequent patterns by extending the FTP-mine algorithm. Our study shows that FTP-mine has higher performance than FT-Apriori in all kinds of parameter settings, such as various supports, tolerance, and scalability. The empirical evaluations show that the proposed method has good linear scalability and outperforms FT-Apriori in the discovery of FT-frequent pattern.
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT900392051
http://hdl.handle.net/11536/68465
Appears in Collections:Thesis