標題: 有效率之關聯規則勘測與循序樣式勘測方法
Efficient Algorithms for Association Rule Mining and Sequential Pattern Mining
作者: 林明言
Ming-Yen Lin
李素瑛
Suh-Yin Lee
資訊科學與工程研究所
關鍵字: 資料探勘;關聯規則;循序樣式;時間限制;互動式序列勘測;漸進式更新;data mining;association rules;sequential patterns;time constraints;interactive sequence mining;incremental update
公開日期: 2003
摘要: 自動化與電腦的應用,讓資料收集無時不刻、隨時隨地的進行,也造成資料大量且迅速增 長。隱含於這些巨量資料中的豐富資訊,吸引各領域的學者研發各種粹取其中有用知識的 方法。在眾多資料探勘的目標中,頻繁樣式的勘測一直是資料庫中知識挖掘的研究焦點。 本論文主旨在於研發有效率的關聯規則及循序樣式探勘方法。 首先,我們提出LexMiner 演算法以找出推演關聯規則的頻繁項目集。為了免除hash-tree 擺置可能頻繁項目集時的缺點,有些方法將可能項目集依項目的prefix-order 擺放 。LexMiner 進一步利用項目集的字典序特性與字典式比較以加速探勘演算法中的核心運算 —尋找交易紀錄中包含之可能頻繁項目集。 探勘循序樣式是本論文所探討的第二個主題。我們提出一個記憶體索引的方法,稱為 MEMISP,利用「尋找再索引」的技巧來快速探勘循序樣式。無論資料庫大小、無論支持度 多小,MEMISP 最多僅需檢視資料庫兩回合即可完成探勘。MEMISP優於其他方法的因素在於 不產生可能樣式、也不產生暫時的中間資料庫。 探勘具有時間限制的循序樣式,包括時間差與滑動時間窗,可以強化結果的精確性。過去 僅有Apriori 架構可以解決此問題。近來許多研究顯示pattern-growth方法可以有效改善 探勘速度。因此,我們提出 DELISP 演算法,在pattern-growth方法論下,利用divide- and-conquer 策略,整合限制於子資料庫投射,更有效率的完成具時間限制循序樣式之探 勘。 知識挖掘原就是一種挖掘、檢視、再挖掘反覆進行互動的過程。如何減少使用者找到合意 結果之交談過程中的反應時間相當重要。我們所提KISP 演算法利用進行過程中所得的資訊 ,累積計數的資訊以促成有效率之樣式計數運算,並加速整個互動式序列探勘程序。 目前循序樣式的探勘往往假設勘測的資料庫是不變動的。然而,資料庫會有資料更新變動 ,以致過去找出的樣式會變成無效或新樣式可能會產生。本論文所提的IncSP方法不需因資 料變動而整個從頭開始重新探勘。我們透過隱含式合併與對新增序列有效率分開計數,將 過去的樣式漸進式地更新。 我們進行了大規模完整的實驗以評估所提各方法的效能。在我們的實驗範圍中,結果顯示 ,對於各個不同探勘參數及不同特性的資料集,我們的方法都優於許多著名的方法。針對 資料量擴充的實驗也顯示我們探勘頻繁樣式的方法具有線性擴充能力。
Tremendous amount of data being collected is increasing speedily by computerized applications around the world. Hidden in the vast data, the valuable information is attracting researchers of multiple disciplines to study effective approaches to derive useful knowledge from within. Among various data mining objectives, the mining of frequent patterns has been the focus of knowledge discovery in databases. This thesis aims to investigate efficient algorithms for mining frequent patterns including association rules and sequential patterns. We propose the LexMiner algorithm to deal with frequent item-set discovery for association rules. To alleviate the drawbacks of hash-tree placement of candidates, some algorithms store candidate patterns according to prefix-order of itemsets. LexMiner utilizes the lexicographic features and lexicographic comparisons to further speed up the kernel operation of mining algorithms. A memory indexing approach called MEMISP is proposed for fast sequential pattern mining using a find-then-index technique. MEMISP mines databases of any size, with respect to any support threshold, in just two passes of database scanning. MEMISP outperforms other algorithms in that neither candidate patterns nor intermediate databases are generated. Mining sequential patterns with time constraints, such as time gaps and sliding time-window, may reinforce the accuracy of mining results. However, the capabilities to mine the time-constrained patterns were previously available only within Apriori framework. Recent studies indicate that pattern- growth methodology could speed up sequence mining. We integrate the constraints into a divide-and-conquer strategy of sub-database projection and propose the pattern-growth based DELISP algorithm, which outperforms other algorithms in mining time-constrained sequential patterns. In practice, knowledge discovery is an iterative process. Thus, reducing the response time during user interactions for the desired outcome is crucial. The proposed KISP algorithm utilizes the knowledge acquired from individual mining process, accumulates the counting information to facilitate efficient counting of patterns, and accelerates the whole interactive sequence mining process. Current approaches for sequential pattern mining usually assume that the mining is performed with respect to a static sequence database. However, databases are not static due to update so that the discovered patterns might become invalid and new patterns could be created. Instead of re-mining from scratch, the proposed IncSP algorithm solves the incremental update problem through effective implicit merging and efficient separate counting over appended sequences. Patterns found in prior stages are incrementally updated rather than re-mining. Comprehensive experiments have been conducted to assess the performance of the proposed algorithms. The empirical results show that these algorithms outperform state-of-the-art algorithms with respect to various mining parameters and datasets of different characteristics. The scale-up experiments also verify that our algorithms successfully mine frequent patterns with good linear scalability.
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT008517815
http://hdl.handle.net/11536/70111
顯示於類別:畢業論文


文件中的檔案:

  1. 781501.pdf

若為 zip 檔案,請下載檔案解壓縮後,用瀏覽器開啟資料夾中的 index.html 瀏覽全文。