標題: | 在串流資料中高效率頻繁樣式探勘演算法之研究 A Study of Efficient Mining Algorithms of Frequent Patterns on Data Streams |
作者: | 李華富 Hua-Fu Li 李素瑛 Suh-Yin Lee 資訊科學與工程研究所 |
關鍵字: | 資料探勘;串流資料;頻繁項目集合;變化探勘;路徑瀏覽樣式;Data mining;Data streams;Frequent itemsets;Change mining;Path traversal patterns |
公開日期: | 2005 |
摘要: | 資料串流探勘是一個快速成長的新興研究領域,但同時也帶來了新的挑戰。在眾多資料串流探勘的研究中,頻繁樣式探勘與變化探勘一直是資料串流探勘中重要的研究焦點。本論文主旨在於研發高效率的頻繁項目集合、路徑瀏覽樣式,以及項目變化樣式的單次線上掃描探勘方法。
頻繁項目集合是本論文所探討的第一個研究主題。首先,針對串流資料的標的物模型,我們提出一個可快速地探勘出所有頻繁項目集合的單次掃描演算法DSM-FI。為了避免將每一筆新進交易中所有項目集組合都窮舉出來,我們設計了一個有效的交易投影機制,並提出一個新的摘要字首樹狀結構來儲存必要的集合項目資訊。此演算法在探勘出頻繁項目集合的同時,也可找出最大頻繁項目集合。
我們也針對串流資料的滑動窗模型提出了兩個單次頻繁項目集合探勘演算法MFI-TransSW與MFI-TimeSW,可有效地在交易感知滑動窗模型以及時間感知滑動窗模型中探勘出目前存在的頻繁項目集合。MFI-TransSW與MFI-TimeSW演算法主要是利用位元向量的特性來儲存單一項目在目前滑動窗中的出現位置,並利用位元向量的特性來達到快速滑動的效果。
路徑瀏覽樣式是本論文所探討的第二個主題。我們提出可快速探勘出路徑瀏覽樣式的單次掃描演算法DSM-PLW,此演算法沿用DSM-FI的精神,利用快速拆解使用者瀏覽路徑,以及字首樹狀結構的特性,來達到串流資料探勘的效能要求。此外,我們進一步提出一個不需要使用者輸入最小支持度門檻值,就可以進行的Top-K路徑瀏覽樣式的單次掃描探勘演算法。
串流資料變化探勘是本論文的第三個研究主題。我們提出MFC-append演算法來找出在兩條線上交易資料串流中,穩定分布的交易項目、常常變動的項目,或無一定分佈的變化樣式。此外,針對可執行刪除運算的動態資料串流,我們提出一個以MFC-append為基礎的演算法MFC-dynamic,來探勘動態資料串流中的項目變化。
我們進行了相關的實驗以評估所提方法的效能。在我們的實驗範圍中的結果顯示,對於各個不同探勘參數以及不同特性的資料集,我們的方法都優於許多著名的方法。此外,針對資料量擴充的實驗也顯示出我們所提出的探勘頻繁樣式的方法具有線性的擴充能力。 Online mining of data streams is an important data mining problem with broad applications. However, it is a difficult problem since the streaming data possess some specific characteristics, such as unknown or unbounded length, possibly very fast arrival rate, inability to backtrack over previously arrived transactions, and a lack of system control over the order in which data arrives. Among various objectives of data stream mining, the mining of frequent patterns in data streams has been the focus of knowledge discovery. In this dissertation, the design of several core technologies for mining frequent patterns and changes of data streams is investigated. For mining of frequent itemsets over data streams with a landmark window, we propose the DSM-FI (Data Stream Mining for Frequent Itemsets) algorithm to find the set of all frequent itemsets over the entire history of the data streams. An effective projection method is used in the proposed algorithm to extract the essential information from each incoming transaction of the data streams. A data structure based on prefix tree is constructed to store data summary. DSM-FI utilizes a top-down pattern selection approach to find the complete set of frequent itemsets. Experiments show that DSM-FI outperforms BTS (Buffer-Trie-SetGen), a state-of-the-art single-pass algorithm, by one order of magnitude for discovering the set of all frequent itemsets over a landmark window of data streams. For mining of frequent itemsets in data streams with a sliding window, efficient bit vector based algorithms are proposed. Two kinds of sliding windows, i.e., transaction-sensitive sliding window and time-sensitive sliding window, are discussed. MFI-TransSW (Mining Frequent Itemsets over a Transaction-sensitive Sliding Window) is developed to mine the set frequent itemsets over data streams with a transaction-sensitive sliding window. A single-pass algorithm, called MFI-TimeSW (Mining Frequent Itemsets over a Time-sensitive Sliding Window), based on MFI-TransSW algorithm and a dynamic encoding method is proposed to mine the set of frequent itemsets in a time-sensitive sliding window. An effective bit-sequence representation of items is used in the proposed algorithms to reduce the time and memory needed to slide the windows. Experiments show that the proposed algorithms not only attain highly accurate mining results, but also run significantly faster and consume less memory than existing algorithms for mining recent frequent itemsets over data streams. For mining changes of items across two data streams, we propose two one-pass algorithms, called MFC-append (Mining Frequency Changes of append-only data streams) and MFC-dynamic (Mining Frequency Changes of dynamic data streams), to mine the set of frequent frequency changed items, vibrated frequency changed items, and stable frequency changed items across two continuous append-only and dynamic data streams, respectively. A new summary data structure, called Change-Sketch, is developed to compute the frequency changes between two data streams as fast as possible. Theoretical analysis and experimental results show that our algorithms meet the major performance requirements, namely single-pass, bounded space requirement, and real-time computing, in mining data streams. Mining path traversal patterns from Web click streams is important in Web usage mining and Web user profiling. One of the most important We proposed two single-pass algorithms, called DSM-PLW (Data Stream Mining for Path traversal patterns in a Landmark Window) and DSM-TKP (Data Stream Mining for Top-K Path traversal patterns), to discover the path traversal patterns over Web click-streams with and without a user-defined minimum support constraint. Experiments of real data show that both algorithms successfully mine maximal reference sequences with linear scalability. Comprehensive experiments have been conducted to assess the performance of the proposed algorithms. The empirical results show that these algorithms outperform the state-of-the-art algorithms with respect to various mining parameters and datasets of different characteristics. The scale-up experiments also verify that our algorithms successfully mine frequent patterns with good linear scalability. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT009017811 http://hdl.handle.net/11536/81736 |
顯示於類別: | 畢業論文 |