標題: 高效益循序資料樣式探勘之研究
A Study on Efficient Mining of High Utility Sequential Patterns
作者: 王濬哲
黃俊龍
陳以錚
Wang, Jun-Zhe
Huang, Jiun-Long
Chen, Yi-Cheng
資訊科學與工程研究所
關鍵字: 高效益資料探勘;高效益循序資料探勘;前k高效益循序資料探勘;漸進式高效益循序資料探勘;High utility sequential pattern;high utility sequential pattern mining;top-k high utility sequential pattern mining;incremental high utility sequential pattern mining
公開日期: 2016
摘要: 循序資料樣式探勘(Sequential Pattern Mining)是資料探勘研究中一個非常重要且根本的研究議題,過去在這個議題上,有許多演算法被設計,目的是在資料庫中找出所有常出現的循序資料樣式,而這個議題亦在現實生活中有許多廣泛的應用。然而,常出現的循序資料樣式可能對使用者或企業較不具有價值,相對的,較罕見的高報酬,高成本,高評價,高風險 等具獨特性的循序資料樣式可能較為使用者或企業所重視。 有鑑於此,近年來,高效益循序資料樣式探勘(High Utility Sequential Pattern Mining)議題被提出,其目的是在序列資料庫中,找出高效益的循序資料樣式。有別於傳統的循序資料樣式探勘,在序列資料庫中,每個項目(item)會有一個效益值,可能代表利潤,成本...等可被量化的特性,表示使用者對此項目的重視程度。然而此問題並不容易解決,主要是由於(1)序列的效益值(sequence utility)不服從downward closure property,以及(2)某序列在其超序列中,可能含有多個案例,每個案例都會有其效益值。針對(1),已有序列加權效益(Sequence Weight Utility)被提出並且被應用到少數演算法中,進行搜尋空間中,候選序列的過濾(pruning),然而由於一個序列的序列加權效益通常遠大於其與其超序列的序列效益值,可能會造成大量不具高效益的序列被產生出,也因而造成演算法效能低落。然而針對(2),現今所有演算法都是將所有案例的效益值記下,如此一來,在探勘過程,可能會累積龐大的案例與效益值對,消耗龐大的系統記憶體資源。為了解決上述兩個問題,本論文在第一個研究主題上,提出了Prefix Extension Utility (PEU)與Reduced Sequence Utility (RSU)兩個較小的序列效益值上限來解決問題(1), 並且提出了Utility-Chain的資料結構來解決問題(2),同時可以加快效益值與序列效益上限的計算, 重要的是,我們結合了兩個序列效益上限與Utility-Chain,設計出HUS-Span演算法來探勘高效益循序資料樣式。實驗結果顯示,HUS-Span較既有方法在效能與記憶體資源使用上佔有優勢。 此外,由於設定門檻值對於較不具經驗的使用者,可能會造成困擾,而設定太高或太低的門檻,均可能會造成過少或過多的循序資料樣式產生。有鑑於此,已有Top-K高效益循序資料樣式探勘議題被提出,而在此問題上,既有方法基本上採用深度優先搜尋策略(Depth-First Search, DFS)來探勘Top-k樣式, 雖然深度優先搜尋較省記憶體,但其可能無法在探勘的過程中,較快找出Top-K樣式。有鑑於此,本論文在此議題上,提出了一個最佳優先搜尋策略(Best-First Search, BFS)的演算法,稱為TKHUS-SpanBFS,其結合PEU,RSU與Utility-Chain,並且永遠往搜尋空間中,所有已被產生但未被探索過,具有最大PEU的序列探索。實驗結果顯示,此演算法對於DFS演算法,具有相當顯著的效能優勢, 然而,最佳優先搜尋演算法可能會耗費大量的記憶體資源,因此,本論文在限制BFS的序列數目下,提出一個結合BFS與DFS的演算法,稱為TKHUS-SpanHybrid,以取得記憶資源與效能之間的平衡。 最後, 由於在現實生活中,新的資料往往隨時輸入資料庫。然而,每次資料庫更新後,便重新探勘資料庫以找出高效益循序資料樣式,尤其在更新的序列數目不多時,會顯得非常沒效率。因此,本論文的最後一個研究主題針對隨時累積的序列資料庫,進行漸進式探勘。為了減少多餘的重複運算,同時避免耗盡記憶體空間,我們提出了一個更小的序列效益上限:Tight Sequence Utility (TSU),來緩衝前次探勘結果。確切地說,我們設計一個樹狀結構緩衝舊資料庫中的高TSU序列,其優點是能減少緩衝的序列數目,並且結合Utility-Chain來設計每個序列所需儲存的資訊。針對需要更新的序列,我們設計了較有效率的節點更新方式,包含效益與效益上限等。同時,為了加速探勘效率,我們設計了資料庫掃瞄範圍減少策略,並且更新樹上相對應的節點資訊,以便支援後續的資料庫更新。結合上述幾種策略,我們提出了IncUsp-Miner演算法,在序列資料庫中進行漸進探勘高效益循序資料樣式。實驗結果顯示,在更新序列數目為原資料庫序列數目一半以下的情況下,IncUsp-Miner演算法可以有效率地探勘高效益循序資料樣式。
Sequential pattern mining is a fundamental research issue in data mining, which aims to find out all of the frequent subsequences in a sequence database. So far, many algorithms have been proposed to address this problem, and it has wide real world applications. However, frequent sequential patterns may not be informative to some users or business. By contrast, they may be interested in some rare patterns with high revenue, cost, etc. In view of this, the concept of high utility sequential pattern mining has been introduced recently, which refers to identify sequences with high utilities (e.g., profits) but probably with low frequencies in a sequence database. To identify high utility sequential patterns, due to lack of downward closure property in this problem, most existing algorithms first generate candidate sequences with high Sequence-Weighted Utilities (SWUs), which is an upper bound of the utilities of a sequence and all its supersequences, and then calculate the actual utilities of these candidates. This causes a large number of candidates generated since SWU is usually much larger than the real utilities of a sequence and all its supersequences. In view of this, we propose two tight utility upper bounds, Prefix Extension Utility (PEU) and Reduced Sequence Utility (RSU), as well as two companion pruning strategies, and devise HUS-Span algorithm to identify high utility sequential patterns by employing these two pruning strategies. Experimental results on some real and synthetic datasets show that HUS-Span is able to generate less candidate sequences, and thus outperforms other prior algorithms in terms of mining efficiency. In addition, since setting a proper utility threshold is usually difficult for users, we also propose algorithm TKHUS-Span to identify top-$k$ high utility sequential patterns by using these two pruning strategies. Three searching strategies, guided Depth-First Search (GDFS), Best-First Search (BFS), and hybrid search of BFS and GDFS, are also proposed to improve the efficiency of TKHUS-Span. Experimental results on some real and synthetic datasets show that TKHUS-Span with strategy BFS is able to explore less candidate sequences, and thus outperforms other prior algorithms in terms of mining efficiency. In practice, most sequence databases usually grow over time, and it is inefficient for existing algorithms to mine HUSPs from scratch when databases grow with a small portion of updates. In view of this, we propose the IncUSP-Miner algorithm to mine HUSPs incrementally. Specifically, to avoid redundant re-computations, we propose a tighter upper bound of the utility of a sequence, called Tight Sequence Utility (TSU), and then design a novel data structure, called the candidate pattern tree, to buffer the sequences whose TSU values are greater than or equal to the minimum utility threshold in the original database. Accordingly, to avoid keeping a huge amount of utility information for each sequence, a set of concise utility information is designed to be kept in each tree node. Moreover, several strategies are also proposed to reduce the amount of computation for utility update and the scopes of database scans, thereby improving the mining efficiency. Experimental results on some real and synthetic datasets show that IncUSP-Miner is able to efficiently mine high utility sequential patterns incrementally.
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT079855846
http://hdl.handle.net/11536/140094
顯示於類別:畢業論文