標題: 時間間隔循序樣式與時空共同發生樣式之探勘與應用研究
Research on Interval-Based Sequence Pattern and Spatial-Temporal Co-Occurrence Pattern Mining and Implement
作者: 李素瑛
LEE SUH-YIN
國立交通大學資訊工程學系(所)
關鍵字: 循序樣式探勘;時間間隔循序樣式探勘;資料串流樣式探勘;時空資料探勘;共同發生樣式探勘
公開日期: 2011
摘要: 循序特徵樣式因其廣泛的實用性,在資料探勘的領域中一直扮演著非常重要的角色,藉著探勘出存在的時間順序對很多實務有很大的幫助。現今的演算法大都只考慮時間點(time point)的循序樣式探勘,並無時間的持續概念,處理時間間隔(time interval)的相關演算法與應用領域一直為學者專家所忽略。本計畫主要是著墨在處理時間間隔相關問題,將研究如何探勘以時間間隔為基礎的循序樣式,設計探勘演算法與研究應用領域,並討論所產生的相關議題,內容分述如下: 一、時間間隔為基礎的循序樣式探勘 我們將依下列兩個方向來設計與研究以時間間隔為基礎的循序樣式探勘: 1. 一般循序樣式探勘 連續時間序列的處理 本計畫所要處理的資料可以分為兩種,連續型或離散型。如所要處理的資料為離散型時間間隔序列(event sequences),則不需作任何的先前資料處理;但若所要處理的資料是屬於連續型的時間序列(time series),我們利用機率中常使用的最小平方迴歸線(least-squares regression line)方法,將原先的連續型時間序列資料轉換為離散型的事件序列(event sequence),以便後續時間間隔特徵樣式的探勘。 時間間隔關係的研究 Allen relation是目前表示時間間隔彼此關係最常用的方法,本計利用切割與反切割(incision and un-incision)的方法,將重疊的時間事件切割與組合成同一個同時片段(coincidence),有效地降低了搜尋空間的複雜度。但在作反切割的片段復原(un-cutting)時需要紀錄一些額外的資訊及處理。 時間間隔循序樣式的表示方法 表示方法(representation)是在處理時間間隔序列時,最基礎也是最重要的問題。對以時間間隔為基礎的循序樣式而言,單純依照發生時間的所排序出的前後關係,是無法表示出一個完整的時間間隔循序樣式係。目前的幾種表示方法,都有其相關的優缺點,本計畫冀望能整合目前幾種表示法的優點,在有效利用儲存空間的前提之下,配合切割方法,找出一個能表達樣式中時間間隔彼此的關係,並且不會產生混淆問題(ambiguous problem)的表示方法。 探勘時間間隔循序樣式的演算法 目前較少學者專家將心力投入在時間間隔資料的處理,主因是處理資料時的複雜度太高。我們設計了一個pattern-growth演算法,不需產生候選樣式(candidate)且只需兩次資料庫掃描,即可遞迴地找出所有的頻繁特徵樣式。能以少量的記憶體空間動態維護頻繁時間間隔循序樣式。透過配置與掃描此資料結構,本計畫設計出一個能有效率地從大型的事件序列資料庫中,正確地探勘出頻繁的時間間隔循序樣式,與相關延伸變化的樣式探勘演算法。 2. 表 C011 共 4 頁 第 2 頁 支持度的定義與研究 支持度在資料探勘的領域中是最基礎的一部分,在探勘循序特徵樣式時,不論是以時間點(time point)或是時間間隔(time interval)為基礎,都需要先行定義支持度如何計算。不同的支持度計算方式有不同的優缺點,所適合的應用領域也不盡相同,我們將提出新的支持度計算方法,綜合目前最常使用的兩種計算方式:發生支持度(occurrence support)與時間支持度(temporal support),將兩種方法的優點整合起來,冀望能更廣泛地應用在實作領域上。 二、資料串流上的時間間隔循序樣式探勘 目前並無任何在資料串流環境中,探勘時間間隔循序樣式的研究,主要是因為資料串流的許多限制,增加了探勘的複雜性。我們提出了兩個以切割與反切割為基礎的演算法,利用序列對齊、分群、與前綴樹搜尋的方法,找出近似的頻繁區段循序樣式。在序列對齊時,本計畫提出了固定對齊與動態對齊兩種方法,皆能有效且正確地將輸入序列累加起來。提出的演算法只需掃描一次串流資料,並且能在記憶體使用量的限制下,快速地探勘出近似頻繁時間間隔循序樣式。 三、時空資料上共同樣式與時間間隔樣式的探勘 因應用領域廣泛,探勘時空資料共同發生樣式(co-occurrence patterns)是目前越來越受矚目的研究議題。目前探勘時空資料樣式的相關研究仍多針對單一時間或空間屬性進行探勘,同時探討時間與空間屬性的研究則多在空間屬性上著墨。因此,本計劃將探討於大型時空資料庫中,探勘時間與空間屬性相依的測量方法、離散型共同發生樣式探勘、連續型共同發生樣式探勘、以及漸增式維護已探勘之時空資料頻繁樣式等問題。
Sequential pattern mining is a main topic in data mining domain, due to its widespread applicability. These applications always consider order relation and time issue in our daily lives. Existing sequential pattern mining assumes that events occur at time points and do not consider duration. However, events in many real world applications usually have durations, and the relationships among these events are often complex. Previous researches on mining sequential patterns mainly are focused on discovering patterns from time point-based event data. By comprehensible observation, we perceive that time point-based problem is just a special case in the time interval-based problem (where start time is identical to end time), but not vise versa. Our research mainly focuses on discovering frequent temporal interval-based patterns; we will also design new efficient algorithms and implement application systems. The details of this project are organized as follows: 1. Sequential pattern mining based on time intervals There are two types of data - continuous data and discrete data. If the data is discrete, we can process the database and mining pattern directly without any data preprocessing. On the contrary, we perform data preprocessing if the data is continuous time series. We investigate methods to transform the continuous time series into appropriate discrete event sequences which we can utilize to mine time interval patterns. The support plays an important role in the data mining. We should define the support beforehand, whether patterns are time point-based or time interval-based. The different support definitions on the basis of the various applications have their own strengths and weaknesses. In recent years, there are two general support definitions - occurrence support and temporal support calculation. We will propose a new support definition which combins the merits of occurrence support and temporal support. We will test the validity of measure using comprehensive datasets in various application domains. Allen relation is the most widespread way to present the relations among the time intervals. We will perform our algorithm – cutting and un-cutting mechanisms which can simplify the problems and present the relations among the time intervals effectively. In cutting step, we cut the time events which overlap the others into the same time segments for the sake of reducing the space complexity. Furthermore, we have to record additional information when we perform un-cutting. Our new approach can be applied efficiently and effectively in the various data stream applications and accurately discover frequent time interval patterns under several constraints on data stream. Representation is the most important and essential issue when dealing with the time interval sequence. However, it is impossible to represent the complete time interval sequential patterns only by dealing with the relations “before” and “after”, due to ambiguity caused by relation “overlap”. 表 C011 共 4 頁 第 4 頁 The state-of-the-art representations have their own strengths and weaknesses in both comprehensible expressions and memory usages. We will discover a new representation to present time intervals without generating the ambiguous problem and still on the premise that perform the efficiency of memory usage. Due to high complexity of processing time interval, few research are focused on mining interval-based temporal pattern. We proposed a new data structure, called Augmented Temporal Pattern Tree, ATP-tree which can dynamically maintain the frequent time interval patterns by using a small amount of memory space. We propose a new algorithm with depositing and scanning the ATP-tree structure. Proposed algorithm can accurately and efficiently explore the frequent time interval-based sequential patterns from a large event sequence database. 2. Time interval sequential pattern mining in data stream In recent years, various data mining techniques have been explored in the literature. One of the most popular data mining domains is stream mining. A data stream is a massive, open-ended sequence of data elements continuously generated at a rapid rate. Mining time interval-based temporal patterns is complicated and difficult in the data stream environment. Hence, we propose a new algorithm based on cutting and un-cutting mechanisms. Sequence alignment, clustering, and prefix-tree searching are main schemes for our approximate frequent pattern mining algorithm. When considering sequence alignment, we propose the static alignment and dynamic alignment methods which efficiently and accurately accumulate the sequences. We will only scan the data stream once, and discover frequent interval patterns under the constraints of memory usages. 3. Discover co-occurrence patterns and time interval patterns in spatial-temporal database. Because of the comprehensive application, mining co-occurrence patterns in spatial-temporal database have received a lot of attentions. However, existing works typically ignore the temporal aspect and suffer from certain efficiency problems. Hence, we study the problems of discovering the relations between time and space in large spatial-temporal databases. Furthermore, we also analyze the problems of mining the co-occurrence patterns in both discrete and continuous data, and efficiently maintaining the discovered spatial-temporal patterns.
官方說明文件#: NSC99-2221-E009-128-MY2
URI: http://hdl.handle.net/11536/99166
https://www.grb.gov.tw/search/planDetail?id=2225327&docId=357014
顯示於類別:研究計畫