標題: | 在未標號圖形資料中高效率的圖形探勘演算法 An Efficient Algorithm for Mining Frequent Unlabeled Graphs |
作者: | 李銘 Lee, Ming 李素瑛 Lee, Suh-Yin 資訊科學與工程研究所 |
關鍵字: | 圖型探勘;資料探勘;Graph mining;Data mining;Frequent Pattern mining |
公開日期: | 2009 |
摘要: | 近年來頻繁樣式探勘由頻繁項目集探勘及循序樣式探勘逐漸發展至探勘更具結構性之資料,例如樹、絡及圖形等資料。圖形可用於表示許多複雜的資料及資料間的關係,因此被廣泛應用於許多領域,如化學資訊學、生物資訊學及網路探索等等。由於圖形資料的高複雜度,其子圖的個數呈指數成長,如何有效率的探勘大型圖形樣式是一項重大的挑戰。在過去十年間,發展出了許多圖形探勘演算法,我們研究目前已知最好的演算法 gSpan ,希望能更進一步改善其效率。gSpan 採用 right-most 延伸法來產生候選圖形,與傳統方法比較,其產生之重覆的候選圖形數目顯著減少。但由於 right-most 延伸法原本是用於有根樹的探勘演算法,以此為基礎用於圖形探勘,仍會產生許多重覆的候選圖形。此情況造成gSpan的效能降低,尤其在針對標記種類較少、或甚至未標號的圖形做處理時,效能的降低更為顯著。因此我們修正此延伸法以進一步減少候選圖形個數,並將之用於未標號圖形資料中。我們對合成資料以及實際之化學化合資料進行實驗,驗證了所提演算法之正確性,並且有效的減少了候選圖形的個數。 In recent years, academic research on pattern discovery progressed from mining frequent itemsets and sequences to mining structured patterns including trees, lattices, and graphs. Among them, graphs serve as a general model to represent data and have consequently been widely and extensively used in many domains like cheminformatics, bioinformatics, web exploration, etc. However, mining large graph patterns effectively and efficiently is challenging due to the presence of an exponential number of frequent subgraphs and the complexity of graph data. In response to the dilemma, several graph miners, such as gSpan, have been published in the last decade and all of these newly developed graph miners work on undirected labeled simple graph. By examining the state-of-the-art pattern growth algorithm gSpan, which uses a right-most extension method, we learn that it dramatically reduces the number of candidate graphs, compared with traditional join method. Nevertheless, right-most extension was designed originally for rooted trees. It still generates many duplicate graphs when being applied to a free graph. This problem causes remarkable performance degradation when the graphs are denser and with fewer labels available, especially when the graphs are unlabeled. We propose a modified extension method to further decrease the number of duplicate graphs, and apply it to unlabeled graph datasets, hoping to make gSpan an even better miner than it already is. We present the results on mining both synthetic datasets and a chemical compound dataset. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT079555584 http://hdl.handle.net/11536/41417 |
顯示於類別: | 畢業論文 |