標題: | 一個提昇分類演算法探勘概念漂移資料效能之研究 A Study to Improve the Performance of Classification for Mining Concept-Drifting Data |
作者: | 蔡政容 Cheng-Jung Tsai 楊維邦 李建億 Wei-Pang Yang Chien-I Lee 資訊科學與工程研究所 |
關鍵字: | 資料探勘;分類;決策樹;資料串流探勘;概念漂移;概念漂移規則;離散化;多重屬性值;多重類別;Data mining;classification;decision tree;data stream mining;concept drift;concept-drifting rule;discretization;multi-valued;multi-labeled |
公開日期: | 2007 |
摘要: | 隨著資料數位化技術的快速發展,近年來資料探勘技術已被廣泛地運用,以從龐雜的資料中萃取出有用的資訊。資料探勘可細分為數個研究領域如關聯法則、分類、群集…等,其中分類技術對於預測未知的資料扮演著十分重要的角色並已成功的運用到現實生活中。分類的研究領域包含了許多重要的研究議題,如可調適性、不平衡資料集、合議分類器、關聯資料庫探勘、隱私權…等。因為目前許多日常生活中的資料是以接連不斷的資料區塊之方式呈現,學者愈來愈重視資料串流的探勘。已提出之探勘資料串流的方法,大部份都假設資料區塊是呈現平穩分佈。然而此假設是不合理的,因為資料的概念可能會隨著時間而改變。此種資料概念隨著時間遞延而改變的情形,便稱之為概念漂移。概念漂移的發生,使得利用資料串流建構分類器的工作變得更複雜。
目前已提出用來探勘具有概念漂移之資料串流的方法,共同的缺點之一是當資料串流十分穩定時會耗損許多不必要的系統成本:包括重建分類器的計算成本或紀錄具有類似區塊的記錄成本。此外,這些方法對於含有概念漂移樣本之資料串流皆不具有高敏銳度:這些方法只有在產生漂移的樣本達到一臨界值時才會發現概念漂移的產生;此外,這些方法皆無法解決雙向漂移的問題。對於某些即時的應用如電腦病毒偵測,一個具有高敏銳度的演算法是很重要的。因此,本論文提出SCRIPT演算法來處理具有概念漂移的資料串流。SCRIPT演算法對於具有概念漂移的資料串流可以建立更準確且更具敏感度的分類器。
此外,目前已提出能處理概念漂移資料的演算法,全都只著重在正確地更新原有的分類器以維持預測的準確性。但對使用者而言,其可能對於引起概念漂移的規則更感興趣。探勘概念漂移規則是一個十分有趣且實用的議題。例如:醫生會想了解引起疾病變化的主因、學者會想要知道氣候轉變的規則、或是決策者會想找出顧客購物習慣改變的因素...等。然而此一問題在過去卻被忽略掉。為了解決這個問題,本論文提出CDR-Tree演算法來探勘出造成概念漂移的規則。CDR-Tree演算法的另一個特點在於,當使用者需要檢視或運用分類器時,其能經由簡單的抽取程序快速且正確地產生資料的分類器而不須重新建構。
最後,為了減少CDR-Tree的建構時間並簡化其所產生的概念漂移規則,本論文針對離散化的問題進行探討。離散化是一個將數值屬性切割成多個有限區間的技術。離散化技術對於須處理大量資料和只能處理類別屬性的學習演算法扮演著十分重要的角色。過去的實驗結果顯示離散化技術不但能加速學習演算法,同時也能維持甚至改進學習演算法所建構出之分類器的準確度。然而,目前已知的離散化演算法皆只能處理單一屬性值和單一類別的資料,並無法離散CDR-Tree所使用的多重屬性值和多重類別的資料。因此,本論文提出了能處理多重屬性值和多重類別資料的離散化演算法OMMD。 With the rapid growth of electronic information, data mining has been widely applied to the identification of useful knowledge from the huge bank of extant data. Among several functionalities of data mining, classification is crucially important for the predication of unseen data, and has been successfully utilized in several real-world applications. In the research domain of classification, there are several important issues to be addressed; including scalability, imbalanced datasets, ensemble classifiers, multi-relational databases mining, privacy-preservation, and so on. Since many real-world data nowadays come in the form of consecutive data blocks, researchers have focused increasing attention on data stream mining. Most proposed approaches to data stream mining assume that data blocks are to be obtained in stationary distributions. This assumption is unreasonable since the distribution underlying the data is likely to change over time. This problem, known as concept drift, complicates the task of learning a classification model from data streams. Proposed solutions to mine concept-drifting data streams consume some unnecessary system resources, including computational costs to rebuild the decision tree or storage costs to record similar data blocks. In addition, they generally do not sufficiently account for the problem of concept drift: a) the proposed solutions can detect the changes until the number of drifting instances reaches a threshold to cause obvious difference in accuracy or information gain or gini index; b) the proposed solutions would make a wrong estimation when there are two-way drifts. For some real-time applications such as computer virus detection, a sensitive approach to detecting drifting concepts is very important. In this dissertation, we propose a sensitive concept drift probing decision tree algorithm named SCRIPT to accurately and sensitively mine concept-drifting data streams. Then, we address proposed solutions for mining concept-drifting data that focus only on accurately updating the classification model. They are deficient in that they are unable to provide users with a satisfactory solution to concept drift; the rules of which may be of considerable interest to some users. Mining concept-drifting rules is both of great academic interest and high practically applicability. Some examples include: doctors desiring to know the root causes behind variations in the causes and development of disease, scholars longing for the rules underlying weather transition, and sellers wanting to discern the reasons why consumers’ shopping habits change. However, this issue was ignored in the past. We propose a concept drift rule mining tree algorithm named CDR-Tree to accurately elucidate the underlying rules governing concept drift. Another important characteristic of CDR-Tree is that it can efficiently and accurately generate classification models via a simple extraction procedure instead of building them from scratch should classification models be required. Finally, in order to speed up the building procedure and simplify the rules produced by CDR-Tree, we focus our attention on discretization techniques. Discretization is a technique for reducing the number of values for a given continuous attribute by dividing the range of the attribute into a finite set of adjacent intervals. It is an important data preprocessing technique for data mining algorithms which are highly sensitive to the size of data or are only able to handle categorical attributes. Empirical evaluations show that the discretization technique has the potential to speed up the learning process while retaining or even improving predictive accuracy of learning algorithms. Unfortunately, all of proposed discretization algorithms are designed to handle only single-valued and single-labeled data and are infeasible to discretize the multi-valued and multi-labeled data used in CDR-Tree. We propose a new discretization approach named OMMD (ordered multi-valued and multi-labeled discretization algorithm) as the solution to discretize the input data of CDR-Tree. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT008923814 http://hdl.handle.net/11536/78247 |
顯示於類別: | 畢業論文 |