Title: Developing an Under-sampling Strategy for a Classification Model with Imbalanced Data using Cluster Analysis and GMDH
Authors: 趙培鈞 (Chao, Pei-Chun)
唐麗英 (Tong, Lee-Ing)
李榮貴 (Li, Rong-Kwei)
Department of Industrial Engineering and Management
Keywords: Class Imbalance;GMDH;Cluster Analysis;Under-sampling;Binary Classification Model
Issue Date: 2010
Abstract: In practice, many binary classification problems suffer from class imbalance: one class in the data set has far more samples than the other. If a classifier is trained directly on such data, it tends to assign nearly all instances to the majority class. As a result, overall accuracy and majority-class accuracy may be very high while minority-class accuracy is very low, so the model can barely identify the minority class. In many applications, however, the minority class is the one of interest. For imbalanced data it is therefore important to improve minority-class accuracy while keeping overall accuracy high. Many studies have suggested under-sampling the majority class to reduce the imbalance ratio and narrow the gap between majority-class and minority-class accuracy, but discarding samples can also lower the accuracy of the binary classification model because the retained data are incomplete. The main objective of this study is to develop an under-sampling strategy that uses Cluster Analysis and the Group Method of Data Handling (GMDH) to select representative samples from the majority class. Filtering out non-representative majority-class samples reduces the ratio between majority-class and minority-class samples and thereby improves the classification accuracy of the classifier. Finally, two UCI data sets are used to demonstrate the effectiveness of the proposed method.
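The abstract does not say how representative majority-class samples are selected, so the following is only a minimal sketch of one common form of cluster-based under-sampling, not the procedure used in the thesis. It assumes plain k-means with one cluster per minority sample and keeps the real majority sample nearest each centroid; the GMDH step is omitted entirely. The function names (`kmeans`, `undersample_majority`) and the synthetic two-dimensional data are hypothetical and chosen only for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct samples
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids


def undersample_majority(majority, n_keep, seed=0):
    """Keep the real majority sample closest to each of n_keep centroids.

    Assumption: "representative" means "nearest to a cluster centre".
    May return fewer than n_keep samples if two centroids share a
    nearest sample.
    """
    kept = []
    for centroid in kmeans(majority, n_keep, seed=seed):
        candidate = min(majority,
                        key=lambda p: squared_distance(p, centroid))
        if candidate not in kept:
            kept.append(candidate)
    return kept


# Synthetic imbalanced data: 200 majority vs. 20 minority samples.
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
minority = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(20)]

kept = undersample_majority(majority, n_keep=len(minority))
print("ratio before:", len(majority) / len(minority))
print("ratio after: ", len(kept) / len(minority))
```

In this sketch the retained majority samples are actual observations rather than synthetic centroids, so a classifier trained on `kept + minority` still sees only real data. How many clusters to use, and how GMDH enters the selection, are design choices the abstract leaves open.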
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT079833529
http://hdl.handle.net/11536/47876
Appears in Collections: Thesis