Determining the optimal re-sampling strategy for a classification model with imbalanced data using design of experiments and response surface methodologies

doi:10.1016/j.eswa.2010.09.087

Full metadata record

DC Field	Value	Language
dc.contributor.author	Tong, Lee-Ing	en_US
dc.contributor.author	Chang, Yung-Chia	en_US
dc.contributor.author	Lin, Shan-Hui	en_US
dc.date.accessioned	2014-12-08T15:11:52Z	-
dc.date.available	2014-12-08T15:11:52Z	-
dc.date.issued	2011-04-01	en_US
dc.identifier.issn	0957-4174	en_US
dc.identifier.uri	http://dx.doi.org/10.1016/j.eswa.2010.09.087	en_US
dc.identifier.uri	http://hdl.handle.net/11536/9099	-
dc.description.abstract	Imbalanced data are common in many machine learning applications. In an imbalanced data set, the number of instances in at least one class is significantly higher or lower than that in the other classes. Consequently, most classifiers trained on imbalanced data are exposed to an unequal number of instances in each class and thus fail to construct an effective model. Balancing the sample sizes of the classes with a re-sampling strategy is a conventional means of enhancing the effectiveness of a classification model for imbalanced data. Numerous studies have sought the appropriate re-sampling proportion for each class by trial and error (Barandela, Valdovinos, Sanchez, & Ferri, 2004; He, Han, & Wang, 2005; Japkowicz, 2000; McCarthy, Zabar, & Weiss, 2005), but finding the optimal strategy for each class may be infeasible with such a method. Therefore, this work proposes a novel analytical procedure to determine the optimal re-sampling strategy based on design of experiments (DOE) and response surface methodologies (RSM). The proposed procedure, S-RSM, can be utilized with any classifier; the C4.5 algorithm is adopted for illustration. The classification results are evaluated by using the area under the receiver operating characteristic curve (AUC) as a performance measure. Desirable features of the AUC index include independence of the decision threshold and invariance to a priori class probabilities. Furthermore, results on five real-world data sets demonstrate that a classification model trained on data obtained from S-RSM achieves a higher AUC score than one obtained using the oversampling or undersampling approach. (C) 2010 Elsevier Ltd. All rights reserved.	en_US
dc.language.iso	en_US	en_US
dc.subject	Re-sampling strategy	en_US
dc.subject	Imbalanced data	en_US
dc.subject	Classifier	en_US
dc.subject	Machine learning	en_US
dc.subject	Design of experiments	en_US
dc.subject	Response surface methodologies	en_US
dc.subject	The area under ROC curve	en_US
dc.title	Determining the optimal re-sampling strategy for a classification model with imbalanced data using design of experiments and response surface methodologies	en_US
dc.type	Article	en_US
dc.identifier.doi	10.1016/j.eswa.2010.09.087	en_US
dc.identifier.journal	EXPERT SYSTEMS WITH APPLICATIONS	en_US
dc.citation.volume	38	en_US
dc.citation.issue	4	en_US
dc.citation.spage	4222	en_US
dc.citation.epage	4227	en_US
dc.contributor.department	工業工程與管理學系	zh_TW
dc.contributor.department	Department of Industrial Engineering and Management	en_US
dc.identifier.wosnumber	WOS:000286904600141	-
dc.citation.woscount	7	-
Appears in Collections:	Articles
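
Illustrative note (not part of the original record): the abstract states that the AUC is independent of the decision threshold and invariant to a priori class probabilities. The following is a minimal Python sketch of that property, not the authors' S-RSM procedure. The function name `auc` and the toy scores and labels are assumptions made up for illustration; the AUC is computed as the fraction of positive/negative pairs in which the positive instance receives the higher score, so no threshold is ever chosen.

```python
from itertools import product


def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative; tied scores count as half a win. No threshold is involved."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one instance of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy classifier scores, invented for illustration (1 = minority class).
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0, 0]
base = auc(scores, labels)

# Replicate every negative instance ten times: the class prior shifts
# from 3:5 to 3:50, but the pairwise ordering of scores is unchanged.
neg_idx = [i for i, y in enumerate(labels) if y == 0]
scores_imb = scores + [scores[i] for i in neg_idx] * 9
labels_imb = labels + [0] * (len(neg_idx) * 9)
imbalanced = auc(scores_imb, labels_imb)

print(base, imbalanced)
```

Both values agree because replicating one class multiplies the number of winning pairs and the total number of pairs by the same factor. This only shows that the metric itself is insensitive to the class ratio of the evaluation set when the scores are held fixed; it says nothing about how re-sampling the training data changes the classifier.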

Files in This Item:

000286904600141.pdf
