Title: | 多类别不平衡资料之最适重新取样策略 Optimal Re-sampling Strategy for Multi-Class Imbalanced Data |
Authors: | 吴秉怡 Wu, Ping-Yi; 唐丽英 Tong, Lee-Ing; Department of Industrial Engineering and Management |
Keywords: | Multi-Class Imbalanced Data; Re-sampling Strategy; Design of Experiments (DOE); Dual Response Surface Methodology (DRS) |
Issue Date: | 2012 |
Abstract: | Building a classification model from labeled data to predict the class of new observations matters in many fields. In marketing, for example, customer profiles are used to predict which brand a customer will buy, and banks use borrower data to judge whether a loan customer will default. Constructing an accurate classification model is therefore an important problem. In practice, however, class data are usually imbalanced: the number of samples in one class is markedly larger or smaller than in another. When a model is built directly on imbalanced data, then regardless of the classification method used (for example, discriminant analysis or neural networks), overall accuracy is often quite high while accuracy on the minority classes is too low. Because the minority classes are usually far more important in practice than the majority class, improving minority-class accuracy is essential. Most existing studies address only two-class imbalanced data, and few consider methods for three or more classes. This study therefore uses design of experiments (DOE) and dual response surface methodology (DRS) to propose an optimal re-sampling strategy for multi-class imbalanced data that effectively improves the classification accuracy of the minority classes. Multi-class imbalanced data sets from the KEEL repository are used to verify that the proposed method is effective. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT070053313 http://hdl.handle.net/11536/71472 |
Appears in Collections: | Thesis |
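
The abstract's central observation, that overall accuracy can be high while minority-class accuracy is near zero, and the general idea of per-class re-sampling can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the three-class data, the class names, the helper functions `resample` and `per_class_recall`, and the target counts are hypothetical. In particular, the thesis selects re-sampling levels with DOE and DRS, which this sketch does not attempt; the targets below are arbitrary placeholders.

```python
import random
from collections import Counter


def resample(labels, targets, seed=0):
    """Return sample indices so that class c appears targets[c] times.

    Classes larger than their target are randomly under-sampled without
    replacement; smaller classes keep all samples and are randomly
    over-sampled with replacement. Classes missing from `targets` are
    left at their original size.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    chosen = []
    for c, idx in by_class.items():
        n = targets.get(c, len(idx))
        if n <= len(idx):
            chosen.extend(rng.sample(idx, n))
        else:
            chosen.extend(idx + rng.choices(idx, k=n - len(idx)))
    return chosen


def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    total = Counter(y_true)
    hit = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hit[c] / total[c] for c in total}


# Hypothetical three-class data with 90 / 8 / 2 samples.
labels = ["A"] * 90 + ["B"] * 8 + ["C"] * 2

# A trivial classifier that always predicts the majority class.
pred = ["A"] * len(labels)
overall = sum(t == p for t, p in zip(labels, pred)) / len(labels)
recall = per_class_recall(labels, pred)

# Placeholder targets (not derived from DOE/DRS): 30 samples per class.
idx = resample(labels, {"A": 30, "B": 30, "C": 30})
balanced = Counter(labels[i] for i in idx)
```

With these inputs the always-majority classifier reaches 90% overall accuracy while its recall on classes B and C is zero, which is the failure mode the abstract describes. The re-sampled index list contains 30 samples per class; how well a model trained on it performs depends on the chosen targets, which is the quantity the thesis sets out to optimize.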