Title: 多類別不平衡資料之最適重新取樣策略
Optimal Re-sampling Strategy for Multi-Class Imbalanced Data
Authors: 吳秉怡
Wu, Ping-Yi
唐麗英
Tong, Lee-Ing
Department of Industrial Engineering and Management
Keywords: Multi-Class Imbalanced Data; Re-sampling Strategy; Design of Experiments (DOE); Dual Response Surface Methodology (DRS)
Issue Date: 2012
Abstract: Building a classification model that predicts the class of new data is important in many fields. In marketing, for example, a model can predict which brand a customer will buy from the customer's personal data, and a bank can use loan-customer data to judge whether a borrower will default. Constructing an accurate classification model is therefore an important problem. In practice, however, the data for the different classes are usually imbalanced: the number of samples in one class is significantly larger or smaller than in another. When a classification model is built directly on imbalanced data, then regardless of the classification method used (for example, discriminant analysis or neural networks), the overall classification accuracy is often quite high while the accuracy for the minority classes is too low. Because the accuracy on minority classes usually matters far more in practice than the accuracy on majority classes, improving minority-class accuracy is essential. Most existing studies address only two-class (binary) imbalanced data, and few examine how to improve classification accuracy for imbalanced data with three or more classes. This study therefore uses design of experiments (DOE) and dual response surface methodology (DRS) to propose an optimal re-sampling strategy for multi-class imbalanced data that effectively improves the classification accuracy of the minority classes. Multi-class imbalanced datasets from the KEEL dataset repository are used to verify that the proposed method is effective.
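The imbalance problem described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's DOE/DRS procedure, which this record does not describe. The three-class data, the class labels `A`/`B`/`C`, the helper names, and the per-class target counts are all hypothetical. The sketch shows a classifier that always predicts the majority class reaching high overall accuracy with zero minority-class accuracy, and then applies plain random over-sampling, one common re-sampling technique, to equalize class sizes.

```python
from collections import Counter
import random


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_accuracy(y_true, y_pred):
    """Map each class to the fraction of its samples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


def random_oversample(X, y, targets, seed=0):
    """Duplicate randomly chosen samples until each class in `targets`
    reaches its target count (hypothetical helper, not the thesis's method)."""
    rng = random.Random(seed)
    X_out, y_out = list(X), list(y)
    for cls, target in targets.items():
        idx = [i for i, label in enumerate(y) if label == cls]
        if not idx:
            continue  # nothing to duplicate for an absent class
        for _ in range(max(0, target - len(idx))):
            j = rng.choice(idx)
            X_out.append(X[j])
            y_out.append(cls)
    return X_out, y_out


# Hypothetical three-class data with a 90 / 7 / 3 split.
y_true = ["A"] * 90 + ["B"] * 7 + ["C"] * 3
X = list(range(len(y_true)))  # placeholder features

# A degenerate classifier that always predicts the majority class.
y_pred = ["A"] * len(y_true)
overall = overall_accuracy(y_true, y_pred)
by_class = per_class_accuracy(y_true, y_pred)

# Over-sample the two minority classes up to the majority-class size.
X_bal, y_bal = random_oversample(X, y_true, {"B": 90, "C": 90})
balanced_counts = Counter(y_bal)
```

Here the per-class target counts are fixed by hand. In a multi-class setting, how much to re-sample each class is a choice that has to be tuned, which is the kind of question an optimal re-sampling strategy addresses.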
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT070053313
http://hdl.handle.net/11536/71472
Appears in Collections: Thesis