Title: Improving the Prediction Accuracy of Classification Model for Two Types of Imbalanced Data (改善兩類別不平衡資料之分類模型準確率)
Authors: Lai, Tsung-Wei (賴宗偉)
Tong, Lee-Ing (唐麗英)
Horng, Ruey-Yun (洪瑞雲)
Department of Industrial Engineering and Management
Keywords: Imbalanced data; Resampling; Design of Experiments; Dual Response Surface Methodology; Classification model; Two-class dataset
Publication date: 2017
Abstract: Many fields require classification models that predict the group to which future observations belong, so improving the accuracy of such models is an important issue. In real-world data the classes rarely contain the same number of observations, and one class often has markedly more observations than the others; such data are called imbalanced data. When a classification model is built on imbalanced data, the difference in class sizes can yield a model whose overall prediction accuracy and majority-class accuracy are both high while its minority-class accuracy is quite low. In many practical applications, however, researchers are particularly interested in the minority class and want it to be predicted accurately. Because most datasets are two-class, this study addresses two-class imbalanced data. Before the classification model is built, Design of Experiments (DOE) and the Dual Response Surface Methodology are used to find the optimal number of majority-class observations to resample and the optimal number of minority-class observations to generate. The classification model is then built on the adjusted sample, with the aim of maximizing the classification accuracy of the majority class while keeping the classification accuracy of the minority class at a level acceptable to the researcher. Finally, three two-class imbalanced datasets from the KEEL-dataset repository are used to show that the proposed method effectively improves the minority-class accuracy of two-class classification models.
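The idea in the abstract, treating the resampled size of each class as an experimental factor and then choosing the setting that maximizes majority-class accuracy subject to a floor on minority-class accuracy, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the toy one-dimensional Gaussian data, the single-cut classifier, the factor levels, the 0.80 floor, and all function names. Minority observations are oversampled by plain duplication, and the fitted response surfaces are replaced by a direct comparison of the design points on a full factorial grid.

```python
import random
from itertools import product

def make_data(n_major, n_minor, rng):
    """Toy 1-D data (an assumption): majority ~ N(0, 1), minority ~ N(2, 1)."""
    major = [rng.gauss(0.0, 1.0) for _ in range(n_major)]
    minor = [rng.gauss(2.0, 1.0) for _ in range(n_minor)]
    return major, minor

def resample(major, minor, n_major, n_minor, rng):
    """Undersample the majority class (without replacement) and
    oversample the minority class by drawing with replacement."""
    return rng.sample(major, n_major), rng.choices(minor, k=n_minor)

def per_class_accuracy(major, minor, cut):
    """Rule: predict 'minority' when x > cut. Returns (majority acc, minority acc)."""
    acc_major = sum(x <= cut for x in major) / len(major)
    acc_minor = sum(x > cut for x in minor) / len(minor)
    return acc_major, acc_minor

def fit_cut(major, minor):
    """Pick the cut that maximizes overall training accuracy on a fixed grid."""
    grid = [i / 20 for i in range(-60, 101)]  # -3.0 .. 5.0 in steps of 0.05

    def n_correct(cut):
        return sum(x <= cut for x in major) + sum(x > cut for x in minor)

    return max(grid, key=n_correct)

def run_design(major_levels, minor_levels, seed=0):
    """Evaluate every (majority size, minority size) pair of a full factorial design."""
    rng = random.Random(seed)
    train_major, train_minor = make_data(450, 50, rng)  # 9:1 imbalance
    test_major, test_minor = make_data(900, 100, rng)   # held-out data, same ratio
    results = []
    for n_major, n_minor in product(major_levels, minor_levels):
        maj, mino = resample(train_major, train_minor, n_major, n_minor, rng)
        cut = fit_cut(maj, mino)
        acc_major, acc_minor = per_class_accuracy(test_major, test_minor, cut)
        results.append((n_major, n_minor, acc_major, acc_minor))
    return results

def choose_setting(results, min_minor_acc):
    """Among settings whose minority accuracy meets the floor, return the one
    with the highest majority accuracy; None if no setting is feasible."""
    feasible = [r for r in results if r[3] >= min_minor_acc]
    return max(feasible, key=lambda r: r[2]) if feasible else None

if __name__ == "__main__":
    results = run_design(major_levels=[50, 250, 450], minor_levels=[50, 150, 300])
    for row in results:
        print(row)  # (n_major, n_minor, majority accuracy, minority accuracy)
    print("chosen:", choose_setting(results, min_minor_acc=0.80))
```

In a study like the one described, the grid comparison in `choose_setting` would presumably give way to second-order models fitted to the design points, so that the optimum can fall between the tested levels.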
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070453346
http://hdl.handle.net/11536/141061
Appears in collections: Theses