Title: 改善多類別不平衡資料之分類準確率
Improving the Prediction Accuracy of Classification Model for Multi-Class Imbalanced Data
Authors: 林子硯
Lin, Tzu-Yen
Keywords: 不平衡資料;重新取樣;實驗設計;反應曲面法;自組性演算法;Imbalanced Data;Re-sampling;Design of Experiments;Response Surface Methodology;Group Method of Data Handling
Issue Date: 2015
Abstract: 在預測不同類別的資料時,一般的做法是從過去已知類別的資料中,依據各類別資料之特性建構分類模型,再藉此模型預測新資料的類別。然而在實際的類別資料中,通常某一類別的資料數量會顯著較另一類別的資料數量多,此型態資料稱為不平衡(imbalanced)資料。使用不平衡資料建構分類模型時,大部份的樣本會傾向被歸類到多數類別,而造成多數類別的分類準確率高、少數類別的分類準確率低,但整體的分類準確率卻又相當高之情形。相較於多數類別資料,少數類別的預測常是研究者有興趣的議題。無論整體準確率有多高,若無法正確分類出少數類別的資料,分類模型可能不具任何實用價值。因此,為提升少數類別資料的分類準確率,本研究利用實驗設計法(design of experiment,DOE)與反應曲面法(response surface methodology,RSM)先求得可提升少數類別資料的分類準確率之最適重新取樣比例,再使用自組性演算法(group method of data handling,GMDH)建構分類模型,並透過兩個實例來說明本研究提出的最適重新取樣方法確實可以有效提升少數類別的分類準確率。
For classifying categorical data, the common approach is to build a classification model from historical data with known class labels and then use that model to classify new observations. Real-world categorical data are often imbalanced: one class (the majority class) contains significantly more observations than another (the minority class). When a classification model is built from imbalanced data, most observations tend to be assigned to the majority class. Consequently, the prediction accuracy for the majority class is high and that for the minority class is quite low, yet the overall prediction accuracy still appears high. The minority class, however, is often the one of greater interest to researchers. No matter how high the overall classification accuracy is, a model that cannot correctly classify minority-class observations may have no practical value. Therefore, to improve the prediction accuracy for the minority class, this study first uses design of experiments (DOE) and response surface methodology (RSM) to determine the optimal re-sampling ratio, and then constructs the classification model with the group method of data handling (GMDH). Finally, two real cases are used to demonstrate that the proposed optimal re-sampling method effectively improves the prediction accuracy for the minority class.
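Note: the gap between overall accuracy and minority-class accuracy described in the abstract can be illustrated with a minimal sketch. The class sizes (950 versus 50), the labels, and the always-predict-majority classifier below are hypothetical choices made for illustration only; they are not data or models from the thesis.

```python
from collections import Counter


def overall_accuracy(y_true, y_pred):
    """Fraction of all observations that are predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_accuracy(y_true, y_pred):
    """Map each class to the fraction of its observations predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Counter returns 0 for classes that were never predicted correctly.
    return {c: correct[c] / totals[c] for c in totals}


# Hypothetical imbalanced data: 950 majority (label 0), 50 minority (label 1).
y_true = [0] * 950 + [1] * 50

# A degenerate classifier that assigns every observation to the majority class.
y_pred = [0] * len(y_true)

overall = overall_accuracy(y_true, y_pred)
by_class = per_class_accuracy(y_true, y_pred)

print("overall accuracy:", overall)
print("per-class accuracy:", by_class)
```

In this sketch the overall accuracy is 0.95 even though no minority-class observation is classified correctly, which is why per-class accuracy is the more informative measure when the minority class is the one of interest.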
Appears in Collections: Thesis