Title: 創新資訊粒化方式於不對稱資料之二元分類判別問題
Granulating Information within Binary Classification Scheme for Imbalanced Data Sets
Authors: 林奕慶
張永佳
Department of Industrial Engineering and Management
Keywords: classification model; class imbalance; Granular Computing
Publication date: 2010
Abstract: Class imbalance arises in many research fields. In a class-imbalanced data set the classes differ greatly in size, so the data split into a majority (major) class and a minority (minor) class. When all of the imbalanced data are used as training samples, overall accuracy and majority-class accuracy can be high while minority-class accuracy remains very low. Many classification models have been proposed for imbalanced data, and most of them rely on sampling. Sampling, however, can compromise the completeness of the data and make the model overly sensitive to the particular sample drawn, which leads to inaccurate models. Other researchers have applied Granular Computing to the class imbalance problem; this approach builds a classification model without sampling and so avoids the problems that sampling introduces. This study builds on the concept of Granular Computing and combines ideas from multivariate cluster analysis with a novel information granulation approach to construct classification models. The approach corrects shortcomings of earlier Granular Computing models on class-imbalanced data and improves the accuracy of the resulting classification model. Finally, the proposed model is compared with several sampling methods and with a previous Granular Computing model in terms of accuracy, AUC, and G-means. The results indicate that the novel information granulation approach performs as well as or better than those models: it narrows the gap between majority-class and minority-class accuracy and raises minority-class accuracy while maintaining a certain level of overall accuracy.
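The gap between overall accuracy and minority-class accuracy that the abstract describes can be illustrated with a small sketch. The confusion-matrix counts below are hypothetical and are not results from the thesis, and the function name `class_metrics` is an illustrative assumption. The sketch also assumes the common definition of G-mean as the geometric mean of the minority-class and majority-class accuracies; the thesis may define it differently.

```python
import math


def class_metrics(tp, fn, tn, fp):
    """Return (overall accuracy, minority accuracy, majority accuracy, G-mean).

    The minority class is treated as the positive class:
      tp / fn = minority instances classified correctly / incorrectly
      tn / fp = majority instances classified correctly / incorrectly
    """
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total          # share of all instances classified correctly
    minority_acc = tp / (tp + fn)         # also called sensitivity or recall
    majority_acc = tn / (tn + fp)         # also called specificity
    g_mean = math.sqrt(minority_acc * majority_acc)
    return accuracy, minority_acc, majority_acc, g_mean


# Hypothetical data set: 950 majority instances and 50 minority instances.
# The classifier finds only 5 of the 50 minority instances.
accuracy, minority_acc, majority_acc, g_mean = class_metrics(tp=5, fn=45, tn=940, fp=10)

print(f"overall accuracy : {accuracy:.3f}")
print(f"minority accuracy: {minority_acc:.3f}")
print(f"majority accuracy: {majority_acc:.3f}")
print(f"G-mean           : {g_mean:.3f}")
```

With these made-up counts the overall accuracy is 0.945 even though only 10% of the minority class is recognized, and the G-mean stays low because it penalizes poor accuracy on either class. This is one reason measures such as G-means and AUC are reported alongside accuracy for imbalanced data.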
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT079833531
http://hdl.handle.net/11536/47879
Appears in Collections: Thesis