Title: Using a Novel Granular Computing and Principal Component Analysis to Improve the Classification Accuracy of Imbalanced Data (應用改良之粒化計算與主成份分析提升不對稱資料之分類準確率)
Authors: 蔡函伶
唐麗英
李榮貴
Department of Industrial Engineering and Management
Keywords: Class Imbalanced; Principal Component Analysis; Granular Computing; Information Granule; Classification Model; Decision Tree
Issue Date: 2010
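Abstract: In many practical cases of building binary classification models, a class imbalance problem is present. Class imbalance means that most samples belong to the majority class while very few belong to the minority class. A traditional classifier learns from training samples to build a model that separates the two classes; when the classes are imbalanced, however, the classifier may achieve high overall accuracy and high majority-class accuracy while its minority-class accuracy is far too low. To address this, most existing studies build classification models for imbalanced data using sampling, which mitigates the low minority-class accuracy; but sampling leaves the data incomplete, so the resulting model's classification accuracy is not high. Other studies combine the concept of granular computing (Granular Computing) with traditional classifiers. This approach needs no sampling and uses all the data to build the model, but it granulates every numerical variable, which increases computation time and complexity. Moreover, when the data are clustered into information granules (Information Granule), minority-class samples may be spread evenly across the granules, so that after granulation the number of majority-class granules still far exceeds the number of minority-class granules; in the worst case no minority-class granule is formed at all, and the classifier's accuracy suffers. The main purpose of this study is therefore to propose an improved information granulation method, combined with principal component analysis, for building classification models on binary imbalanced data. The study first applies principal component analysis to reduce the dimensionality of the numerical variables, then uses K-means clustering to group the data into information granules, and then applies the new idea proposed here of "purifying" granules that contain impurities. The purified information granules replace the original data in building the classification model, which reduces the effect of class imbalance on the classifier and thereby raises minority-class accuracy. In addition, the study represents information granules by means of sub-attributes (Sub-Attribute) and finally builds the classification model with a decision tree. Validation on real cases shows that the models built with the proposed method do reduce the large gap between majority-class and minority-class accuracy caused by class imbalance; in other words, minority-class accuracy is effectively improved while good overall accuracy is maintained.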
In many practical binary classification applications, the data suffer from the class imbalance problem. Instances in such data fall into two categories: majority class instances and minority class instances. Traditional classification algorithms construct a model from training samples to distinguish between the two categories. For imbalanced data, the model's accuracy on all instances and on majority class instances may be high while it can barely identify minority class instances. Many classification models for class-imbalanced data rely on sampling methods. However, sampling discards part of the information and may leave the classification model with poor accuracy. Some studies have shown that combining the granular computing concept with traditional classification algorithms can tackle class imbalance problems. This approach builds the model from the complete data without sampling, which leads to long computing times. In this study, a novel granular computing method and principal component analysis are used to construct a classification model with improved classification accuracy. Finally, several examples demonstrate the effectiveness of the proposed method.
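The granulation stage named in the abstract (principal component analysis, then K-means clustering into information granules, then purification) can be illustrated with a minimal pure-Python sketch. This is not the thesis's implementation. The abstract does not define the purification rule, so the sketch assumes one plausible reading: every cluster that mixes both classes is split into one granule per class. The synthetic 95:5 data, the choice of a single principal component, k = 3, and all function names are hypothetical. The sub-attribute representation and the decision tree are omitted.

```python
import random


def project_on_first_pc(rows, iters=200):
    """Center the data and project it onto the leading eigenvector of the
    covariance matrix, found by power iteration (a one-component PCA)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(d)) for c in centered]


def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm on scalar values with a deterministic quantile start."""
    srt = sorted(values)
    centers = [srt[(2 * i + 1) * len(srt) // (2 * k)] for i in range(k)]
    assign = [0] * len(values)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: abs(x - centers[c]))
                  for x in values]
        for c in range(k):
            members = [x for x, a in zip(values, assign) if a == c]
            if members:  # leave an empty cluster's center where it was
                centers[c] = sum(members) / len(members)
    return assign


def purify(assign, labels):
    """ASSUMED purification rule: split each cluster by class label so that
    every resulting granule contains samples of one class only."""
    granules = {}
    for idx, (cluster, label) in enumerate(zip(assign, labels)):
        granules.setdefault((cluster, label), []).append(idx)
    return granules


# Hypothetical imbalanced data: 95 majority samples (label 0), 5 minority (label 1).
rng = random.Random(0)
rows = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(95)]
rows += [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(5)]
labels = [0] * 95 + [1] * 5

projected = project_on_first_pc(rows)
assign = kmeans_1d(projected, k=3)
granules = purify(assign, labels)

n_major = sum(1 for (_, lab) in granules if lab == 0)
n_minor = sum(1 for (_, lab) in granules if lab == 1)
print("samples  (majority : minority) = 95 : 5")
print("granules (majority : minority) =", n_major, ":", n_minor)
```

Under this assumed rule, at least one minority-class granule always exists and there are at most k granules per class, so the class ratio among granules is far less skewed than among the raw samples. A classifier trained on the granules would therefore see a more balanced training set, which is the effect the abstract attributes to purification.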
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT079833541
http://hdl.handle.net/11536/47888
Appears in Collections: Thesis