Title: | 資料探勘技術應用於不均衡資料預測表現比較 – 以信用卡違約風險預測為例 A Comparison of Data Mining Techniques for Imbalanced Data – An Example of Credit Card Default Risk Prediction |
Authors: | 邱弘懿 黃仕斌 Chiou, Hung-Yi Institute of Management of Technology |
Keywords: | credit card;credit default;credit scoring model;classifier;data mining;imbalanced data |
Issue Date: | 2016 |
Abstract: | Data from the Financial Supervisory Commission show that the number of credit cards issued has grown steadily in recent years. Banks therefore urgently need a basis for assessing credit risk when issuing credit cards and adjusting credit limits. In addition, IFRS 9, published by the International Accounting Standards Board in 2014, introduces the concept of expected credit losses and will be phased in from 2018. As computing power increases year by year, it will become possible to assess customer default risk more objectively and thus to recognize bad debt expense on a reasonable basis.
Previous studies that examined and compared data mining techniques for training and predicting credit card default risk focused mainly on improving overall prediction accuracy. However, defaulters make up a markedly smaller share of the data than non-defaulters, so the accuracy of most models is dominated by the majority class: predictions are accurate for the majority class and poor for the minority class. For a bank, misclassifying a non-defaulter as a defaulter costs far less than misclassifying a defaulter as a non-defaulter, so identifying customers with a high risk of default is an important problem.
This study uses data preprocessing to provide a fair basis for comparison and compares the classification performance of seven classifiers on imbalanced data: Logistic Regression, Discriminant Analysis, K-Nearest Neighbors, Naïve Bayes, Artificial Neural Network, Random Forest, and Support Vector Machine. The class imbalance in the original data is then treated with a sampling method that combines the Synthetic Minority Oversampling Technique (SMOTE) with Tomek Links, and the seven classifiers are evaluated again on the resampled data. Finally, based on the results, the study offers suggestions on which classifiers are appropriate in different situations. |
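The abstract's argument that overall accuracy is dominated by the majority class, and that missing a defaulter costs more than wrongly flagging a non-defaulter, can be illustrated with a small sketch. All numbers here are hypothetical and are not taken from the thesis: the 5% default rate, the two sets of predictions, and the cost ratio of 20 to 1 are assumptions chosen only to make the arithmetic visible.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the default class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def summarize(y_true, y_pred, cost_fp, cost_fn):
    """Overall accuracy, recall on defaulters, and total misclassification cost."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    cost = fp * cost_fp + fn * cost_fn
    return accuracy, recall, cost


# Hypothetical portfolio: 1000 customers, 50 of whom default (label 1).
y_true = [1] * 50 + [0] * 950

# Classifier A (hypothetical): always predicts "non-default".
pred_a = [0] * 1000

# Classifier B (hypothetical): catches 40 of the 50 defaulters
# but wrongly flags 100 of the 950 non-defaulters.
pred_b = [1] * 40 + [0] * 10 + [1] * 100 + [0] * 850

# Assumed costs: a missed defaulter is 20 times as costly as a false alarm.
result_a = summarize(y_true, pred_a, cost_fp=1, cost_fn=20)
result_b = summarize(y_true, pred_b, cost_fp=1, cost_fn=20)

print("A: accuracy=%.2f recall=%.2f cost=%d" % result_a)
print("B: accuracy=%.2f recall=%.2f cost=%d" % result_b)
```

Under these assumed numbers, classifier A has the higher overall accuracy yet finds no defaulters and incurs the higher cost, which is why per-class measures such as recall on the minority class matter when the data are imbalanced.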
URI: | http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070353524 http://hdl.handle.net/11536/140168 |
Appears in Collections: | Thesis |