Title: Model-Based Synthetic Sampling for Imbalanced Data (以模型為基礎之資料生成法應用於不平衡資料)
Authors: Hsieh, Po-Yen (謝博硯)
Liu, Chien-Liang (劉建良)
Department of Industrial Engineering and Management
Keywords: Model-based synthetic sampling (MBS); Classification; Imbalanced data; Imbalance problem
Issue Date: 2017
Abstract: Imbalanced data are data in which the number of samples differs significantly across classes. The problem arises in many domains, such as medicine, commerce, and manufacturing, and has therefore received considerable attention in recent years. Most classifiers are designed under the assumption that the class distribution is relatively balanced or that all types of misclassification carry equal cost, so minority-class data tend to be ignored. Because the minority class is often the class of interest, the prediction performance of most classifiers deteriorates severely on imbalanced data. Although many methods have been proposed in the literature, they do not generalize well and often fail to deliver stable improvement. A possible reason is that data patterns are complex, whereas earlier mechanisms are too simple and rely on unnecessary assumptions, so they fail to generate reasonable synthetic data. This study proposes a new framework for the imbalance problem, model-based synthetic sampling (MBS). MBS combines modeling with resampling so that the generated synthetic data more closely resemble real minority-class data, while avoiding the drawbacks of existing resampling and synthetic data generation methods. Experimental results indicate that MBS outperforms previously proposed methods in both effectiveness and robustness.
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070453327
http://hdl.handle.net/11536/142402
Appears in Collections: Theses