Granular computing for imbalanced data: theory and applications

Title:	粒化計算處理不平衡資料之理論與應用 Granular computing for imbalanced data: theory and applications
Authors:	陳隆昇 Chen, Long-Sheng; 蘇朝墩 Su, Chao-Ton; 李榮貴 Li, Rong-Kwei; 工業工程與管理學系 (Department of Industrial Engineering and Management)
Keywords:	Granular computing;Information granulation;Class imbalance problems;Fuzzy ART neural networks;Knowledge acquisition;Machine learning
Publication date:	2005
Abstract:	In recent years, machine learning techniques have provided effective tools for classification problems. However, when learning from imbalanced data, traditional methods predict minority instances poorly. This problem is crucially important because it arises in many domains of environmental, vital, or commercial significance, such as fraud detection, text mining, spam detection, medical diagnosis, and fault monitoring and inspection.
In this study, we propose novel methods, called “Granular Computing” (GrC) models, to tackle class imbalance problems. Granular computing, which is oriented towards representing and processing Information Granules (IGs), is a computing paradigm that embraces a number of modeling frameworks. GrC imitates the human instinct for processing information and is becoming an important topic in computer science, logic, philosophy, and other fields. When describing a problem that involves incomplete, uncertain, or vague information, human beings tend to shy away from detailed numbers and reason with aggregates instead. We are forced to consider IGs, which are collections of entities grouped together by their similarity, functional adjacency, or indistinguishability. A GrC model can not only remove unnecessary details and provide better insight into the essence of the data, but also solve class imbalance problems effectively. This study aims to develop two kinds of GrC models, the “Knowledge Acquisition via Information Granulation” (KAIG) model and the “Information Granules based method” (IG based method), for dealing with discrete and continuous data, respectively. In both models, the homogeneity index (H-index) and the undistinguishable ratio (U-ratio) are introduced to determine a suitable level of granularity (i.e., a suitable number of IGs). A Fuzzy Adaptive Resonance Theory (Fuzzy ART) neural network is used to construct the IGs. In addition, in the KAIG model we propose the concept of “sub-attributes” to describe granules and to handle overlap among granules. In the IG based method, data characteristics are employed to represent IGs. The main objectives of this study are: 1. Develop the KAIG model to construct IGs and to discover knowledge from them. Seven data sets from the UCI data bank (including one imbalanced diagnosis data set) are used to evaluate the effectiveness of the KAIG model.
Under different performance indexes (Overall Accuracy, G-mean, and ROC curve), the experimental results demonstrate the superiority of our method over C4.5 and the Support Vector Machine (SVM). 2. Apply the KAIG model to solve class imbalance problems in industrial engineering related areas. First, the KAIG model is used to improve the classification performance of a dynamic scheduling system within a simulated Flexible Manufacturing System environment. Second, a real case of cellular phone inspection illustrates the excellent ability of the KAIG model to identify rare defective products. In addition, the KAIG model can reduce redundant test items and shorten inspection time. For imbalanced data, these applications show that the KAIG model can dramatically increase Negative Accuracy (the capability of detecting minority instances) without reducing Overall Accuracy. 3. Propose the IG based method to deal with continuous imbalanced data. In this method, different data characteristics and their combinations are employed to represent the constructed IGs, and a classifier is then built from these representatives of the IGs. An actual diabetes medical diagnosis data set is used to evaluate the effectiveness of this method. Compared with traditional techniques, the proposed method is shown to be superior for learning from imbalanced data.
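The abstract names Overall Accuracy, Negative Accuracy, and G-mean as performance indexes but does not define them. The sketch below shows one common way to compute them, under two assumptions that are not confirmed by the record: the minority class is the one labeled "negative" (as the phrase "Negative Accuracy (the capability of detecting minority instances)" suggests), and G-mean is the usual geometric mean of the two per-class accuracies. The function name `class_metrics`, the label encoding, and the toy data are hypothetical and chosen only for illustration; they are not results from the thesis.

```python
import math


def class_metrics(y_true, y_pred, negative_label=0):
    """Compute per-class accuracies and G-mean for a two-class problem.

    Assumption: `negative_label` marks the minority class; every other
    label is treated as the positive (majority) class.
    """
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == negative_label:
            if pred == negative_label:
                tn += 1  # minority instance correctly detected
            else:
                fp += 1  # minority instance missed
        else:
            if pred == negative_label:
                fn += 1  # majority instance flagged as minority
            else:
                tp += 1  # majority instance correctly classified

    total = tp + tn + fp + fn
    overall = (tp + tn) / total if total else 0.0
    positive_acc = tp / (tp + fn) if (tp + fn) else 0.0
    negative_acc = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "overall_accuracy": overall,
        "positive_accuracy": positive_acc,
        "negative_accuracy": negative_acc,
        "g_mean": math.sqrt(positive_acc * negative_acc),
    }


# Toy data (illustrative only): 95 majority instances (1), 5 minority (0).
y_true = [1] * 95 + [0] * 5

# A classifier that always predicts the majority class.
always_majority = [1] * 100
print(class_metrics(y_true, always_majority))

# A classifier that finds 4 of the 5 minority instances and
# mislabels 5 majority instances as minority.
balanced = [1] * 90 + [0] * 5 + [0] * 4 + [1]
print(class_metrics(y_true, balanced))
```

On the toy data, the always-majority classifier reaches an Overall Accuracy of 0.95 while its Negative Accuracy and G-mean are both 0. This shows why Overall Accuracy alone can hide poor detection of minority instances.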
URI:	http://140.113.39.130/cdrfb3/record/nctu/#GT009133814 http://hdl.handle.net/11536/57935
Appears in Collections:	Thesis

Files in This Item:

381401.pdf
