On Model Diagnostics of Performance Measures in Classification Models and Its Application to Active Learning

Title:	On Model Diagnostics of Performance Measures in Classification Models and Its Application to Active Learning (分類模型中表現測度的模型診斷估計和探討及其於主動學習之應用)
Authors:	Ke, Bo-Shiang (柯博祥); Chang, Yuan-Chin (張源俊); Huang, Guan-Hua (黃冠華); Institute of Statistics
Keywords:	Receiver operating characteristic (ROC) curve; Area under ROC curve (AUC); Local influence; Influence function; Imbalanced data; Cumulative lift charts; Active learning; Performance measures; Binary classification
Issue Date:	2017
Abstract:	Classification is a prominent topic across research fields. Because each field emphasizes different goals, many classifiers with distinct characteristics, and many performance measures for evaluating them, have been proposed. Choosing a suitable model for a given problem depends on choosing a sensible measure to quantify predictive ability, so it is essential to understand how these measures differ. The first part of this thesis surveys performance measures that originated in different fields and investigates the inequalities among them and whether they remain distinctive under imbalanced data. Through this systematic review, we hope that researchers and practitioners can use these measures effectively. Even when an appropriate measure is chosen, the estimated performance may still be distorted by influential observations. In the second part, we apply the influence function and local influence to the area under the receiver operating characteristic curve (AUC) to build theoretical approaches, and we use cumulative lift charts as a graphical approach for identifying potential influential observations. Each approach has its own strengths and weaknesses, so we integrate them into a set of guidelines for diagnosing influential cases efficiently. Finally, we turn to a popular application area of classification, active learning. Advances in technology have made data easier to collect and data sets larger, and labeling every record requires considerable money and effort. Active learning therefore selects only a relatively small subset of the database to label and uses it to train the classifier. The training subjects are selected sequentially, with a selection criterion that relies on the properties of the influence function. We hope that this attempt gives traditional robust statistical methods a novel application. All methods proposed in this thesis rely only on the continuous scores produced by a classifier, so they are not restricted to any single statistical model and apply to any classifier that outputs real values.
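Illustrative note: the idea of diagnosing influential observations for the AUC from continuous scores alone can be sketched with a simple case-deletion check. This is a minimal sketch, not the thesis's influence-function or local-influence derivation. It swaps in a plain leave-one-out comparison of the empirical (Mann-Whitney) AUC, and the function names and toy data are hypothetical.

```python
import numpy as np


def empirical_auc(scores, labels):
    """Mann-Whitney estimate of the AUC: P(pos > neg) + 0.5 * P(pos == neg)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def case_deletion_influence(scores, labels):
    """Change in AUC when each observation is deleted in turn (hypothetical helper).

    Assumes each class keeps at least one member after any single deletion.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    full_auc = empirical_auc(scores, labels)
    keep = np.ones(len(scores), dtype=bool)
    change = np.empty(len(scores))
    for i in range(len(scores)):
        keep[i] = False
        change[i] = empirical_auc(scores[keep], labels[keep]) - full_auc
        keep[i] = True
    return change


# Toy data (made up): one positive case receives a very low score.
scores = np.array([0.9, 0.8, 0.7, 0.1, 0.6, 0.4, 0.3, 0.2])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

influence = case_deletion_influence(scores, labels)
most_influential = int(np.argmax(np.abs(influence)))
```

In this toy data the low-scoring positive case changes the AUC the most when deleted. Only the scores and labels are used, so the check does not depend on which classifier produced them.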
URI:	http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070182602 http://hdl.handle.net/11536/142915
Appears in Collections:	Thesis