Full metadata record
DC Field | Value | Language
dc.contributor.author | 柯博祥 | zh_TW
dc.contributor.author | 張源俊 | zh_TW
dc.contributor.author | 黃冠華 | zh_TW
dc.contributor.author | Ke, Bo-Shiang | en_US
dc.contributor.author | Chang, Yuan-Chin | en_US
dc.contributor.author | Huang, Guan-Hua | en_US
dc.date.accessioned | 2018-01-24T07:42:47Z | -
dc.date.available | 2018-01-24T07:42:47Z | -
dc.date.issued | 2017 | en_US
dc.identifier.uri | http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070182602 | en_US
dc.identifier.uri | http://hdl.handle.net/11536/142915 | -
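dc.description.abstract | Classification has long been a closely watched topic across research fields. Each field emphasizes its own objectives, which has produced many classification models with different characteristics and many performance measures for evaluating them. To select an appropriate model for a given problem, one must understand the differences among performance measures in order to choose a reasonable measure for quantifying a model's predictive ability. The first part of this thesis therefore introduces these measures, which originate from different fields, and further examines the inequalities among them and whether they retain their distinctive properties on imbalanced data. Through this systematic review and investigation, we hope that researchers and practitioners alike can use these performance measures effectively. Even after an appropriate measure has been chosen to quantify the predictive ability of candidate models, the estimate may still be distorted by influential observations. In the second part of this thesis, we apply the influence function and local influence to the area under the receiver operating characteristic curve to construct theoretical methods, and we use cumulative lift charts as a graphical method, to identify potential influential observations. Each method has its own strengths and weaknesses, so we integrate them and propose a set of identification guidelines for diagnosing influential observations efficiently. Finally, we turn to "active learning," a popular application area in classification. Advances in technology have made data easier to collect and datasets increasingly large. Labeling every observation with its class would require considerable money and effort. Active learning therefore selects only a relatively small portion of the database to label and uses it to build the training model. Training data are selected sequentially, relying on the inherent advantages of the influence function. We hope this attempt will give traditional robust statistical methods novel applications. All methods proposed in this thesis rely only on the continuous scores produced by a classification model, so they are not restricted to any single statistical model. | zh_TW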
dc.description.abstract | Classification is a popular topic in various research fields. Many distinct classifiers and performance measures have been proposed, each reflecting the concerns of its field. Choosing a suitable model for a given problem relies mainly on adequate performance measures, so it is essential to understand the differences between those measures. The first part of this thesis introduces these performance metrics and investigates them further, including the inequalities among them and their behavior on imbalanced data. Even with appropriately chosen metrics, the estimated performance quantities may be distorted by particular influential observations (“outliers”). In the second part of this thesis, we propose theoretical approaches based on the influence function and local influence for the area under the receiver operating characteristic curve, and a graphical approach based on cumulative lift charts, to target those observations. These approaches are complementary, so we integrate them into a synthesized guideline for efficiently diagnosing potential influential cases. Finally, we turn to a modern application, active learning, which comprises methods that focus on building a reliable classifier from a training subset that is small relative to the complete database. The subjects in this training subset are collected sequentially, and a selection criterion that relies on the nature of the influence function is employed in this process. We believe that this use of traditional robustness ideas will broaden the applications and developments in active learning research. The major advantage of the approaches in this thesis is that they require only continuous scores; they are therefore generally applicable to any classifier that can produce real values. | en_US
dc.language.iso | en_US | en_US
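dc.subject | 接受者操作特徵曲線 (receiver operating characteristic curve) | zh_TW
dc.subject | 接受者操作特徵曲線下面積 (area under the receiver operating characteristic curve) | zh_TW
dc.subject | 區域影響 (local influence) | zh_TW
dc.subject | 影響函數 (influence function) | zh_TW
dc.subject | 不平衡資料 (imbalanced data) | zh_TW
dc.subject | 累計提升表 (cumulative lift charts) | zh_TW
dc.subject | 主動學習 (active learning) | zh_TW
dc.subject | 表現測度 (performance measures) | zh_TW
dc.subject | 二元分類 (binary classification) | zh_TW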
dc.subject | Receiver operating characteristic (ROC) curve | en_US
dc.subject | Area under ROC curve (AUC) | en_US
dc.subject | Local influence | en_US
dc.subject | Influence function | en_US
dc.subject | Imbalanced data | en_US
dc.subject | Cumulative lift charts | en_US
dc.subject | Active learning | en_US
dc.subject | Performance measures | en_US
dc.subject | Binary classification | en_US
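dc.title | 分類模型中表現測度的模型診斷估計和探討及其於主動學習之應用 (Model diagnostics, estimation, and investigation of performance measures in classification models, and their application to active learning) | zh_TW
dc.title | On Model Diagnostics of Performance Measures in Classification Models and Its Application to Active Learning | en_US
dc.type | Thesis | en_US
dc.contributor.department | 統計學研究所 (Institute of Statistics) | zh_TW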
Appears in Collections: Thesis