Title: A Study of Naive Bayesian Classifiers for Nondiscrete and Aggregate Data
Authors: Hung-Ju Huang
Chun-Nan Hsu
Chia-Hoang Lee
Department: Institute of Computer Science and Engineering
Keywords: Naive Bayesian classifier; Continuous variable; Interval query; Homologous set; Speaker recognition
Issue Date: 2001
Abstract: Naive Bayes is a simple and useful classification tool. It is most commonly used in situations where all variables are discrete, because it is difficult for naive Bayes to model complex probability densities over nondiscrete data such as continuous variables. This thesis describes how to use naive Bayes to classify several common types of nondiscrete and aggregate data. For continuous variables, we show that, in general, discretization can outperform parameter estimation under a normality assumption. Our analysis also explains why a wide variety of well-known discretization methods perform comparably well for naive Bayes. This analysis leads to a lazy discretization method, which discretizes continuous variables dynamically according to the test data. The method not only discretizes continuous variables effectively but also allows naive Bayes to classify set-valued, interval, and multi-interval queries. For aggregate data, we address the problem of classifying a set of query vectors known to belong to the same unknown class. Such sets, which we call homologous sets, arise in many application domains. We investigate how to exploit the knowledge that all vectors in a homologous set share the same unknown class to improve classification accuracy over classifying each query vector individually. Our method, called homologous naive Bayes (HNB), extends naive Bayes with a modified classification procedure that classifies the entire homologous set as a single unit. Compared with a common voting method and several other variants of naive Bayes classification, HNB significantly outperforms these methods on a variety of test data sets, even when the number of query vectors in a homologous set is small. We also report a successful application of HNB to speaker recognition, although its applicability is not limited to speaker recognition systems.
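The lazy discretization idea described in the abstract can be illustrated with a minimal sketch: at query time, an interval is formed around each test value of a continuous attribute, and the class-conditional probability of falling into that interval is estimated from the training data. This is only a rough illustration of the principle, not the thesis's exact algorithm; in particular, the fixed interval `width` and the Laplace smoothing used here are our own simplifying assumptions.

```python
import numpy as np

def lazy_nb_classify(X_train, y_train, x_test, width=0.5):
    """Classify one test vector with a lazily discretized naive Bayes.

    For each continuous attribute, an interval centred on the test value
    is formed at query time (the fixed width is an illustrative
    assumption), and P(value falls in interval | class) is estimated
    from the training data with Laplace smoothing.
    """
    classes = np.unique(y_train)
    log_post = {}
    for c in classes:
        Xc = X_train[y_train == c]
        score = np.log(len(Xc) / len(X_train))          # log prior
        for j, v in enumerate(x_test):
            lo, hi = v - width / 2, v + width / 2       # lazy interval
            n_in = np.sum((Xc[:, j] >= lo) & (Xc[:, j] <= hi))
            score += np.log((n_in + 1) / (len(Xc) + 2))  # smoothed estimate
        log_post[c] = score
    return max(log_post, key=log_post.get)
```

Because the cut points depend on the test value, no global discretization of the training data is needed, which is what makes the interval- and multi-interval-query extensions mentioned above natural in this setting.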
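The core of the HNB idea — treating a homologous set as a single classification unit rather than voting on per-instance decisions — can be sketched as follows. Under naive Bayes, if all vectors in the set share one unknown class, their log-likelihoods simply accumulate under each candidate class before the argmax. This sketch assumes discrete attributes with Laplace smoothing; the function and variable names are ours, not the thesis's notation.

```python
import numpy as np

def hnb_classify(X_train, y_train, H):
    """Classify a homologous set H of discrete query vectors as one unit.

    All vectors in H are assumed to come from the same unknown class, so
    each candidate class gets one shared log prior plus the summed
    log-likelihood of every vector in H (Laplace-smoothed counts).
    """
    classes = np.unique(y_train)
    scores = {}
    for c in classes:
        Xc = X_train[y_train == c]
        score = np.log(len(Xc) / len(X_train))   # one shared log prior
        for x in H:                              # evidence accumulates
            for j, v in enumerate(x):
                n_vals = len(np.unique(X_train[:, j]))
                count = np.sum(Xc[:, j] == v)
                score += np.log((count + 1) / (len(Xc) + n_vals))
        scores[c] = score
    return max(scores, key=scores.get)
```

A per-instance voting scheme discards how confident each individual decision was; summing log-likelihoods instead lets many weakly informative samples reinforce one another, which is consistent with the abstract's claim that the gain appears even for small homologous sets.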
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT900394057
http://hdl.handle.net/11536/68583
Appears in Collections: Thesis