Implications of the Dirichlet assumption for discretization of continuous variables in naive Bayesian classifiers

doi:10.1023/A:1026367023636

Full metadata record

DC Field	Value	Language
dc.contributor.author	Hsu, CN	en_US
dc.contributor.author	Huang, HJ	en_US
dc.contributor.author	Wong, TT	en_US
dc.date.accessioned	2014-12-08T15:40:01Z	-
dc.date.available	2014-12-08T15:40:01Z	-
dc.date.issued	2003-12-01	en_US
dc.identifier.issn	0885-6125	en_US
dc.identifier.uri	http://dx.doi.org/10.1023/A:1026367023636	en_US
dc.identifier.uri	http://hdl.handle.net/11536/27341	-
dc.description.abstract	In a naive Bayesian classifier, discrete variables as well as discretized continuous variables are assumed to have Dirichlet priors. This paper describes the implications and applications of this model selection choice. We start by reviewing key properties of Dirichlet distributions. Among these properties, the most important one is "perfect aggregation," which allows us to explain why discretization works for a naive Bayesian classifier. Since perfect aggregation holds for Dirichlets, we can explain that in general, discretization can outperform parameter estimation assuming a normal distribution. In addition, we can explain why a wide variety of well-known discretization methods, such as entropy-based, ten-bin, and bin-log l, can perform well with insignificant difference. We designed experiments to verify our explanation using synthesized and real data sets and showed that in addition to well-known methods, a wide variety of discretization methods all perform similarly. Our analysis leads to a lazy discretization method, which discretizes continuous variables according to test data. The Dirichlet assumption implies that lazy methods can perform as well as eager discretization methods. We empirically confirmed this implication and extended the lazy method to classify set-valued and multi-interval data with a naive Bayesian classifier.	en_US
dc.language.iso	en_US	en_US
dc.subject	naive Bayesian classifiers	en_US
dc.subject	Dirichlet distributions	en_US
dc.subject	perfect aggregation	en_US
dc.subject	continuous variables	en_US
dc.subject	discretization	en_US
dc.subject	lazy discretization	en_US
dc.subject	interval data	en_US
dc.title	Implications of the Dirichlet assumption for discretization of continuous variables in naive Bayesian classifiers	en_US
dc.type	Article	en_US
dc.identifier.doi	10.1023/A:1026367023636	en_US
dc.identifier.journal	MACHINE LEARNING	en_US
dc.citation.volume	53	en_US
dc.citation.issue	3	en_US
dc.citation.spage	235	en_US
dc.citation.epage	263	en_US
dc.contributor.department	資訊工程學系	zh_TW
dc.contributor.department	Department of Computer Science	en_US
dc.identifier.wosnumber	WOS:000186206000002	-
dc.citation.woscount	18	-
Appears in Collections:	Articles

Files in This Item:

000186206000002.pdf

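The abstract singles out "perfect aggregation" as the property that explains why discretization works. As a rough illustration only, the sketch below checks the standard aggregation property of Dirichlet distributions on which that idea rests: summing components of a Dirichlet yields another Dirichlet whose parameters are the corresponding sums, so the estimated probability of a merged interval does not depend on how finely the interval was binned beforehand. The counts, the uniform prior, and the helper names `posterior_means` and `merge` are assumptions made for this example; the paper's own formal definition and experiments are not reproduced in this record.

```python
from fractions import Fraction


def posterior_means(counts, alphas):
    """Posterior mean of each bin probability under a Dirichlet prior."""
    total = sum(counts) + sum(alphas)
    return [Fraction(c + a, total) for c, a in zip(counts, alphas)]


def merge(values, groups):
    """Sum the entries of `values` within each group of bin indices."""
    return [sum(values[i] for i in g) for g in groups]


# Hypothetical counts of one class's training values in ten equal-width bins.
counts = [3, 7, 12, 9, 4, 2, 1, 0, 1, 1]
alphas = [1] * 10  # uniform Dirichlet prior (an assumption for this sketch)
groups = [[0, 1, 2], [3, 4], [5, 6, 7, 8, 9]]  # three coarser intervals

fine = posterior_means(counts, alphas)

# Route 1: estimate on the fine bins, then add up the bins in each interval.
summed = merge(fine, groups)

# Route 2: merge counts and prior parameters first, then estimate.
coarse = posterior_means(merge(counts, groups), merge(alphas, groups))

print(summed == coarse)  # True: both routes give the same interval estimates
```

Exact fractions are used instead of floats so that the equality check is not affected by rounding. The two routes agree only because the prior parameters are merged along with the counts; assigning a fresh uniform prior to the coarse bins would correspond to a different prior and would generally give different estimates.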