21st International Conference on Machine Learning -- 2004
Learning Bayesian Network Classifiers
by Maximizing Conditional Likelihood
AUTHORS: Daniel Grossman and Pedro Domingos
Abstract
Bayesian networks are a powerful probabilistic representation, and
their use for classification has received considerable attention.
However, they tend to perform poorly when learned in the standard
way. This is attributable to a mismatch between the objective function
used (likelihood or a function thereof) and the goal of classification
(maximizing accuracy or conditional likelihood). Unfortunately, the
computational cost of optimizing structure and parameters for
conditional likelihood is prohibitive. In this paper we show that a
simple approximation -- choosing structures by maximizing conditional
likelihood while setting parameters by maximum likelihood -- yields
good results. On a large suite of benchmark datasets, this approach
produces better class probability estimates than naive Bayes, TAN, and
generatively trained Bayesian networks.
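The approximation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes discrete attributes encoded as small integers, a greedy arc-adding search that starts from naive Bayes, Laplace smoothing (alpha) on top of the frequency counts to avoid log(0), and scoring on the training data itself. None of these choices is stated in the abstract, and all function names are hypothetical. Parameters come from counts (maximum likelihood), while candidate structures are compared by conditional log-likelihood, the sum over examples of log P(class | attributes).

```python
import math
from collections import Counter

def fit(X, y, parents):
    """Collect the frequency counts that define the (smoothed) ML parameters."""
    class_counts, joint, ctx = Counter(y), Counter(), Counter()
    for row, c in zip(X, y):
        for j, v in enumerate(row):
            pv = tuple(row[p] for p in parents[j])  # values of attribute parents
            joint[(j, v, c, pv)] += 1
            ctx[(j, c, pv)] += 1
    return class_counts, joint, ctx

def cll(X, y, parents, alpha=1.0):
    """Conditional log-likelihood: sum_i log P(y_i | x_i) under the fitted network."""
    n, d = len(X), len(X[0])
    n_classes = max(y) + 1
    n_values = [max(row[j] for row in X) + 1 for j in range(d)]
    class_counts, joint, ctx = fit(X, y, parents)

    def log_joint(row, c):
        # log P(c) + sum_j log P(x_j | c, attribute parents of x_j)
        lp = math.log((class_counts[c] + alpha) / (n + alpha * n_classes))
        for j, v in enumerate(row):
            pv = tuple(row[p] for p in parents[j])
            lp += math.log((joint[(j, v, c, pv)] + alpha)
                           / (ctx[(j, c, pv)] + alpha * n_values[j]))
        return lp

    total = 0.0
    for row, c in zip(X, y):
        logs = [log_joint(row, k) for k in range(n_classes)]
        m = max(logs)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logs))  # log-sum-exp
        total += logs[c] - log_norm
    return total

def creates_cycle(parents, i, j):
    """Adding arc i -> j creates a cycle iff j is i itself or an ancestor of i."""
    stack, seen = [i], set()
    while stack:
        node = stack.pop()
        if node == j:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def greedy_structure(X, y, max_parents=2, alpha=1.0):
    """Hill-climb from naive Bayes, adding the arc that most improves CLL."""
    d = len(X[0])
    parents = {j: () for j in range(d)}  # the class is an implicit parent of all
    best = cll(X, y, parents, alpha)
    while True:
        best_move = None
        for i in range(d):
            for j in range(d):
                if (i in parents[j] or len(parents[j]) >= max_parents
                        or creates_cycle(parents, i, j)):
                    continue
                cand = dict(parents)
                cand[j] = parents[j] + (i,)
                score = cll(X, y, cand, alpha)
                if score > best + 1e-9:
                    best, best_move = score, cand
        if best_move is None:
            return parents, best
        parents = best_move

# Toy usage: the class is the XOR of two binary attributes, which naive
# Bayes cannot represent but a single added arc between attributes can.
X = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
y = [a ^ b for a, b in X]
structure, score = greedy_structure(X, y)
```

On this toy data the search adds a single arc between the two attributes and the conditional log-likelihood rises above the naive Bayes value. Scoring on held-out data instead of the training set would be a natural variation to guard against overfitting.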
Keywords:
Bayesian classification, Bayesian networks, conditional likelihood, discriminative learning
Please download our paper
here: Bayesian Network Classifiers
This page last updated: Tuesday, October 12, 2004.