21st International Conference on Machine Learning -- 2004

Learning Bayesian Network Classifiers by Maximizing Conditional Likelihood

AUTHORS: Daniel Grossman and Pedro Domingos

Abstract

Bayesian networks are a powerful probabilistic representation, and their use for classification has received considerable attention. However, they tend to perform poorly as classifiers when learned in the standard way. This is attributable to a mismatch between the objective function used (likelihood or a function thereof) and the goal of classification (maximizing accuracy or conditional likelihood). Unfortunately, the computational cost of optimizing both structure and parameters for conditional likelihood is prohibitive. In this paper we show that a simple approximation -- choosing structures by maximizing conditional likelihood while setting parameters by maximum likelihood -- yields good results. On a large suite of benchmark datasets, this approach produces better class probability estimates than naive Bayes, TAN, and generatively trained Bayesian networks.
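To make the mismatch between the two objectives concrete, the sketch below computes both the log-likelihood, sum of log P(x, y), and the conditional log-likelihood, sum of log P(y | x), for a naive Bayes model on a toy dataset. It is an illustration only, not the authors' implementation or their structure search: the function names, the toy data, and the Laplace smoothing parameter `alpha` (added so that no estimated probability is zero) are all assumptions introduced here.

```python
import math
from collections import Counter

def fit_naive_bayes(rows, labels, alpha=1.0):
    """Frequency-based parameter estimates for naive Bayes over discrete attributes.

    alpha is a hypothetical Laplace smoothing constant (an assumption of this
    sketch); it keeps every estimated probability strictly positive.
    """
    classes = sorted(set(labels))
    n_attrs = len(rows[0])
    values = [sorted({r[j] for r in rows}) for j in range(n_attrs)]
    class_counts = Counter(labels)
    prior = {
        c: (class_counts[c] + alpha) / (len(labels) + alpha * len(classes))
        for c in classes
    }
    joint = Counter((j, r[j], y) for r, y in zip(rows, labels) for j in range(n_attrs))
    cond = {
        (j, v, c): (joint[(j, v, c)] + alpha) / (class_counts[c] + alpha * len(values[j]))
        for j in range(n_attrs) for v in values[j] for c in classes
    }
    return classes, prior, cond

def log_joint(model, row, c):
    """log P(x, y=c) under the naive Bayes factorization."""
    _, prior, cond = model
    return math.log(prior[c]) + sum(math.log(cond[(j, v, c)]) for j, v in enumerate(row))

def log_likelihood(model, rows, labels):
    """Generative objective: sum over examples of log P(x, y)."""
    return sum(log_joint(model, r, y) for r, y in zip(rows, labels))

def conditional_log_likelihood(model, rows, labels):
    """Discriminative objective: sum over examples of log P(y | x)."""
    classes = model[0]
    total = 0.0
    for r, y in zip(rows, labels):
        scores = {c: log_joint(model, r, c) for c in classes}
        m = max(scores.values())
        # log P(x) = log sum_c P(x, c), computed stably with log-sum-exp
        log_evidence = m + math.log(sum(math.exp(s - m) for s in scores.values()))
        total += scores[y] - log_evidence
    return total

# Toy data (made up for illustration): two binary attributes, binary class.
rows = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (1, 1)]
labels = [0, 0, 1, 1, 0, 1]

model = fit_naive_bayes(rows, labels)
ll = log_likelihood(model, rows, labels)
cll = conditional_log_likelihood(model, rows, labels)
print(f"log-likelihood = {ll:.4f}, conditional log-likelihood = {cll:.4f}")
```

Because log P(x, y) = log P(y | x) + log P(x), the log-likelihood equals the conditional log-likelihood plus a term that depends only on how well the model fits the attributes. Only the first term matters for class probability estimates, which is why a structure that scores well on likelihood need not score well on conditional likelihood.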

Keywords:
Bayesian classification, Bayesian networks, conditional likelihood, discriminative learning

This page last updated: Tuesday, October 12, 2004.