Predicting Win-Loss outcomes in MLB regular season games – A comparative study using data mining methods

C. Soto Valero 1
  • 1 Department of Computer Science, Universidad Central “Marta Abreu” de Las Villas, Cuba

Abstract

Baseball is a statistics-rich sport, and predicting the winner of a particular Major League Baseball (MLB) game is an interesting and challenging task. To date, there is no definitive formula for determining which factors lead a team to victory, but analysis of many years of historical records can reveal trends. Recent studies have concentrated on using and generating new statistics, called sabermetrics, to rank teams and players according to their perceived strengths and then applying these rankings to forecast specific games. In this paper, we employ sabermetric statistics to assess the predictive capabilities of four data mining methods (classification and regression based) for predicting outcomes (win or loss) in MLB regular season games. Our modeling approach uses only past data when making a prediction, drawing on ten years of publicly available data. We create a dataset of cumulative sabermetric statistics for each MLB team during this period, constructed so that data contamination is not possible. The inherent difficulty of this specific sports prediction task is confirmed using two geometry- or topology-based measures of data complexity. Results reveal that the classification scheme forecasts game outcomes better than the regression scheme and that, of the four data mining methods used, support vector machines (SVMs) produce the best results, with a mean prediction accuracy of nearly 60% for each team. The model is evaluated using stratified 10-fold cross-validation.
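As an illustration of the evaluation protocol named in the abstract, the following is a minimal sketch of stratified k-fold splitting in plain Python. It is not the author's implementation: the function names `stratified_k_fold` and `train_test_splits`, the seed, and the toy win/loss labels are all assumptions made for this example. It shows only how each fold can be made to preserve the overall win/loss ratio.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Split indices into k folds whose class proportions mirror `labels`.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)

    # Group sample indices by class label (e.g. 1 = win, 0 = loss).
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin, continuing where the previous class
        # stopped so that fold sizes stay balanced overall.
        for i, idx in enumerate(indices):
            folds[(offset + i) % k].append(idx)
        offset += len(indices)
    return folds


def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    folds = stratified_k_fold(labels, k, seed)
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(
            idx
            for fold_id, fold in enumerate(folds)
            if fold_id != held_out
            for idx in fold
        )
        yield train, test


# Toy labels (made-up data): 90 wins and 72 losses over 162 games.
labels = [1] * 90 + [0] * 72
folds = stratified_k_fold(labels, k=10)
splits = list(train_test_splits(labels, k=10))
```

With these toy labels, every fold receives the same number of wins and either seven or eight losses, so each held-out fold approximates the full-season win rate. A classifier such as an SVM would then be trained on each `train` index set and scored on the matching `test` set, with the ten accuracies averaged.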

