Predicting Dropout Student: An Application of Data Mining Methods in an Online Education Program

Open access


This study examined the prediction of student dropout in an online program using data mining approaches. Participants were selected from a total of 189 students who registered for the online Information Technologies Certificate Program in 2007-2009. Data were collected through online questionnaires (Demographic Survey, Online Technologies Self-Efficacy Scale, Readiness for Online Learning Questionnaire, Locus of Control Scale, and Prior Knowledge Questionnaire). The collected data comprised 10 variables: gender, age, educational level, previous online experience, occupation, self-efficacy, readiness, prior knowledge, locus of control, and dropout status as the class label (dropout/not). To classify dropout students, four data mining approaches were applied: k-Nearest Neighbour (k-NN), Decision Tree (DT), Naive Bayes (NB), and Neural Network (NN). The classifiers were trained and tested using 10-fold cross-validation. The detection sensitivities of the 3-NN, DT, NN, and NB classifiers were 87%, 79.7%, 76.8%, and 73.9%, respectively. In addition, a Genetic Algorithm (GA) based feature selection method identified online technologies self-efficacy, online learning readiness, and previous online experience as the most important factors in predicting dropout.
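The evaluation procedure described above (a 3-NN classifier scored by sensitivity under 10-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation, and the abstract does not say which software was used. The data are synthetic, and the function names (`make_synthetic`, `knn_predict`, `cross_val_predict`, `sensitivity`), the three features scaled to [0, 1], the sample size of 200, and the 40% dropout rate are all assumptions made for illustration only.

```python
import random
from collections import Counter


def sensitivity(y_true, y_pred, positive=1):
    """True-positive rate TP / (TP + FN) for the positive (dropout) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training rows closest to x (squared Euclidean)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def cross_val_predict(X, y, n_folds=10, k=3, seed=0):
    """Return one out-of-fold prediction per row using n_folds-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into n_folds roughly equal folds.
    folds = [idx[i::n_folds] for i in range(n_folds)]
    preds = [None] * len(X)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        # Each row is predicted only by a model that never saw it.
        for i in fold:
            preds[i] = knn_predict(train_X, train_y, X[i], k)
    return preds


def make_synthetic(n=200, seed=42):
    """Hypothetical data: three scores in [0, 1]; label 1 = dropout, 0 = not."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.4 else 0
        centre = 0.35 if label else 0.65  # assumed: dropouts score lower
        X.append([min(1.0, max(0.0, rng.gauss(centre, 0.15))) for _ in range(3)])
        y.append(label)
    return X, y


X, y = make_synthetic()
preds = cross_val_predict(X, y, n_folds=10, k=3)
print(f"3-NN sensitivity (synthetic data): {sensitivity(y, preds):.3f}")
```

Sensitivity counts only the dropout class, so it measures how many actual dropouts the classifier catches, not how accurate it is overall. The value printed here reflects the synthetic data and has no bearing on the figures reported in the study.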


