Predicting Dropout Student: An Application of Data Mining Methods in an Online Education Program

Open access


This study examined the prediction of dropouts in an online program using data mining approaches. The participants were drawn from a total of 189 students who registered for the online Information Technologies Certificate Program in 2007-2009. Data were collected through online questionnaires (Demographic Survey, Online Technologies Self-Efficacy Scale, Readiness for Online Learning Questionnaire, Locus of Control Scale, and Prior Knowledge Questionnaire). The collected data comprised 10 variables: gender, age, educational level, previous online experience, occupation, self-efficacy, readiness, prior knowledge, locus of control, and dropout status as the class label (dropout/not). To classify dropout students, four data mining approaches were applied: k-Nearest Neighbour (k-NN), Decision Tree (DT), Naive Bayes (NB), and Neural Network (NN). These classifiers were trained and tested using 10-fold cross-validation. The detection sensitivities of the 3-NN, DT, NN, and NB classifiers were 87%, 79.7%, 76.8%, and 73.9%, respectively. In addition, a Genetic Algorithm (GA) based feature selection method identified online technologies self-efficacy, online learning readiness, and previous online experience as the most important factors in predicting dropouts.
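To make the evaluation protocol concrete, the following is a minimal sketch of 3-NN classification scored by sensitivity under 10-fold cross-validation. It is not the authors' implementation: the function names, the toy data generator, the squared Euclidean distance, and the choice to pool predictions across folds before computing sensitivity are all assumptions made for illustration, and the synthetic data bear no relation to the study's 189 students.

```python
import random
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority vote among the k training rows nearest to `query`.

    Distance is squared Euclidean (an assumption; the abstract does not
    state which metric was used).
    """
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def sensitivity(y_true, y_pred, positive=1):
    """Sensitivity (recall) = TP / (TP + FN) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def cross_validated_sensitivity(data, k=3, folds=10, seed=0):
    """Shuffle once, split into `folds` parts, and pool out-of-fold predictions."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    y_true, y_pred = [], []
    for i in range(folds):
        test = rows[i::folds]  # every folds-th row, starting at offset i
        train = [r for j, r in enumerate(rows) if j % folds != i]
        for features, label in test:
            y_true.append(label)
            y_pred.append(knn_predict(train, features, k))
    return sensitivity(y_true, y_pred)


def make_toy_data(n=60, seed=1):
    """Hypothetical data: three features scaled to [0, 1]; label 1 = dropout."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        centre = 0.3 if label == 1 else 0.7
        features = tuple(
            min(1.0, max(0.0, rng.gauss(centre, 0.1))) for _ in range(3)
        )
        data.append((features, label))
    return data


if __name__ == "__main__":
    toy = make_toy_data()
    score = cross_validated_sensitivity(toy)
    print(f"Pooled 10-fold sensitivity of 3-NN on toy data: {score:.2f}")
```

Sensitivity is the natural headline metric here because the dropout class is the one an intervention would target: it measures the share of actual dropouts that the classifier flags.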

