A New Big Data Model Using Distributed Cluster-Based Resampling for Class-Imbalance Problem

Abstract

The class imbalance problem, one of the most common data irregularities, leads to models in which minority classes are under-represented. To address this issue, the present study proposes a new cluster-based MapReduce design, entitled Distributed Cluster-based Resampling for Imbalanced Big Data (DIBID). The design modifies the existing dataset in order to improve classification performance. In the study, DIBID was applied to public datasets under two strategies. The first strategy evaluates the model on datasets with different imbalance ratios. The second strategy compares the model with other imbalanced big data solutions in the literature. According to the results, DIBID outperformed the other imbalanced big data solutions and increased area under the curve (AUC) values by between 10 % and 24 % in the case study.
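The two quantities the abstract relies on, the imbalance ratio of a dataset and the area under the ROC curve of a classifier, can be illustrated with a minimal sketch. It is not part of DIBID itself: the function names, the toy labels, and the toy scores are assumptions made for illustration, and the imbalance ratio is taken here to mean the majority-class count divided by the minority-class count.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def auc(labels, scores, positive=1):
    """Area under the ROC curve via the pairwise-ranking (Mann-Whitney) form.

    Counts the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 8 majority examples (0) and 2 minority examples (1).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.2, 0.6, 0.1, 0.3, 0.7, 0.5]

print(imbalance_ratio(labels))  # 4.0
print(auc(labels, scores))      # 0.9375
```

The pairwise form is quadratic in the number of examples, so it only suits small illustrations; for large datasets a sort-based computation would be the usual choice.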
