Decision-Making Enhancement in a Big Data Environment: Application of the K-Means Algorithm to Mixed Data

Open access


Big data research has become an important discipline in information systems research. However, the flood of data generated on the Internet is increasingly unstructured and non-numeric, taking the form of images and texts. Research therefore indicates a growing need for more efficient algorithms that handle mixed data in big data settings to support effective decision making. In this paper, we apply the classical K-means algorithm to both numeric and categorical attributes on big data platforms. We first present an algorithm that handles the problem of mixed data. We then implement the algorithm on big data platforms and demonstrate its functionality in a detailed, end-to-end case study. This provides a solid basis for more targeted profiling in decision making and research using big data. Consequently, decision makers will be able to treat mixed (numerical and categorical) data to explain and predict phenomena in the big data ecosystem. The case study demonstrates the capabilities and advantages of the suggested procedure, which improves the decision-making process by targeting organizations' business requirements to specific cluster(s)/profile(s) based on the enhancement outcomes.
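
To make the notion of clustering mixed data concrete, the following is a minimal sketch and not the procedure presented in the paper. It assumes a k-prototypes-style dissimilarity (squared Euclidean distance on numeric attributes plus a weighted count of categorical mismatches), with numeric attributes already scaled to a comparable range. The function names, the weight `gamma`, the seeding of centers from the first `k` rows, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def mixed_distance(row, center, num_idx, cat_idx, gamma=1.0):
    """Squared Euclidean distance on numeric fields plus gamma times the
    number of mismatching categorical fields (k-prototypes style)."""
    numeric = sum((row[i] - center[i]) ** 2 for i in num_idx)
    categorical = sum(row[i] != center[i] for i in cat_idx)
    return numeric + gamma * categorical


def update_center(rows, num_idx, cat_idx):
    """Mean for numeric fields, most frequent value (mode) for categorical ones."""
    center = list(rows[0])
    for i in num_idx:
        center[i] = sum(r[i] for r in rows) / len(rows)
    for i in cat_idx:
        center[i] = Counter(r[i] for r in rows).most_common(1)[0][0]
    return tuple(center)


def cluster_mixed(rows, k, num_idx, cat_idx, gamma=1.0, max_iter=20):
    """Lloyd-style iterations. The first k rows seed the centers, an
    assumption made for determinism, not a recommended initialisation."""
    centers = [tuple(r) for r in rows[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each row goes to its closest center.
        labels = [
            min(range(k),
                key=lambda c: mixed_distance(r, centers[c], num_idx, cat_idx, gamma))
            for r in rows
        ]
        # Update step: recompute each center from its members
        # (an empty cluster keeps its previous center).
        new_centers = []
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            new_centers.append(
                update_center(members, num_idx, cat_idx) if members else centers[c]
            )
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical records: (scaled numeric attribute, categorical attribute).
rows = [
    (0.10, "retail"),
    (0.90, "corporate"),
    (0.20, "retail"),
    (0.80, "corporate"),
    (0.15, "retail"),
]
labels, centers = cluster_mixed(rows, k=2, num_idx=[0], cat_idx=[1])
```

In this sketch, `gamma` controls how strongly a categorical mismatch counts relative to numeric distance; choosing it, and scaling the numeric attributes, are modelling decisions that the sketch leaves open.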
