Decision-Making Enhancement in a Big Data Environment: Application of the K-Means Algorithm to Mixed Data

Oded Koren 1, Carina Antonia Hallin 2, Nir Perel 1, and Dror Bendet 1
  • 1 School of Industrial Engineering and Management, Shenkar - Engineering. Design. Art, Ramat Gan, Israel
  • 2 Department of International Economics, Government and Business, Copenhagen Business School, Frederiksberg, Denmark

Abstract

Big data research has become an important discipline within information systems research. However, the flood of data generated on the Internet is increasingly unstructured and non-numeric, taking the form of images and text. Research therefore indicates a growing need for more efficient algorithms that handle mixed data in big data settings to support effective decision making. In this paper, we apply the classical K-means algorithm to both numeric and categorical attributes on big data platforms. We first present an algorithm that handles the problem of mixed data. We then implement the algorithm on big data platforms and demonstrate its functionality in a detailed end-to-end case study. This provides a solid basis for more targeted profiling in decision making and research that uses big data. Consequently, decision makers will be able to work with mixed data, that is, numerical and categorical data together, to explain and predict phenomena in the big data ecosystem. The case study demonstrates the capabilities and advantages of the suggested procedure, which can improve the decision-making process by targeting an organization's business requirements to specific cluster(s)/profile(s) based on the enhancement outcomes.
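
The abstract does not spell out how categorical attributes enter the K-means distance computation. As a rough illustration only, and not the authors' procedure, the sketch below assumes one common workaround: numeric attributes are min-max scaled, categorical attributes are one-hot encoded, and plain Lloyd-style K-means is run on the resulting vectors. The record fields (`age`, `income`, `segment`), the helper names, and the sample values are all hypothetical.

```python
import random


def encode(records, numeric_keys, categorical_keys):
    """Turn mixed records into numeric vectors.

    Assumption for this sketch: numeric attributes are min-max scaled to
    [0, 1] and categorical attributes are one-hot encoded.
    """
    lo = {k: min(r[k] for r in records) for k in numeric_keys}
    hi = {k: max(r[k] for r in records) for k in numeric_keys}
    levels = {k: sorted({r[k] for r in records}) for k in categorical_keys}
    vectors = []
    for r in records:
        v = []
        for k in numeric_keys:
            span = hi[k] - lo[k]
            v.append((r[k] - lo[k]) / span if span else 0.0)
        for k in categorical_keys:
            v.extend(1.0 if r[k] == level else 0.0 for level in levels[k])
        vectors.append(v)
    return vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=100, seed=0):
    """Classical K-means: assign to nearest centroid, then recompute means."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical mixed records: two numeric attributes, one categorical.
records = [
    {"age": 22, "income": 18000, "segment": "student"},
    {"age": 25, "income": 21000, "segment": "student"},
    {"age": 64, "income": 52000, "segment": "retired"},
    {"age": 61, "income": 50000, "segment": "retired"},
]
vectors = encode(records, ["age", "income"], ["segment"])
labels, centroids = kmeans(vectors, k=2)
print(labels)
```

One-hot encoding lets the unmodified algorithm run on mixed data, but it weights every category mismatch equally and enlarges the vectors as the number of category levels grows, which is one reason dedicated mixed-data variants of K-means exist.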


