Performance Analysis of Two Big Data Technologies on a Cloud Distributed Architecture. Results for Non-Aggregate Queries on Medium-Sized Data

Open access

Abstract

Big Data systems manage and process huge volumes of data that are constantly generated by various technologies in a myriad of formats. Big Data advocates (and preachers) have claimed that Big Data technologies such as NoSQL, Hadoop and in-memory data stores perform better than classical relational/SQL database management systems. This paper compares the data processing performance of two systems, one from the SQL camp (PostgreSQL/Postgres XL) and one from the Big Data camp (Hadoop/Hive), on a distributed five-node cluster deployed in the cloud. Rather than relying on the benchmarks in common use (YCSB, TPC), a series of R modules was devised to generate random non-aggregate queries on different subschemas (of increasing data size) of the TPC-H database. First, the overall performance of the two systems was compared. Subsequently, a number of models were developed to relate performance to the system and to various query parameters, such as the number of attributes in the SELECT and WHERE clauses, the number of joins and the number of processed rows.
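The query-generation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' R code: it is written in Python, and the trimmed schema, the `random_query` function, its parameters and the predicate form (`column > constant`) are all assumptions made for illustration. Only the table and column names are taken from the public TPC-H schema.

```python
import random

# Hypothetical, heavily trimmed slice of the TPC-H schema (illustration only).
COLUMNS = {
    "customer": ["c_custkey", "c_name", "c_nationkey", "c_acctbal"],
    "orders": ["o_orderkey", "o_custkey", "o_totalprice", "o_orderdate"],
    "lineitem": ["l_orderkey", "l_linenumber", "l_quantity", "l_extendedprice"],
    "nation": ["n_nationkey", "n_name", "n_regionkey"],
}
# Foreign-key links that can serve as join conditions.
JOINS = [
    ("orders", "customer", "o_custkey = c_custkey"),
    ("lineitem", "orders", "l_orderkey = o_orderkey"),
    ("customer", "nation", "c_nationkey = n_nationkey"),
]
# Columns on which a simple numeric filter makes sense (an assumption).
NUMERIC = {"c_acctbal", "o_totalprice", "l_quantity", "l_extendedprice"}


def random_query(rng, n_joins, n_select, n_where):
    """Build one random non-aggregate SELECT; return (sql, parameters)."""
    # Grow a connected set of tables by following foreign keys.
    tables = [rng.choice(sorted(COLUMNS))]
    conditions = []
    for _ in range(n_joins):
        # A link is usable if exactly one of its two tables is already chosen.
        candidates = [j for j in JOINS if (j[0] in tables) != (j[1] in tables)]
        if not candidates:
            break
        left, right, cond = rng.choice(candidates)
        tables.append(right if left in tables else left)
        conditions.append(cond)
    n_join_conds = len(conditions)

    # Pick the projected attributes from the chosen tables.
    available = [c for t in tables for c in COLUMNS[t]]
    select_cols = rng.sample(available, min(n_select, len(available)))

    # Add simple filter predicates on numeric columns only.
    numeric = [c for c in available if c in NUMERIC]
    for col in rng.sample(numeric, min(n_where, len(numeric))):
        conditions.append(f"{col} > {rng.randint(1, 1000)}")

    sql = f"SELECT {', '.join(select_cols)} FROM {', '.join(tables)}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    params = {
        "joins": n_join_conds,
        "select_attrs": len(select_cols),
        "where_preds": len(conditions) - n_join_conds,
    }
    return sql, params
```

Calling `random_query(random.Random(7), n_joins=2, n_select=3, n_where=1)` returns a SQL string together with the parameters that describe it. Recording those parameters next to the measured execution time on each system is one plausible way to obtain the kind of data set on which performance models like those mentioned above could be fitted.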


