Ecosystem
Badrul Chowdhury, Tilmann Rabl, and Hans-Arno Jacobsen
Middleware Systems Research Group
University of Toronto
badrul.chowdhury@mail.utoronto.ca, tilmann.rabl@utoronto.ca, jacobsen@eecg.toronto.edu
http://msrg.org
Abstract. BigBench is the first proposal for an end-to-end big data analytics benchmark. It features a rich query set with complex, realistic queries. BigBench was developed based on the decision support benchmark TPC-DS. The first proof-of-concept implementation was built for the Teradata Aster parallel database system, and the queries were formulated in the proprietary SQL-MR query language. To test other systems, the queries have to be translated.
In this paper, an alternative implementation of BigBench for the Hadoop ecosystem is presented. All 30 queries of BigBench were realized with
Apache Hive, Apache Hadoop, Apache Mahout, and NLTK. We present the design choices we made and show a performance evaluation.
1 Introduction
Big data analytics is an ever-growing field of research and business. Due to the drastic decrease in the cost of storage and computation, more and more data sources become profitable for data mining. A perfect example is online stores: while earlier online shopping systems would only record successful transactions, modern systems record every single interaction of a user with the website. The former allowed only for simple basket analysis techniques, while the current level of detail in monitoring makes detailed user modeling possible.
The growing demands on data management systems and the new forms of analysis have led to the development of a new breed of systems, big data management systems (BDMSs). Similar to the advent of database management systems, there is a rapidly growing ecosystem of diverse approaches. This leads to a dilemma for customers of BDMSs, since there are no realistic and proven measures to compare different offerings. To this