A BigBench Implementation on the Hadoop Ecosystem
Badrul Chowdhury, Tilmann Rabl, and Hans-Arno Jacobsen
Middleware Systems Research Group
University of Toronto
badrul.chowdhury@mail.utoronto.ca, tilmann.rabl@utoronto.ca, jacobsen@eecg.toronto.edu
http://msrg.org

Abstract. BigBench is the first proposal for an end-to-end big data analytics benchmark. It features a rich query set with complex, realistic queries. BigBench was developed based on the decision support benchmark TPC-DS. The first proof-of-concept implementation was built for the Teradata Aster parallel database system, and the queries were formulated in the proprietary SQL-MR query language. To test other systems, the queries have to be translated.
In this paper, an alternative implementation of BigBench for the Hadoop ecosystem is presented. All 30 queries of BigBench were realized with Apache Hive, Apache Hadoop, Apache Mahout, and NLTK. We present the different design choices we took and show a performance evaluation.

1 Introduction

Big data analytics is an ever-growing field of research and business. Due to the drastic decrease in the cost of storage and computation, more and more data sources become profitable for data mining. Online stores are a perfect example: while earlier online shopping systems would only record successful transactions, modern systems record every single interaction of a user with the website. The former allowed only for simple basket analysis techniques, while the current level of detail in monitoring makes detailed user modeling possible.
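To make the contrast concrete, the "simple basket analysis" mentioned above can be sketched as pairwise co-occurrence counting over completed transactions. The following is a minimal illustrative sketch, not from the paper; the transaction data and the `frequent_pairs` helper are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction log: each entry is the set of items in one
# successful checkout (the only data older shop systems recorded).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]

def frequent_pairs(baskets, min_support=2):
    """Count how often each item pair occurs in the same basket and
    keep only pairs meeting the minimum support threshold."""
    counts = Counter()
    for basket in baskets:
        # Sort so each unordered pair is counted under one canonical key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

print(frequent_pairs(transactions))
# → {('bread', 'butter'): 2, ('bread', 'milk'): 3}
```

Detailed user modeling, by contrast, would consume every click, search, and page view rather than only the final basket, which is exactly the jump in data volume that motivates benchmarks like BigBench.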
The growing demands on data management systems and the new forms of analysis have led to the development of a new breed of systems: big data management systems (BDMSs). Similar to the advent of database management systems, there is a vastly growing ecosystem of diverse approaches. This leads to a dilemma for customers of BDMSs, since there are no realistic and proven measures to compare different offerings. To this …



