MapReduce Case Study Solution
MapReduce is a widely used parallel computing framework for large-scale data processing. The two major performance metrics in MapReduce are job execution time and cluster throughput. Both can be seriously degraded by straggler machines, that is, machines on which tasks take an unusually long time to finish. Speculative execution is a common approach to the straggler problem: slow-running tasks are simply backed up on alternative machines. Multiple speculative execution strategies have been proposed, but they have some pitfalls: i) they use the average progress rate to identify slow tasks, while in reality the progress rate can be unstable and misleading, and ii) they cannot appropriately handle the situation where there is data skew among the tasks, …
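To make the first pitfall concrete, the sketch below (plain Python with assumed names and thresholds, not Hadoop's actual scheduler code) shows the average-progress-rate heuristic: a task is flagged as slow when its progress rate falls well below the mean. Under data skew, a task working on a larger split looks slow by this measure even though its machine is healthy.

from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    progress: float    # fraction of work completed, 0.0 .. 1.0
    elapsed: float     # seconds since the task started

def find_slow_tasks(tasks, slow_factor=0.5):
    # Flag tasks whose progress rate is well below the average rate.
    rates = {t.task_id: t.progress / t.elapsed for t in tasks if t.elapsed > 0}
    avg_rate = sum(rates.values()) / len(rates)
    return [tid for tid, rate in rates.items() if rate < slow_factor * avg_rate]

# Task m2 is processing a skewed (larger) split, so its rate looks low even
# though its machine is fine; the heuristic still marks it for backup.
tasks = [Task("m0", 0.9, 30), Task("m1", 0.8, 30), Task("m2", 0.3, 30)]
print(find_slow_tasks(tasks))   # -> ['m2']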
In a typical MapReduce job, the master divides the input files into multiple map tasks and then schedules both map tasks and reduce tasks to worker nodes in a cluster to achieve parallel processing. When a machine takes an unusually long time to complete a task (the so-called straggler machine), it delays the job execution time (the time from when the job is initialized to when it is retired) and significantly degrades the cluster throughput (the number of jobs completed per second in the cluster). This problem is handled via speculative execution: a slow task is backed up on an alternative machine in the hope that the backup copy finishes faster. Google simply backs up the last few running map or reduce tasks and has observed that speculative execution decreases job execution time by 44 percent [1]. Due to these significant performance gains, speculative execution is also implemented in Hadoop [2] and Microsoft Dryad [3] to deal with the straggler problem.
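As a rough illustration of the mechanism (a toy model with made-up node names, not Google's or Hadoop's scheduler), the following sketch launches a backup copy of a straggling task on another machine and accepts whichever copy finishes first.

import threading, time, random

def run_task(task_id, node, results, lock):
    # Simulate a straggler machine by sleeping much longer on it.
    delay = random.uniform(2.0, 4.0) if node == "straggler-node" else random.uniform(0.2, 0.5)
    time.sleep(delay)
    with lock:
        results.setdefault(task_id, node)   # the first copy to finish wins

results, lock = {}, threading.Lock()
original = threading.Thread(target=run_task, args=("map-7", "straggler-node", results, lock))
backup   = threading.Thread(target=run_task, args=("map-7", "backup-node", results, lock))
original.start(); backup.start()
original.join(); backup.join()
print("map-7 result taken from:", results["map-7"])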
It is indeed able to do so, particularly for the map and shuffle/merge phases, which, as shown in Fig. 1, begin to overlap after a brief initialization period. For the correctness of the MapReduce programming model, however, it is necessary to ensure that the reduce phase does not start until the map phase has completed for all data splits. As a result, the pipeline shown in Fig. 1 contains an implicit serialization between the shuffle/merge and reduce phases.
Fig. 1 Serialization between shuffle/merge and reduce phases.
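The implicit serialization in Fig. 1 can be thought of as a barrier: no matter how much the map and shuffle/merge phases overlap, reduce work is blocked until every map output has been merged. The toy model below (assumed structure, plain Python threads, not Hadoop's actual implementation) makes that explicit.

import threading

NUM_MAPS = 4
all_map_outputs_merged = threading.Barrier(NUM_MAPS + 1)  # map threads + the reducer

def map_and_shuffle(i):
    # ... run the map function, then shuffle/merge its output to the reducer ...
    print(f"map task {i}: output shuffled and merged")
    all_map_outputs_merged.wait()      # reducer cannot proceed before this point

def reduce_phase():
    all_map_outputs_merged.wait()      # the implicit serialization point of Fig. 1
    print("reduce phase: starts only after ALL map outputs are merged")

threads = [threading.Thread(target=map_and_shuffle, args=(i,)) for i in range(NUM_MAPS)]
threads.append(threading.Thread(target=reduce_phase))
for t in threads:
    t.start()
for t in threads:
    t.join()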
1.2.2 Repetitive Merges and Disk Access
Hadoop ReduceTasks merge data segments when the number of segments or their total size exceeds a threshold. However, the current merge algorithm in Hadoop often leads to repetitive merges and thus extra disk accesses. Fig. 2 shows a common sequence of merge operations in Hadoop.
Altogether, these repetitive merges and extra disk accesses degrade Hadoop's performance. An alternative merge algorithm is therefore critical for Hadoop to mitigate their impact.
Fig. 2 Repetitive merges.
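The behavior in Fig. 2 can be approximated with a short sketch. The parameter name io_sort_factor is borrowed from Hadoop's configuration, but the merge logic here is a simplified assumption rather than Hadoop's real merge code: whenever the number of buffered segments crosses the threshold, everything buffered is merged into a new on-disk segment, and that merged segment is later read and merged again, which is exactly the extra disk traffic described above.

io_sort_factor = 3      # merge whenever more than 3 segments are buffered
segments = []           # sizes (MB) of buffered map-output segments
extra_disk_mb = 0       # data written to disk more than once

def add_segment(size_mb):
    """Buffer a new segment; merge everything when the threshold is crossed."""
    global segments, extra_disk_mb
    segments.append(size_mb)
    if len(segments) > io_sort_factor:
        merged = sum(segments)
        # Everything except the newly arrived segment was already on disk,
        # so re-writing it during this merge is repetitive disk traffic.
        extra_disk_mb += merged - size_mb
        print(f"merge {segments} -> one {merged} MB segment")
        segments = [merged]

for size in [10, 12, 8, 9, 11, 7, 6, 5]:
    add_segment(size)

print("MB re-written due to repetitive merges:", extra_disk_mb)

In this trace the first merged segment is itself re-merged when later segments arrive; a merge algorithm that touches each byte only once would eliminate the extra traffic counted above.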
1.2.3 The Lack of Network
