A PROJECT REPORT
Submitted by
SHENBAGA PRIYA.B
09ITR105
SILAMBARASAN.R
09ITR108
VIGNESWARI.A
09ITR125
in partial fulfilment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY IN INFORMATION TECHNOLOGY
DEPARTMENT OF INFORMATION TECHNOLOGY
SCHOOL OF COMMUNICATION AND COMPUTER SCIENCES
KONGU ENGINEERING COLLEGE
(Autonomous)
PERUNDURAI ERODE – 638 052
APRIL 2013
ABSTRACT
Data analysis is the process of inspecting, cleaning, transforming and modeling data with the goal of highlighting useful information, suggesting conclusions and supporting decision making. It is particularly important in cloud computing, which allows large amounts of data to be processed over very large clusters. MapReduce is widely used to handle data in cloud and other distributed environments because of its excellent scalability and good fault tolerance. Compared to parallel databases, however, MapReduce is inefficient when it is used for complex data analysis that involves joining multiple data sets in order to compute aggregates. This report proposes a system called Map Join Reduce, which performs such complex analytical tasks more efficiently than the existing MapReduce approach. It introduces a filtering-join-aggregation model, an extension of MapReduce’s filtering-aggregation programming model. The system first applies filtering logic to the data sets, joins the qualifying records in a pipelined manner, then groups the join output and produces the final result. The significance of the proposal is that multiple data sets are joined and aggregated in one go, which reduces both the frequent checkpointing and the shuffling of intermediate results required by the existing system, and thereby improves the efficiency of data processing in distributed applications.
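The filtering-join-aggregation flow described in the abstract can be illustrated with a minimal, single-process Python sketch. This is not the Map Join Reduce implementation: the data sets (`customers`, `orders`), the field names and the helper functions `filter_stage`, `join_stage` and `aggregate_stage` are all hypothetical, chosen only to show the three stages chained in one pipelined pass without writing intermediate results out between them.

```python
from collections import defaultdict

# Hypothetical toy data sets; they do not come from the report.
customers = [
    {"cust_id": 1, "region": "ASIA"},
    {"cust_id": 2, "region": "EUROPE"},
    {"cust_id": 3, "region": "ASIA"},
]
orders = [
    {"order_id": 10, "cust_id": 1, "amount": 50.0},
    {"order_id": 11, "cust_id": 2, "amount": 70.0},
    {"order_id": 12, "cust_id": 3, "amount": 20.0},
    {"order_id": 13, "cust_id": 1, "amount": 5.0},
]


def filter_stage(records, predicate):
    """Filtering: lazily keep only the records that satisfy the predicate."""
    return (r for r in records if predicate(r))


def join_stage(left, right, key):
    """Join: build a hash table on the left input, stream the right input through it."""
    table = defaultdict(list)
    for left_row in left:
        table[left_row[key]].append(left_row)
    for right_row in right:
        # .get() avoids creating empty entries for unmatched keys.
        for left_row in table.get(right_row[key], []):
            yield {**left_row, **right_row}


def aggregate_stage(joined, group_key, value_key):
    """Aggregation: group the joined rows and sum one value per group."""
    totals = defaultdict(float)
    for row in joined:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


def filter_join_aggregate():
    """Chain the three stages; generators pass rows along without materializing them."""
    kept_customers = filter_stage(customers, lambda c: c["region"] == "ASIA")
    kept_orders = filter_stage(orders, lambda o: o["amount"] >= 10.0)
    joined = join_stage(kept_customers, kept_orders, "cust_id")
    return aggregate_stage(joined, "region", "amount")


result = filter_join_aggregate()
print(result)  # {'ASIA': 70.0}
```

In a plain MapReduce setting, the join and the aggregation would typically run as separate jobs with intermediate results written out and shuffled between them. The sketch only mirrors the logical flow of the proposed model, not its distributed execution.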
INTRODUCTION
In Information Technology, big data is a collection of data sets so large and complex that it becomes difficult to process using
References:
1. Afrati.F.N and Ullman.J.D. (2010) ‘Optimizing Joins in a Map-Reduce Environment’, Proc. 13th Int’l Conf. Extending Database Technology (EDBT ’10).
2. Chuck Lam. (2010) ‘Hadoop in Action’, Manning Publications.
3. Dawei Jiang, Anthony K. H. Tung, and Gang Chen. (2011) ‘MAP-JOIN-REDUCE: Toward Scalable and Efficient Data Analysis on Large Clusters’, IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 9.
4. Dean.J and Ghemawat.S. (2004) ‘MapReduce: Simplified Data Processing on Large Clusters’, Proc. Operating Systems Design and Implementation (OSDI), pp. 137-150.
5. Yang.H.C, Dasdan.A, Hsiao.R.L, and Parker.D.S. (2007) ‘Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters’, Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD ’07).