MapReduce & Hadoop


ACKNOWLEDGEMENT

I would like to take this opportunity to thank my research supervisor, family, and friends for their support and guidance, without which this research would not have been possible.

DECLARATION

I, [type your full first names and surname here], declare that the contents of this dissertation/thesis represent my own unaided work, and that the dissertation/thesis has not previously been submitted for academic examination towards any qualification. Furthermore, it represents my own opinions and not necessarily those of the University.

Signed __________________ Date _________________

TABLE OF CONTENTS

ACKNOWLEDGEMENT
DECLARATION
1. INTRODUCTION
2. OVERVIEW
2.1 Store data fail safe
2.1.1 How to increase the reading time: Parallel access
2.1.2 What about hardware failure?
2.2 Process the data faster
2.2.1 Seek time
2.2.2 Processing semi structure data
2.2.3 Normalization
2.2.4 Linear Scalability
2.3 What is Hadoop?
2.3.1 HDFS
3. BACKGROUND
3.1 Serial vs. Parallel Programming
3.2 The Basics
3.3 What is MapReduce?
3.4 MapReduce Execution Overview
3.5 MapReduce Examples
4. DESIGN AND IMPLEMENTATION
4.1 Map/Reduce Functions
4.2 DFS: Distributed File System
4.3 Job Execution
5. SYSTEM DESIGN OF HADOOP
5.1 MPMD Extension Architecture
5.2 Synchronization Extension Architecture
6. IMPLEMENTATION
6.1 MPMD Extension
6.2 Synchronization Extension
6.3 Performance Test
6.3.1 Cluster Configuration
6.3.2 Sort
REFERENCES

TABLE OF FIGURES

Figure 1: Map Reduce Dataflow
Figure 2: MapReduce Execution Overview
Figure 3: Representation of Map/Reduce DAG
Figure 4: Representation of Map/Reduce functions with DFS
Figure 5: HDFS Architecture
Figure 6: Hadoop Job Execution
Figure 7: MPMD Extension Architecture
Figure 8: Synchronization Extension Architecture
Figure 9: Implementation for MPMD Extension
Figure 10: Performance of running WordCount together with HadoopBlast using MPMD framework
Figure 11: Performance of MapReduce

1. INTRODUCTION

The ever-increasing amount of data over the years and the diversity of computations involved have increased the response time of applications, i.e. it takes much longer to process datasets containing vast amounts of data. To sustain a low response time, applications need to process these large datasets in less time. MapReduce is a programming model used for processing and generating large datasets. The model is based on the concept of parallel programming [17]. In a parallel program, the processing is broken up into parts, each of which can be executed concurrently. The instructions from each part run simultaneously on different CPUs. These CPUs can exist on a single machine, or they can be CPUs in a set of computers connected via a network.

The MapReduce model derives from the map and reduce combinators of a functional language such as Lisp. A MapReduce job usually splits the input dataset into smaller chunks, which are processed by the map tasks in a completely parallel manner [8]. The outputs of all the map tasks are then sorted by the framework and become the input to the reduce tasks. In other words, MapReduce is a programming model and an associated implementation for processing and generating large data sets: users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key [12].
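To make the model concrete, the following is a minimal sketch of the canonical word-count example written against Hadoop's org.apache.hadoop.mapreduce API. The map function emits an intermediate (word, 1) pair for every word it sees, and the reduce function sums all counts that share the same word; the class names used here (WordCount, TokenizerMapper, IntSumReducer) are illustrative only.

    // Minimal word-count sketch using Hadoop's org.apache.hadoop.mapreduce API.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

      // Map: emit an intermediate (word, 1) pair for every word in a line of input.
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);      // intermediate (word, 1) pair
          }
        }
      }

      // Reduce: sum all counts that share the same intermediate key (word).
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);      // final (word, total) pair
        }
      }
    }

Note that the sorting and grouping of the intermediate pairs between the two phases is performed entirely by the framework; the user code above only expresses the per-record map logic and the per-key reduce logic.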

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines [14]. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set ...
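As an illustration of how such a program is handed over to that run-time system, a minimal driver sketch is shown below. It assumes the Mapper and Reducer classes sketched earlier, and the input and output paths taken from the command line are purely illustrative; once the job is submitted, the framework splits the input, schedules the map tasks, shuffles and sorts the intermediate pairs, and runs the reduce tasks.

    // Minimal driver sketch; assumes the WordCount classes above, paths are illustrative.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Wire in the map and reduce classes from the sketch above;
        // the reducer also serves as a combiner since summation is associative.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The run-time system partitions this input and schedules the tasks.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }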