MapReduce
What is MapReduce?
MapReduce is a programming model and processing technique for distributed parallel processing of large data sets.
The model consists of two tasks: Map and Reduce. The mapper transforms split input data into intermediate key-value pairs, and the reducer aggregates those pairs after the framework shuffles and sorts them by key.
Data goes through the following phases in MapReduce technique:
1. Input Split - The input data, a file or directory stored in HDFS, is divided into fixed-size logical units called input splits when a job is submitted to Hadoop. A RecordReader (supplied by TextInputFormat by default) transforms each input split into key-value pairs.
2. Mapping - The mapper processes each input key-value pair and emits zero or more intermediate key-value pairs.
3. Shuffling - Shuffling consists of two steps: sorting and merging. The intermediate key-value pairs are sorted by key, and the values belonging to the same key are merged together. This grouping makes it straightforward to eliminate duplicate values and to hand each reducer all the values for a given key.
4. Reducing - The reducer processes each key together with its grouped values, aggregating the intermediate values into a smaller, final set of output values.
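The four phases above can be sketched in plain Python with the classic word-count example. This is a hypothetical in-memory simulation for illustration only; a real Hadoop job would implement Mapper and Reducer classes in Java (or use Hadoop Streaming), and the function names here are assumptions, not Hadoop APIs.

```python
from itertools import groupby

def mapper(line):
    # Mapping: emit an intermediate (word, 1) pair for every word in the line.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffling: sort the intermediate pairs by key, then merge the
    # values for each key into a single list.
    pairs = sorted(pairs, key=lambda kv: kv[0])
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, [value for _, value in group]

def reducer(key, values):
    # Reducing: aggregate the grouped values into one output value.
    return key, sum(values)

def run_job(lines):
    # Input split: here each line of text stands in for one input split.
    mapped = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle(mapped))

counts = run_job(["big data is big", "data moves fast"])
print(counts)  # {'big': 2, 'data': 2, 'fast': 1, 'is': 1, 'moves': 1}
```

In a real cluster the mapper and reducer run on many machines at once, and the shuffle step moves data across the network so that all pairs with the same key land on the same reducer.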