MapReduce
What is MapReduce?
MapReduce is a programming model and processing technique for distributed parallel processing of large data sets.
The model consists of two tasks: Map and Reduce. The mapper transforms split input data into intermediate key-value pairs, and the reducer aggregates those pairs after the framework shuffles and sorts them by key.
Data goes through the following phases in MapReduce technique:
1. Input Split - The input data, a file or directory stored in HDFS, is divided into fixed-size logical units called input splits when a job is submitted to Hadoop. A RecordReader (supplied by TextInputFormat by default) transforms each input split into key-value pairs.
2. Mapping - The mapper processes each input key-value pair and emits zero or more intermediate key-value pairs.
3. Shuffling - Shuffling consists of two steps: sorting and merging. The intermediate key-value pairs are sorted by key, and the values belonging to the same key are merged together. This grouping makes it straightforward to eliminate duplicate values and to hand each reducer all the values for a given key.
4. Reducing - The reducer processes each key together with its grouped values, aggregating the intermediate values into a smaller, final set of output values.
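The four phases above can be sketched in plain Python with the classic word-count example. This is a hypothetical in-memory simulation for illustration only; a real Hadoop job would implement Mapper and Reducer classes in Java (or use Hadoop Streaming), and the function names here are assumptions, not Hadoop APIs.

```python
from itertools import groupby

def mapper(line):
    # Mapping: emit an intermediate (word, 1) pair for every word in the line.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffling: sort the intermediate pairs by key, then merge the
    # values for each key into a single list.
    pairs = sorted(pairs, key=lambda kv: kv[0])
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, [value for _, value in group]

def reducer(key, values):
    # Reducing: aggregate the grouped values into one output value.
    return key, sum(values)

def run_job(lines):
    # Input split: here each line of text stands in for one input split.
    mapped = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle(mapped))

counts = run_job(["big data is big", "data moves fast"])
print(counts)  # {'big': 2, 'data': 2, 'fast': 1, 'is': 1, 'moves': 1}
```

In a real cluster the mapper and reducer run on many machines at once, and the shuffle step moves data across the network so that all pairs with the same key land on the same reducer.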