Table of contents
1. Introduction
2. What is the MapReduce Combiner?
  2.1. Working of the MapReduce Combiner
  2.2. MapReduce Components and MapReduce Combiner Implementation
  2.3. Code Implementation of MapReduce Combiner
  2.4. Java
  2.5. Advantages of Combiners
  2.6. Disadvantages of Combiners
3. Frequently Asked Questions
  3.1. What is the function of the combiner?
  3.2. What is the difference between reducer and combiner?
  3.3. What is the purpose of the combiner function in MapReduce?
  3.4. What is a partitioner?
4. Conclusion
Last Updated: Apr 24, 2024

MapReduce Combiner

Introduction

MapReduce is a programming style used in Hadoop for processing massive data volumes across distributed computers. The Map and Reduce phases are the two most crucial components of any MapReduce task. The Combiner is also known as a Mini-Reducer since it summarises the Mapper output record with the same Key before passing it to the Reducer. 

The MapReduce Combiner is a Hadoop framework feature that enables more efficient data processing by aggregating intermediate data before delivering it to the reducers. In this blog, we will look at how the MapReduce Combiner works in detail.

What is the MapReduce Combiner?

When we perform a MapReduce task on a large dataset, the Mapper generates significant amounts of intermediate data. The Reducer receives this intermediate data and processes it further. The Hadoop Combiner function, part of the MapReduce framework, is essential for easing network congestion. The Combiner in MapReduce is also known as a Mini-Reducer. Its primary responsibility is to process the Mapper's output before that output is delivered to the Reducer. It is optional and runs after the Mapper and before the Reducer.

There is no separate interface for the Combiner; instead, it must implement the Reducer interface, and the reduce() method of the Combiner is called for each output key from the map. The input and output key-value types for the reduce() method of the combiner class must match those of the reducer class.
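As an illustration, a combiner can also be written as a class of its own, just like a reducer. The class name SumCombiner below is an assumption for this sketch, and it reuses the Text/IntWritable types and imports of the full WordCount example shown later in this article. Note that the input and output key-value types of its reduce() method match those of the reducer:

  public static class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable>
  {
     private final IntWritable partialSum = new IntWritable();

     // Called once per key on the output of a single mapper.
     public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
     {
        int sum = 0;
        for (IntWritable val : values)
        {
           sum += val.get(); // aggregate the counts emitted by this mapper
        }
        partialSum.set(sum);
        context.write(key, partialSum); // emit a partial, per-mapper count
     }
  }

Such a class would be registered on the job with job.setCombinerClass(SumCombiner.class). Because counting is associative and commutative, applying it before the reducer does not change the final result.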

Read about: MapReduce Partitioner and MapReduce Types

Working of the MapReduce Combiner

Since the MapReduce Combiner has no predefined interface of its own, it must implement the Reducer interface's reduce() method. The combiner runs on the output of each mapper, grouping map outputs that share the same key so that they can later be handled by the Reducer class. Because it replaces the map's original output with locally aggregated records, the combiner produces compact summary information even for a large dataset. When a MapReduce job is run on a large dataset, the map class generates a sizable amount of intermediate data that is handed to the reducer for further processing; transferring all of it directly causes significant network congestion.

Without the combiner, the MapReduce program overview looks somewhat like this:

MapReduce without combiner

In the above diagram, there is no combiner. The input is split between two map classes, or mappers, which together generate nine intermediate key-value pairs. The mappers deliver all nine pairs directly to the reducer class, which consumes network bandwidth: the rate at which data can be moved from one machine to another. If the intermediate data is large, transferring it takes significantly more time.

Hadoop provides a combiner that aggregates the intermediate data locally before it is sent to the reducer. In this example, the nine intermediate key-value pairs shrink to just four, two from each combiner. Look below to find out how.

The MapReduce program outline using the combiner looks something like this:

MapReduce using Combiner

With the combiner in place, the reducer only has to process the four key-value pairs produced by the two combiners. It needs just these four pairs to compute the final result, which improves the overall speed of the job.
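
For a concrete picture of what the two diagrams describe, suppose the mappers of a word-count job emit the following nine intermediate pairs (the words and counts here are illustrative, not taken from the diagrams). Each combiner merges the pairs that share a key on its own node, so only four pairs are sent over the network:

Mapper 1 output   : (John, 1), (is, 1), (John, 1), (is, 1), (John, 1)
Combiner 1 output : (John, 3), (is, 2)

Mapper 2 output   : (child, 1), (is, 1), (child, 1), (is, 1)
Combiner 2 output : (child, 2), (is, 2)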

MapReduce Components and MapReduce Combiner Implementation

The implementation of MapReduce components is shown below:

  • Map Phase: The map phase splits the input data into two parts: keys and values. The key must be writable and comparable, since it is used for sorting during processing, while the value only needs to be writable. When a client submits input data to the Hadoop system, the job tracker assigns tasks to the task trackers. The reducer code can also be set as the combiner, which is why the combiner is known as a mini reducer. Transferring a large amount of intermediate data consumes a lot of network bandwidth. The default partitioner is hash-based; the partition module is essential in Hadoop, and spreading the load evenly across reducers through the partitioner improves performance.
     
  • Processing at the Intermediate Stage: During the intermediate phase, the map output goes through the sort and shuffle step. Intermediate data is not replicated across Hadoop nodes; it is kept on the local file system of each node. Hadoop writes the intermediate data to the local disks in a round-robin fashion, and the shuffle and sort criteria must be satisfied before the data is written to those disks.
     
  • Reducer Phase: The reducer accepts the sorted and shuffled data as its input. All of the input data is merged, and key-value pairs with the same key are written to the HDFS system. A reducer is not always required, for example in pure search or mapping jobs. Various job properties allow you to choose the number of reducers for each job. Speculative execution is also important during job processing.
  • Combiner Phase: The Combiner phase is an optional optimization step in the MapReduce framework, aimed at reducing the amount of data shuffled across the network during the Shuffle and Sort phase. The Combiner is similar to a Reducer but operates locally on the output of each Mapper before data is shuffled. It aggregates or combines intermediate key-value pairs with the same key, reducing the volume of data that needs to be transferred over the network to the Reducers. This optimization can significantly improve the overall efficiency and performance of MapReduce jobs, particularly in scenarios where the output of the Mapper generates a large number of intermediate key-value pairs.
     
  • Record Writer: The Record Writer is responsible for writing the output of the MapReduce job to the desired output location, such as a file system or database. It receives the final output key-value pairs produced by the Reducer phase and serializes them into a suitable format for storage or further processing. The Record Writer ensures that the output data is persisted reliably and efficiently, following the specified output format and organization.
     
  • Record Reader: The Record Reader, on the other hand, is responsible for reading input data and converting it into a format suitable for processing by the Mapper phase. It abstracts the details of data retrieval and parsing, allowing the Mapper to focus on processing the data itself. The Record Reader typically interacts with the underlying storage system, such as a file system or database, to fetch input data and present it to the Mapper as a stream of key-value pairs or records. It plays a crucial role in enabling MapReduce jobs to efficiently process large-scale datasets distributed across multiple nodes in a cluster.

Code Implementation of MapReduce Combiner

Consider that we have the following input text file named input.txt for the MapReduce job.

John is a cute and simple child
John is the only child of his parent

 

Java

import java.io.IOException;
import java.util.StringTokenizer;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;


import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;


import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class WordCount {
  // Mapper: splits each input line into tokens and emits (word, 1) for every token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
  {
     private final static IntWritable one = new IntWritable(1);
     private Text word = new Text();
    
     public void map(Object key, Text value, Context context) throws IOException, InterruptedException
     {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens())
        {
           word.set(itr.nextToken());
           context.write(word, one);
        }
     }
  }
 
  // Reducer: sums all the counts received for each word; main() also reuses it as the combiner.
  public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable>
  {
     private IntWritable result = new IntWritable();
     public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
     {
        int sum = 0;
        for (IntWritable val : values)
        {
           sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
     }
  }
 
  public static void main(String[] args) throws Exception
  {
     Configuration conf = new Configuration();
     Job job = Job.getInstance(conf, "word count");

     job.setJarByClass(WordCount.class);
     job.setMapperClass(TokenizerMapper.class);
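     // Reuse the reducer as the combiner: summing partial counts on each mapper
     // node gives the same final result, so this local aggregation is safe.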
     job.setCombinerClass(IntSumReducer.class);
     job.setReducerClass(IntSumReducer.class);

     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(IntWritable.class);

     FileInputFormat.addInputPath(job, new Path(args[0]));
     FileOutputFormat.setOutputPath(job, new Path(args[1]));

     System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Output:

John   2
a   1
and   1
child   2
cute   1
his   1
is   2
of   1
only   1
parent   1
simple   1
the   1


The above code generates the count of each word present in the input file with the help of MapReduce Combiner.

Advantages of Combiners

The advantages of Combiners are given below:

  • The Hadoop Combiner speeds up the data transfer process between the mapper and reducer.
     
  • The combiner improves the overall performance of the reducer by reducing network congestion.
     
  • By reducing the amount of data that needs to be transferred between the Mapper and Reducer, combiners can help in making MapReduce operations more scalable.
     
  • By conducting some preliminary data processing before the data is submitted to the Reducer, combiners can be utilized to optimize MapReduce operations.

Disadvantages of Combiners 

The disadvantages of Combiners are discussed as follows:

  • As there is no assurance that the Hadoop combiner will run, MapReduce tasks cannot rely on it.
     
  • In Hadoop, the map output key-value pairs are written to the local filesystem, and running the combiner over them later can result in expensive disk I/O.
     
  • Combiners can raise the resource consumption of MapReduce tasks since they need more CPU and memory to run their operations. 
     
  • Combiners can make MapReduce tasks more complex since they call for the implementation of additional logic.

Frequently Asked Questions

What is the function of the combiner?

The Combiner aggregates intermediate key-value pairs locally on each Mapper node to reduce data shuffled across the network.

What is the difference between reducer and combiner?

The Reducer processes shuffled and sorted data across the cluster, while the Combiner operates locally on Mapper outputs.

What is the purpose of the combiner function in MapReduce?

The Combiner optimizes MapReduce jobs by reducing data transfer overhead, aggregating data before shuffling to Reducers for efficiency.

What is a partitioner?

The Partitioner is the phase that controls how the intermediate map output keys are distributed, typically by applying a hash function to each key. The partitioning procedure decides which reducer receives a given key-value pair from the map output. The total number of partitions is equal to the number of reduce tasks for the job.
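
As a sketch of the idea, assuming the same Text/IntWritable key-value types as the word-count example above (the class name WordPartitioner is made up for illustration), a custom partitioner in the MapReduce API looks roughly like this:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
   @Override
   public int getPartition(Text key, IntWritable value, int numReduceTasks) {
      // Hash the key and map it onto one of the available reduce tasks,
      // mirroring what Hadoop's default HashPartitioner does.
      return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
   }
}

It would be attached to a job with job.setPartitionerClass(WordPartitioner.class).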

Conclusion

This article explains the concepts of the MapReduce combiner, its working, its advantages, and disadvantages, along with some frequently asked questions related to the topic. Hope this article was beneficial and you learned something new. To have a better understanding of the topic, you can further refer to MapReduce Fundamentals and Hadoop MapReduce.

For more information, refer to our Guided Path on Coding Ninjas Studio to upskill yourself in Python, Data Structures and Algorithms, Competitive Programming, System Design, and many more! 

Head over to our practice platform, Coding Ninjas Studio, to practice top problems, attempt mock tests, read interview experiences and interview bundles, follow guided paths for placement preparations, and much more! 

Happy Learning Ninja!
