Introduction
MapReduce is a programming model used in Hadoop for processing massive data volumes across distributed computers. The Map and Reduce phases are the two most crucial components of any MapReduce job. The Combiner is also known as a Mini-Reducer because it summarises Mapper output records that share the same key before passing them to the Reducer.

The MapReduce Combiner is a Hadoop framework feature that enables more efficient data processing by aggregating intermediate data before delivering it to the reducers. This blog explains in detail how the MapReduce Combiner works.
What is the MapReduce Combiner?
When we run a MapReduce job on a large dataset, the Mapper generates significant amounts of intermediate data. The Reducer receives this intermediate data and processes it further. The Hadoop Combiner, part of the MapReduce framework, is essential for easing network congestion. The Combiner in MapReduce is also known as a Mini-Reducer. Its primary responsibility is to process the Mapper's output before it is delivered to the Reducer. It is optional and runs after the Mapper and before the Reducer.
There is no separate interface for the Combiner; instead, it must implement the Reducer interface, and the reduce() method of the Combiner is called for each output key from the map. The input and output key-value types for the reduce() method of the combiner class must match those of the reducer class.
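Because there is no separate Combiner interface, a combiner is written exactly like a reducer. Below is a minimal sketch, assuming word-count style (Text, IntWritable) types; the class name SumCombiner is illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A combiner extends Reducer; its input and output key-value types must both
// match the map output types, because the framework may run it zero or more
// times between the Mapper and the Reducer.
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();   // partial, map-side sum for this key
    }
    context.write(key, new IntWritable(sum));
  }
}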
Working of the MapReduce Combiner
Since the MapReduce Combiner lacks a predefined interface of its own, it must implement the Reducer interface and override its reduce() method. The combiner processes each output key produced by a map, so key-value pairs that share the same key are aggregated locally before they reach the Reducer class. Because it replaces the map's original output data with a summary, the combiner keeps the intermediate data small even for a large dataset. When a MapReduce job is run on a large dataset, the map class generates a sizeable amount of intermediate data, and handing all of it directly to the reducer for further processing would cause significant network congestion.
Without the combiner, the MapReduce program overview looks somewhat like this:

In the above diagram, there is no combiner. The input is split in half across two map classes, or mappers, and together the mappers generate nine key-value pairs of intermediate data. The mappers deliver this intermediate data directly to the reducer class, which requires network bandwidth: bandwidth limits how quickly data can move from one machine to another. If the data is too large, the transfer time increases significantly.
Hadoop's combiner aggregates the intermediate data before it is sent to the reducer, so the nine key-value pairs are reduced to just four, two from each combiner. Look below to find out how.
The MapReduce program outline using the combiner looks something like this:

Now the reducer only processes the four key-value pairs that come from the two combiners. It needs to handle just four entries to produce the final result, which improves overall speed.
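As a concrete illustration (the keys below are made up for the example), the nine intermediate pairs might be collapsed by the two combiners as follows:

Mapper 1 output (5 pairs):   (car, 1) (bus, 1) (car, 1) (car, 1) (bus, 1)
Mapper 2 output (4 pairs):   (car, 1) (bus, 1) (bus, 1) (car, 1)
Combiner 1 output (2 pairs): (car, 3) (bus, 2)
Combiner 2 output (2 pairs): (car, 2) (bus, 2)
Reducer input: 4 pairs instead of 9; final output: (car, 5) (bus, 4)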
MapReduce Components and MapReduce Combiner Implementation
The implementation of MapReduce components is shown below:
- Map Phase: The map phase splits the input data into two parts: keys and values. Keys must be writable and comparable, whereas values only need to be writable. When a client submits input data to the Hadoop system, the job tracker allocates tasks to the task trackers. The combiner, also known as a little reducer, is usually set to the same class as the reducer. Transferring large amounts of intermediate data consumes a lot of network bandwidth. The partitioner module is essential in Hadoop, and the default partitioner hashes the key; by spreading keys across reducers it lessens the pressure on any single reducer and improves performance.
- Processing at the Intermediate Stage: The map output enters the sort and shuffle phase during the intermediate stage. Intermediate data is not replicated across Hadoop nodes; it is stored on the local file system of each node. Hadoop writes intermediate data to the local disks in a round-robin fashion, and the shuffle and sort requirements must be satisfied before the data is written to the local disks.
- Reducer Phase: The reducer accepts the sorted and shuffled data as input. It merges all the input data and writes key-value pairs that share the same key to the HDFS system. A reducer is not always required, for example in pure search or map-only jobs. Various job properties let you choose the number of reducers for each job, as shown in the configuration sketch after this list. Speculative execution is also important during job processing.
- Combiner Phase: The Combiner phase is an optional optimization step in the MapReduce framework, aimed at reducing the amount of data shuffled across the network during the Shuffle and Sort phase. The Combiner is similar to a Reducer but operates locally on the output of each Mapper before data is shuffled. It aggregates or combines intermediate key-value pairs with the same key, reducing the volume of data that needs to be transferred over the network to the Reducers. This optimization can significantly improve the overall efficiency and performance of MapReduce jobs, particularly in scenarios where the output of the Mapper generates a large number of intermediate key-value pairs.
- Record Writer: The Record Writer is responsible for writing the output of the MapReduce job to the desired output location, such as a file system or database. It receives the final output key-value pairs produced by the Reducer phase and serializes them into a suitable format for storage or further processing. The Record Writer ensures that the output data is persisted reliably and efficiently, following the specified output format and organization.
- Record Reader: The Record Reader, on the other hand, is responsible for reading input data and converting it into a format suitable for processing by the Mapper phase. It abstracts the details of data retrieval and parsing, allowing the Mapper to focus on processing the data itself. The Record Reader typically interacts with the underlying storage system, such as a file system or database, to fetch input data and present it to the Mapper as a stream of key-value pairs or records. It plays a crucial role in enabling MapReduce jobs to efficiently process large-scale datasets distributed across multiple nodes in a cluster.
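The sketch below shows how these components are wired together in a job driver. It is only a fragment from a driver's main method, and MyDriver, MyMapper, and MyReducer are placeholder class names; HashPartitioner is set explicitly purely for illustration, since hash partitioning is already the framework default.

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "combiner demo");
job.setJarByClass(MyDriver.class);
job.setMapperClass(MyMapper.class);             // map phase
job.setCombinerClass(MyReducer.class);          // optional combiner (mini reducer)
job.setPartitionerClass(HashPartitioner.class); // hash partitioning is the default
job.setReducerClass(MyReducer.class);           // reducer phase
job.setNumReduceTasks(2);                       // choose how many reducers the job uses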
Code Implementation of MapReduce Combiner
Consider that we have the following input text file named input.txt for a MapReduce word-count job.
John is a cute and simple child
John is the only child of his parent
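The program below is a minimal sketch in the style of the classic Hadoop WordCount example, with the reducer class also registered as the combiner via job.setCombinerClass(). The class names and the simple whitespace tokenisation are illustrative choices.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in a line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word. Because summing is commutative
  // and associative, the same class can safely act as the Combiner.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // the Reducer doubles as the Combiner
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}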
Output:
John 2
is 2
child 2
a 1
cute 1
and 1
simple 1
the 1
only 1
of 1
his 1
parent 1
The above code generates the count of each word present in the input file with the help of the MapReduce Combiner.
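Assuming the program is compiled and packaged as wordcount.jar (the jar name and HDFS paths below are only examples), the job can be run and its result inspected with the standard Hadoop commands:

hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output
hdfs dfs -cat /user/hadoop/output/part-r-00000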
Advantages of Combiners
The advantages of Combiners are given below:
- The Hadoop Combiner speeds up the data transfer process between the mapper and reducer.
- The combiner improves the overall performance of the reducer by reducing network congestion.
- By reducing the amount of data that needs to be transferred between the Mapper and Reducer, combiners can help in making MapReduce operations more scalable.
- By conducting some preliminary data processing before the data is submitted to the Reducer, combiners can be utilized to optimize MapReduce operations.
Disadvantages of Combiners
The disadvantages of Combiners are discussed as follows:
- As there is no assurance that the Hadoop combiner will run, MapReduce tasks cannot rely on it.
- In Hadoop, the intermediate key-value pairs are stored on the local filesystem, and running the combiner over them later results in expensive disk I/O.
- Combiners can raise the resource consumption of MapReduce tasks since they need more CPU and memory to run their operations.
- Combiners can make MapReduce tasks more complex since they call for the implementation of additional logic.