Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1. Introduction
2. What is MapReduce Partitioner?
   2.1. Need of MapReduce Partitioner
   2.2. Poor Partitioning in MapReduce
   2.3. Number of Partitioners
3. Implementation of MapReduce Partitioner
   3.1. Input Data
   3.2. Map Task (Input, Method, Output)
   3.3. Partitioner Task (Input, Method, Output)
   3.4. Reduce Task (Input, Method, Output)
4. Frequently Asked Questions
   4.1. What is MapReduce used for?
   4.2. What is a partitioner?
   4.3. What is the difference between partitioner and combiner in MapReduce?
5. Conclusion
Last Updated: May 1, 2024

MapReduce Partitioner


Introduction

In this article, we will look at the Partitioner in Hadoop. A Partitioner acts as a condition on the processing of an input dataset. The partitioning phase occurs after the Map phase but before the Reduce phase. The MapReduce Partitioner allows the Map output to be distributed uniformly across the reducers. The Map output is partitioned based on the key.


The MapReduce Partitioner is essential for processing large volumes of data in parallel. Read on for an in-depth discussion of the MapReduce Partitioner.

What is MapReduce Partitioner?

In MapReduce job execution, the Partitioner controls the partitioning of the keys of the intermediate Map outputs. The partition is derived by applying a hash function to the key (or a subset of the key). The total number of partitions equals the number of Reduce tasks. The framework partitions each mapper's output based on the key, so records with the same key are placed in the same partition (within each mapper). Each partition is then submitted to a reducer. The Partitioner class determines which partition a (key, value) pair is assigned to. The partitioning phase in the MapReduce data flow occurs after the Map phase and before the Reduce phase.

Need of MapReduce Partitioner

During the execution of a MapReduce job, a list of key-value pairs is constructed from the input dataset. These key-value pairs are produced by the Map phase: the input data is divided into splits, each Map task handles one split, and each Map produces a list of key-value pairs. The framework then delivers the Map output to the Reduce phase, where it is processed by the user-defined reduce function. Before the Reduce phase, the Map output is partitioned based on the key.

Partitioning groups together the values of each key. It ensures that all values of a key are assigned to the same reducer, which enables the Map output to be distributed evenly across the reducers. The Partitioner in a MapReduce job routes the mapper output to the reducers by identifying which reducer handles each specific key.

Poor Partitioning in MapReduce

Suppose one key appears far more often than any other key in the input data of a MapReduce job. In this scenario, we employ two rules to transfer data to the partitions:

  • The key that appears most frequently is sent to one partition.
     
  • All other keys are routed to partitions based on their hashCode().
     

If the hashCode() function does not distribute the remaining keys evenly over the partition range, the data will not be sent evenly to the reducers.

Poor data partitioning means certain reducers receive more input data than others and therefore carry more work than the rest. As a result, the entire job must wait for one reducer to finish its extra-large share of the load.

We can construct a custom Partitioner to overcome poor partitioning in MapReduce. This spreads the workload evenly across the reducers.
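To make the skew-handling idea concrete, here is a minimal plain-Java sketch, not the Hadoop Partitioner API, of a partition function that pins a known hot key to one partition and spreads every other key by hashCode(). The class name and the hotKey parameter are our own illustrations, and the sketch assumes at least two reduce tasks.

```java
public class SkewPartitionDemo {
    // Hypothetical sketch: route the single hot key to partition 0 and
    // spread every other key over the remaining partitions by hashCode().
    // Assumes numReduceTasks >= 2.
    public static int getPartition(String key, String hotKey, int numReduceTasks) {
        if (key.equals(hotKey)) {
            return 0; // the most frequent key gets its own partition
        }
        // Mask the sign bit so the hash is non-negative before the modulo.
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1);
    }

    public static void main(String[] args) {
        System.out.println(getPartition("hot", "hot", 4));   // 0
        System.out.println(getPartition("other", "hot", 4)); // somewhere in 1..3
    }
}
```

In a real Hadoop job, the same decision would live inside a subclass of the Partitioner class; this sketch only isolates the routing arithmetic so it can be reasoned about on its own.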

Number of Partitioners

The total number of partitions is determined by the number of reducers, which is specified by the JobConf.setNumReduceTasks() method. Thus, the data from a single partition is processed by a single reducer. Note that the framework creates a Partitioner only when there is more than one reducer.

HashPartitioner is the default Partitioner. It computes a hash value of the key and assigns the partition based on the result.
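The default hash-partitioning arithmetic can be written out in a few lines. Below is a self-contained sketch of that computation; the class and method names are ours, not Hadoop's: mask the sign bit of the key's hash code, then take it modulo the number of reduce tasks, so the same key always lands in the same partition.

```java
public class HashPartitionDemo {
    // Sketch of the default hash-partitioning arithmetic:
    // mask the sign bit of the key's hash code so it is non-negative,
    // then take it modulo the number of reduce tasks.
    public static int hashPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition.
        System.out.println(hashPartition("Male", 3));
        System.out.println(hashPartition("Female", 3));
    }
}
```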


Implementation of MapReduce Partitioner

Suppose we have a small table called Student that has the following information. This sample information will be used as our input dataset to show how the Partitioner works.

student_id | student_name | student_age | student_gender | student_marks
1 | Harry | 12 | Male | 85
2 | John | 18 | Male | 45
3 | Elizabeth | 23 | Female | 58
4 | Danial | 13 | Male | 94
5 | Helena | 45 | Female | 89
6 | Ronald | 34 | Male | 76
7 | Jamme | 29 | Male | 66
8 | Margery | 22 | Female | 88

We have to develop an application that analyzes the input dataset to determine the student with the highest marks by gender in several student_age categories (18 and under, between 19 and 30, and above 30).

The algorithmic explanation of the Partitioner is given below based on the above input.

Input Data

In the context of MapReduce, input data refers to the dataset that needs to be processed using the MapReduce paradigm.

When discussing the implementation of a MapReduce partitioner specifically, the input data typically refers to the key-value pairs generated by the mapper function. In a MapReduce job, the input data is divided into chunks, and each chunk is processed by a mapper function. The mapper function generates intermediate key-value pairs based on the input data it receives.

Map Task

The map task accepts key-value pairs as input, and we have text data in a text file. The following is the input for this map task:

Input

A pattern like "any special key + filename + line number" would be the key. For example, key = @input1. The data in that line would be the value. For example, 1 \t Harry \t 12 \t Male \t 85.
 

Method

The following is how this map task works: 

  • In a string, read the value (record data) that comes as an input value from the argument list.
     
  • Separate the student_gender and save it in a string variable using the split method.
     
String[] str_array = value.toString().split("\t", -3);
String gender = str_array[3];

 

Send the student_gender information and the record data value from the map task to the partition task as an output key-value pair. 

context.write(new Text(gender), new Text(value));


Repeat the previous steps for each record in the text file.
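The map steps above can be simulated outside Hadoop with a small plain-Java sketch (the class and method names are ours): it splits one tab-separated record and emits the gender field as the key and the whole record as the value.

```java
public class MapTaskDemo {
    // Simulate the map step: extract student_gender (the 4th field) as the
    // key and keep the whole record line as the value.
    public static String[] mapRecord(String line) {
        String[] fields = line.split("\t", -3);
        String gender = fields[3];
        return new String[] { gender, line };
    }

    public static void main(String[] args) {
        String record = "1\tHarry\t12\tMale\t85";
        String[] kv = mapRecord(record);
        System.out.println(kv[0] + " -> " + kv[1]);
    }
}
```

In a real mapper, the returned pair would instead be emitted via context.write(); here it is returned so the logic can be tested in isolation.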
 

Output

The gender data and the record data value will be returned as key-value pairs.

Partitioner Task

The partitioner task takes the key-value pairs from the map task as input. Partitioning means separating the data into segments. Using the conditional criteria below, the input key-value pairs can be separated into three segments based on student_age.

Input

The entire dataset is represented as a set of key-value pairs.

           key = student_gender field value in the record.

           value = Whole record data value of that gender.
 

Method

The process of partition logic runs as follows. 

  • Read the student_age field value from the key-value pair input.
     
String[] str = value.toString().split("\t");
int age = Integer.parseInt(str[2]);

 

  • Check the student_age value under the following conditions.

    1. student_age less than or equal to 18
       
    2. student_age greater than 18 and less than or equal to 30.
       
    3. student_age greater than 30.
       
if (age <= 18) {
    return 0;
} else if (age > 18 && age <= 30) {
    return 1 % numReduceTasks;
} else {
    return 2 % numReduceTasks;
}

 

Output

The entire key-value pair data set is divided into three sets of key-value pairs. The Reducer operates on each collection separately.

Reduce Task

The number of partitions equals the number of reducer tasks. Since the data is divided into three partitions, three Reducer tasks are run.

Input

The Reducer will run three times with a different set of key-value pairs each time. 

           key: It stores the student_gender field value in the record.

           value: It stores the whole record data of that gender.
 

Method

On each collection, the following logic will be applied: 

  • Read the student_marks field value of each record.
     
String[] str = val.toString().split("\t", -3);
// str[4] holds the student_marks field value.

 

Check the student_marks against the maximum variable. If str[4] is greater than the current maximum student_marks, set it as the max; otherwise, skip this step.
     
if (Integer.parseInt(str[4]) > max) {
    max = Integer.parseInt(str[4]);
}

 

For each key collection (Male and Female are the key collections), repeat Steps 1 and 2. After completing these steps, you will have one maximum marks value from the Male key collection and one from the Female key collection.
     
context.write(new Text(key), new IntWritable(max));
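The reduce logic above can likewise be simulated outside Hadoop. This plain-Java sketch (class and method names are ours) scans one key group's records and keeps the maximum student_marks value from field str[4], exactly as the steps describe.

```java
import java.util.Arrays;
import java.util.List;

public class ReduceTaskDemo {
    // Find the maximum student_marks (the 5th field) across the records
    // belonging to one key group, mirroring the reduce steps above.
    public static int maxMarks(List<String> records) {
        int max = Integer.MIN_VALUE;
        for (String rec : records) {
            String[] str = rec.split("\t", -3);
            int marks = Integer.parseInt(str[4]);
            if (marks > max) {
                max = marks;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // One key group: Male students aged 18 or under from the sample table.
        List<String> maleUnder18 = Arrays.asList(
            "1\tHarry\t12\tMale\t85",
            "2\tJohn\t18\tMale\t45",
            "4\tDanial\t13\tMale\t94");
        System.out.println(maxMarks(maleUnder18)); // 94
    }
}
```

In a real reducer, the result would be emitted via context.write(key, new IntWritable(max)); returning it here keeps the logic testable on its own.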

 

Output

Finally, you will receive three collections of key-value pair data, one per age group. Each collection contains the maximum marks from the Male records and the maximum marks from the Female records in that age group. After the Map, Partitioner, and Reduce phases complete, the three collections of key-value pair data are saved as output in three separate files.


Frequently Asked Questions

What is MapReduce used for?

MapReduce is used for parallel processing of large datasets, dividing tasks into map (data processing) and reduce (aggregation) phases for efficient analysis.

What is a partitioner?

A partitioner in MapReduce distributes intermediate key-value pairs generated by mappers to reducers, ensuring a balanced workload and efficient processing.

What is the difference between partitioner and combiner in MapReduce?

A partitioner distributes data to reducers, while a combiner is an optional optimization in MapReduce that aggregates intermediate data locally on mapper nodes.

Conclusion

This article explains the concepts of the MapReduce Partitioner, its needs, and its simplified implementation, along with some frequently asked questions related to the topic. Hope this article was beneficial and you learned something new. To have a better understanding of the topic, you can further refer to MapReduce Fundamentals and Hadoop MapReduce.

For more information, refer to our Guided Path on Coding Ninjas Studio to upskill yourself in Python, Data Structures and Algorithms, Competitive Programming, System Design, and many more! 

Head over to our practice platform, Coding Ninjas Studio, to practice top problems, attempt mock tests, read interview experiences and interview bundles, follow guided paths for placement preparations, and much more! 

Happy Learning Ninja!
