Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1. Introduction
2. Hadoop Schedulers
3. Types of Job Scheduling in MapReduce
3.1. FIFO Scheduler
3.2. Capacity Scheduler
3.3. Fair Scheduler
4. When to use Each Job Scheduling in MapReduce
5. Importance of using Hadoop Schedulers
6. Limitations of Job Scheduling in MapReduce
7. Frequently Asked Questions
7.1. What is job scheduling?
7.2. What are the types of job scheduling in MapReduce?
7.3. What is a Job Queue?
8. Conclusion
Last Updated: Mar 27, 2024

Job Scheduling in MapReduce

Author yuvatimankar

Introduction

MapReduce is a framework for writing applications that process large amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.

The MapReduce algorithm consists of two essential tasks, Map & Reduce.

The Map task takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The Reduce task takes the output of a Map as its input and merges those tuples into a smaller set of tuples. As the name suggests, the Map task always runs before the Reduce task. In this article, we will discuss job scheduling in MapReduce, so let's get started!
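To make the two tasks concrete, here is a minimal word-count sketch in plain Python. It is only an illustration of the Map and Reduce steps described above, not Hadoop's API; the function names `map_phase` and `reduce_phase` are hypothetical.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: turn each input line into (word, 1) tuples."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: merge tuples that share a key into one (word, total) entry."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["map then reduce", "reduce after map"]
counts = reduce_phase(map_phase(lines))
print(counts)  # {'map': 2, 'then': 1, 'reduce': 2, 'after': 1}
```

In a real cluster the Map outputs would be partitioned and shuffled across nodes before the Reduce step; this sketch runs everything in one process.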

Hadoop Schedulers

Hadoop is a general-purpose system that allows high-performance data processing over a set of distributed nodes. It is also a multi-tasking system that processes multiple data sets for different jobs and multiple users in parallel. Early versions of Hadoop supported only a single scheduler, which was intermixed with the JobTracker logic. This implementation worked well for traditional batch jobs in Hadoop.

These earlier versions scheduled users' jobs in a very simple way: jobs generally ran in order of submission, using the Hadoop FIFO scheduler.

Types of Job Scheduling in MapReduce

There are mainly three different types of job scheduling in MapReduce:

  • First-in-first-out (FIFO)
     
  • Capacity Scheduler
     
  • Fair Scheduler

FIFO Scheduler

In early versions of Hadoop, FIFO was the default scheduling policy. This scheduler gives preference to tasks submitted earlier over those submitted later: it keeps applications in a queue and executes them in order of submission (first in, first out). Regardless of priority and size, the request of the first task in the queue is allocated first. The next task in the queue is served only once the first task's request has been satisfied.

Advantages 

  • Jobs are served according to their submission.
     
  • This scheduler is easy to understand and does not require any configuration.

Disadvantages

  • This scheduler is not well suited to shared clusters. If a large task arrives before a shorter one, the large task will use all the resources in the cluster. The shorter task then has to wait in the queue for its turn, which can lead to starvation.
     
  • The balance of resource allocation between long and short applications is not considered.
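The waiting problem described above can be sketched with a tiny simulation. This is a simplified model, assuming every job is submitted at time 0 and each running job occupies the whole cluster; `fifo_schedule` and the job names are hypothetical.

```python
def fifo_schedule(jobs):
    """Run jobs one at a time in submission order.

    jobs: list of (name, duration) tuples, already in submission order.
    Returns {name: (wait_time, finish_time)}.
    """
    clock = 0
    timeline = {}
    for name, duration in jobs:
        wait = clock          # time spent queued behind earlier jobs
        clock += duration     # the job runs to completion before the next starts
        timeline[name] = (wait, clock)
    return timeline

jobs = [("long_etl", 100), ("short_query", 2)]
result = fifo_schedule(jobs)
print(result)  # {'long_etl': (0, 100), 'short_query': (100, 102)}
```

The two-unit job waits 100 time units simply because it was submitted second.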

Capacity Scheduler

This scheduler permits multiple tenants to share a large Hadoop cluster securely. It supports hierarchical queues that mirror the structure of the groups and organizations that use the cluster's resources. This queue hierarchy consists of three types of queues: root, parent, and leaf.

The root queue represents the cluster itself, a parent queue represents a group or organization (or a sub-group or sub-organization), and a leaf queue accepts application submissions. The capacity scheduler enables sharing of a large cluster while providing a capacity guarantee to every organization by allocating a fraction of the cluster's resources to every queue.

Also, when a queue has free resources because its applications have completed their tasks, those resources can be assigned to applications on queues running below capacity. This gives organizations elasticity in a cost-effective way.
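The guarantee-plus-elasticity idea can be sketched as follows. This is a deliberately simplified model, not the real Capacity Scheduler algorithm: it assumes a flat set of leaf queues, integer resource units, and lends spare capacity in dictionary order. The function `capacity_allocate` and the queue names are hypothetical.

```python
def capacity_allocate(total, queues):
    """queues: {name: (capacity_percent, demand)} -> {name: allocation}."""
    alloc = {}
    # Step 1: every queue receives up to its guaranteed fraction of the cluster.
    for name, (percent, demand) in queues.items():
        guaranteed = total * percent // 100
        alloc[name] = min(demand, guaranteed)
    # Step 2: capacity left idle is lent to queues that still have demand.
    spare = total - sum(alloc.values())
    for name, (percent, demand) in queues.items():
        extra = min(spare, demand - alloc[name])
        alloc[name] += extra
        spare -= extra
    return alloc

queues = {"engineering": (60, 90), "marketing": (40, 10)}
allocation = capacity_allocate(100, queues)
print(allocation)  # {'engineering': 90, 'marketing': 10}
```

Here marketing needs only 10 of its guaranteed 40 units, so engineering can temporarily grow past its 60-unit guarantee.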

Advantages

  • This scheduler provides capacity guarantees and safeguards to the organizations using the cluster.
     
  • It maximizes the throughput and utilization of resources in the Hadoop cluster.

Disadvantages

  • Compared to the other two schedulers, the capacity scheduler is considered the most complex.

Fair Scheduler

The fair scheduler permits YARN applications to share resources in large Hadoop clusters evenly. With this scheduler, you are not required to reserve a set amount of capacity because it dynamically balances resources between all running applications. Resources are assigned so that, on average over time, every application gets an equal share.

By default, this scheduler makes its fairness decisions based only on memory. When a single application is running, it can use the resources of the entire cluster. When other applications are submitted, resources that free up are assigned to the new applications so that each application ends up with roughly the same amount of resources. This lets short applications complete in a reasonable amount of time without starving long-lived applications.

Like the capacity scheduler, the fair scheduler also supports hierarchical queues that reflect the structure of a large shared cluster.

In this scheduler, an application present in a queue gets at least its minimum share. When an application does not need its full guaranteed share, the excess is distributed among the other running applications.
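One common way to express this kind of sharing is max-min fairness: applications that need less than an equal share are fully served, and what they leave over is split among the rest. The sketch below illustrates that idea for a single resource (for example, memory); it is an assumption-laden illustration, not the Fair Scheduler's actual code, and `fair_share` is a hypothetical name.

```python
def fair_share(total, demands):
    """Max-min fair split of `total` among apps. demands: {app: demand}."""
    alloc = {name: 0.0 for name in demands}
    remaining = float(total)
    active = set(demands)
    while active and remaining > 0:
        share = remaining / len(active)
        # Apps whose whole demand fits inside an equal share are fully served.
        satisfied = {n for n in active if demands[n] <= share}
        if not satisfied:
            # Everyone left wants more than an equal share: split evenly.
            for n in active:
                alloc[n] = share
            break
        for n in satisfied:
            alloc[n] = float(demands[n])
            remaining -= demands[n]
        active -= satisfied
    return alloc

shares = fair_share(100, {"A": 20, "B": 50, "C": 80})
print(shares)  # {'A': 20.0, 'B': 40.0, 'C': 40.0}
```

Application A needs less than an equal third, so the 13.3 units it leaves unused are redistributed between B and C.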

Advantages

  • It provides a reasonable way to share the cluster among a number of users.
     
  • The fair scheduler can work with application priorities. Priorities are used as a weight to recognize the fraction of the total resources every application must get.
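As a rough illustration of priorities acting as weights, the sketch below splits a resource pool in proportion to each application's weight. The weights and the helper `weighted_shares` are hypothetical, and the sketch ignores per-application demand.

```python
def weighted_shares(total, weights):
    """Split `total` in proportion to each application's weight."""
    weight_sum = sum(weights.values())
    return {name: total * w / weight_sum for name, w in weights.items()}

split = weighted_shares(120, {"high": 3, "normal": 2, "low": 1})
print(split)  # {'high': 60.0, 'normal': 40.0, 'low': 20.0}
```

An application with weight 3 receives three times the share of an application with weight 1.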

Disadvantages

  • Configuration is required.

When to use Each Job Scheduling in MapReduce

  • The capacity scheduler is the right choice when we want guaranteed access to capacity together with the ability to reuse unused capacity.
     
  • The fair scheduler works well when a single organization uses both large and small clusters with limited workloads. It is also helpful when the workload contains a variety of jobs.

Importance of using Hadoop Schedulers

  • If we are running a huge cluster with multiple job types, priorities, and sizes, along with multiple clients, then selecting the right kind of Hadoop scheduler becomes important.
     
  • These schedulers are important because they provide guaranteed access while still allowing unused capacity to be put to work. They also help achieve good utilization of resources by prioritizing jobs efficiently within the queues. A fair scheduler is usually the right choice when, within a single organization, the number and types of clusters vary.
     
  • Fair schedulers can also be used to distribute pool capacity unevenly among jobs, and they do so in a simpler, more convenient way.
     
  • Capacity schedulers are useful when we are more concerned with queues than with pools. Each queue can be configured with its own map and reduce slots and is assigned a guaranteed share of the Hadoop cluster's capacity.

Limitations of Job Scheduling in MapReduce

In Hadoop, all storage is handled by HDFS. When a client submits a MapReduce job, the master sends the MapReduce code to the worker nodes, i.e., to the nodes where the data for the job actually resides (block locations are tracked by the NameNode). Because the data sets are large, cross-switch network traffic was a common problem in Hadoop. The concept of data locality was introduced to handle this problem.

Data locality moves the computation close to the node where the actual data resides. This not only increases throughput but also decreases network traffic.
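A locality-aware placement decision can be sketched as below. This is a simplified illustration that only distinguishes "data is on this node" from "data must be fetched over the network"; it ignores rack-level locality, and `pick_node` and the node names are hypothetical.

```python
def pick_node(block_replicas, free_nodes):
    """Prefer a free node that already stores the block; else any free node."""
    for node in free_nodes:
        if node in block_replicas:
            return node, "node-local"    # computation runs where the data is
    if free_nodes:
        return free_nodes[0], "remote"   # data must travel over the network
    return None, "wait"                  # no capacity available right now

choice = pick_node({"node2", "node5"}, ["node1", "node2", "node3"])
print(choice)  # ('node2', 'node-local')
```

Choosing node2 avoids copying the block across the network, which is the traffic reduction the text describes.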

Frequently Asked Questions

What is job scheduling?

Job scheduling is the method in which different tasks are executed at a pre-determined time or whenever the right event happens. It is a system that can be merged with other software systems for the purpose of executing other software components when the scheduled time arrives.

What are the types of job scheduling in MapReduce?

There are three types of job scheduling in MapReduce: Capacity Scheduler, Fair Scheduler, and FIFO Scheduler. All these Schedulers are a kind of algorithm that is used to schedule tasks in a Hadoop cluster when we receive requests from multiple clients.

What is a Job Queue?

A Job queue is a collection of multiple tasks that we have received from multiple clients. The tasks are present in the queue and we are required to schedule these tasks on the basis of the requirements.

Conclusion

In this article, we have learned about job scheduling in MapReduce in detail. We discussed the types of job scheduling in MapReduce and saw when to use each of them. I hope you liked this article on job scheduling in MapReduce. To better understand the topic, you can refer to MapReduce fundamentals, the synchronization of tasks in MapReduce, and Hadoop MapReduce.

For more information, refer to our Guided Path on Coding Ninjas Studio to upskill yourself in Python, Data Structures and Algorithms, Competitive Programming, System Design, and many more!

Head over to our practice platform, Coding Ninjas Studio, to practice top problems, attempt mock tests, read interview experiences and interview bundles, follow guided paths for placement preparations, and much more! 

Happy Learning Ninja!
