Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1. Introduction
2. Scheduling
2.1. FIFO Scheduler
2.1.1. Advantages
2.1.2. Disadvantages
2.2. Capacity Scheduler
2.2.1. Advantages
2.2.2. Disadvantages
2.3. Fair Scheduler
2.3.1. Advantages
2.3.2. Disadvantages
3. Synchronization
4. Colocation of Code and Data
5. Fault/Error handling
5.1. What happens if there is a failure?
6. FAQs
6.1. Why is MapReduce used so often?
6.2. What should we do first, mapping or reducing?
6.3. Are there always Errors/Faults in the MapReduce framework?
7. Conclusion
Last Updated: Mar 27, 2024

Foundational behaviors of MapReduce


Introduction

MapReduce is a programming model for building Big Data applications that run in parallel across multiple nodes. It is an analytical framework for processing large amounts of detailed data. To understand why things work the way they do, we need to understand some characteristics of the execution framework. This understanding will help us design better applications and optimize their execution for efficiency.

There are four foundational behaviors of MapReduce that shape how the execution framework runs our programs:

  1. Scheduling
  2. Synchronization
  3. Colocation of code and data
  4. Fault/Error handling
     

Let’s go through each of them one by one:

Scheduling

As we know, there are two portions in MapReduce. One is the “map,” and the other is the “reduce." We divide both parts of the application into distinct tasks. The mapping must be complete before reducing can begin. These tasks are scheduled based on the number of nodes in the cluster. If there are more map tasks than nodes, the execution framework runs them in waves until all are finished. The reduce tasks behave in the same way. The job is complete only when all of the reduce tasks have finished successfully.

There are mainly three types of schedulers in Hadoop:

  1. FIFO (First In First Out) Scheduler.
  2. Capacity Scheduler.
  3. Fair Scheduler.
     

These schedulers are algorithms that decide how the cluster serves job requests arriving from multiple users.
 

FIFO Scheduler

As the name implies, FIFO stands for First In First Out, which means that the tasks or applications that arrive first are served first. Jobs are placed in a queue and completed in the order they are submitted. No intervention is permitted once a job has been scheduled this way. Because job priority is ignored, a high-priority job may have to wait a long time behind earlier submissions.
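As a toy illustration (not Hadoop's actual implementation; the class and job names are invented), the strict submission-order behavior can be sketched in Python:

```python
from collections import deque

# Toy model of a FIFO scheduler: jobs run strictly in submission order,
# regardless of priority or size.
class FifoScheduler:
    def __init__(self):
        self.queue = deque()

    def submit(self, job):
        self.queue.append(job)

    def run_all(self):
        order = []
        while self.queue:
            order.append(self.queue.popleft())  # earliest submission runs first
        return order

sched = FifoScheduler()
for job in ["big-etl", "small-report", "urgent-fix"]:
    sched.submit(job)
print(sched.run_all())  # ['big-etl', 'small-report', 'urgent-fix']
```

Note that "urgent-fix" still runs last: nothing about its urgency changes its place in the queue.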

Advantages

  • It's easy to use and doesn't require any configuration.
  • Jobs are completed in the order in which they were submitted.

Disadvantages

  • It is not ideal for shared clusters. If a large application runs ahead of smaller ones, it consumes all of the cluster's resources while the smaller applications wait their turn. This results in starvation.
  • It does not account for the resource allocation balance between long and short applications.

Capacity Scheduler

In the Capacity Scheduler, we have multiple job queues for scheduling our tasks, which lets multiple tenants share a large Hadoop cluster. Each job queue is assigned some slots or cluster resources for completing its job operations. If only one queue has tasks to run, those tasks may borrow the free slots of other queues; when new tasks arrive in another queue, the borrowed slots are reclaimed so that queue's own tasks can run.
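A minimal sketch of this guaranteed-share-plus-borrowing idea, assuming made-up queue names, slot counts, and a much-simplified allocation rule:

```python
# Toy model of Capacity-style sharing: each queue has a guaranteed share of
# slots, and an active queue may borrow slots that other queues leave idle.
def allocate(total_slots, guarantees, demand):
    """guarantees: dict queue -> fraction; demand: dict queue -> slots wanted."""
    # first, give each queue up to its guaranteed share
    alloc = {q: min(demand[q], int(total_slots * guarantees[q])) for q in demand}
    spare = total_slots - sum(alloc.values())
    # then, lend the spare slots to queues that still want more
    for q in demand:
        extra = min(spare, demand[q] - alloc[q])
        alloc[q] += extra
        spare -= extra
    return alloc

# queue "a" is idle, so queue "b" borrows a's guaranteed slots
print(allocate(100, {"a": 0.5, "b": 0.5}, {"a": 0, "b": 90}))
# {'a': 0, 'b': 90}
```

In the real Capacity Scheduler, reclaiming borrowed slots when the idle queue becomes active again involves waiting for (or preempting) running containers; this sketch only shows the steady-state shares.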


Advantages

  • It maximizes the Hadoop cluster's resource use and throughput.
  • Provides groups or organizations with cost-effective elasticity.
  • It also provides capacity guarantees and safeguards to the cluster-using organization.

Disadvantages

  • It is the most complex of the three schedulers.

Fair Scheduler

The Fair Scheduler is highly similar to the Capacity Scheduler, but it takes job priority into consideration. Resources are allocated so that, over time, each application in the cluster receives an equal share. The Fair Scheduler makes scheduling decisions based on memory by default, but it can also be configured to schedule on CPU.
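The weighted fair-share idea can be sketched as follows; the application names, weights, and memory figures are invented for illustration:

```python
# Toy fair-share computation: resources are split in proportion to each
# application's weight (its priority), as the Fair Scheduler does with
# app priorities used as weights.
def fair_shares(total_memory, weights):
    total_weight = sum(weights.values())
    return {app: total_memory * w / total_weight for app, w in weights.items()}

# "critical" has double weight, so it gets double the share of the others
print(fair_shares(120, {"etl": 1, "adhoc": 1, "critical": 2}))
# {'etl': 30.0, 'adhoc': 30.0, 'critical': 60.0}
```

With equal weights this reduces to a plain even split, which is the scheduler's default behavior.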

Advantages

  • It provides a suitable solution for a large number of users to share the Hadoop Cluster.
  • In addition, the FairScheduler may interact with app priorities, which are utilized as weights in calculating what fraction of total resources each program should receive.

Disadvantages

  • It has to be configured.
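For reference, the scheduler in use is typically selected in YARN's configuration; switching the ResourceManager to the Fair Scheduler is commonly done with a property along these lines in yarn-site.xml (check the exact class name against your Hadoop version):

```xml
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
```

The Fair Scheduler then reads its queue definitions and weights from a separate allocation file.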

Synchronization

When numerous processes run in a cluster simultaneously, we need a mechanism to keep everything in sync, and the framework's synchronization mechanisms provide this automatically. The execution framework keeps track of what has been executed and when during the map and reduce phases. The reduce phase begins only once all of the map tasks are complete.


Using a method known as "shuffle and sort," intermediate data is copied over the network as it is created. This gathers and prepares all of the data that has been mapped for reduction. 

Synchronization, in general, refers to the methods that allow multiple concurrently operating processes to "join up," for example, to share intermediate results or exchange state information. In MapReduce, synchronization is achieved by a barrier between the map and reduce processing phases. Intermediate key-value pairs must be grouped by key, which is done via a massively distributed sort involving all the nodes that performed map tasks and all the nodes that will perform reduce tasks. Because this necessitates copying intermediate data over the network, the process is usually referred to as "shuffle and sort." Because each mapper may have intermediate output traveling to each reducer, a MapReduce job with m mappers and r reducers can involve up to m*r separate copy operations.
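The map / shuffle-and-sort / reduce barrier can be shown with a minimal in-process sketch (a single-machine word count, not a distributed implementation; function names are ours):

```python
from collections import defaultdict

# Minimal sketch of map -> shuffle-and-sort -> reduce on one machine.
def map_phase(records):
    for line in records:
        for word in line.split():
            yield (word, 1)          # emit intermediate key-value pairs

def shuffle(pairs):
    groups = defaultdict(list)       # stands in for the distributed sort:
    for key, value in pairs:         # all values for a key land together
        groups[key].append(value)
    return sorted(groups.items())    # grouped and sorted by key

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups}

data = ["to be or not to be"]
print(reduce_phase(shuffle(map_phase(data))))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

The `shuffle` step is the barrier: no reducer sees any input until every mapper's output has been grouped by key.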

On the other hand, the reducer in MapReduce receives all the values associated with the same key at the same time.

Colocation of Code and Data

The most efficient processing occurs when the mapping functions (code) run on the same machine as the data to be processed. The scheduler is intelligent enough to place the code on the same node as its data (or vice versa) before execution.

The term "data distribution" is misleading because one of the main goals of MapReduce is to move code rather than data. However, the more significant problem remains: for computation to take place, data must somehow be fed into the code. In MapReduce, this problem is intertwined with scheduling and relies heavily on the design of the underlying distributed file system. The scheduler achieves data locality by starting a task on the node that holds the data block the task requires (i.e., on its local disk), so the code is moved to the data. If this isn't possible (for example, because a node is overloaded), the task is started elsewhere and the data is moved over the network. Because inter-rack bandwidth is substantially lower than intra-rack bandwidth, it is important to prefer a node on the same rack of the data center as the node containing the necessary data block.
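The node-local / rack-local / off-rack preference order described above can be sketched as follows; the node names, rack topology, and function are invented for illustration:

```python
# Toy locality-aware placement: prefer the node holding the data block,
# then any free node on the same rack, then any free node at all.
def place_task(block_nodes, free_nodes, rack_of):
    for node in block_nodes:              # node-local: code moves to the data
        if node in free_nodes:
            return node, "node-local"
    racks = {rack_of[n] for n in block_nodes}
    for node in free_nodes:               # rack-local: avoids the slow
        if rack_of[node] in racks:        # inter-rack links
            return node, "rack-local"
    return free_nodes[0], "off-rack"      # last resort: copy across racks

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
# the block lives on n1, which is busy; n2 shares n1's rack
print(place_task(block_nodes=["n1"], free_nodes=["n2", "n3"], rack_of=rack_of))
# ('n2', 'rack-local')
```

Real schedulers also weigh load, fairness, and replica placement, but the fallback order is the same: local disk, then same rack, then anywhere.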

Fault/Error handling

The MapReduce execution framework must complete all of the above tasks in an environment where failures and faults are the norm, not the exception. MapReduce was designed around low-cost commodity servers, so the runtime must be highly resilient. Disk failures are common in large clusters, and RAM experiences more errors than one might think. Data centers suffer both scheduled disruptions (e.g., system maintenance and hardware upgrades) and unplanned ones (e.g., power failure, connectivity loss). And that's just the hardware. On the software side, exceptions must be properly trapped, logged, and recovered from. Furthermore, any sufficiently huge dataset will contain corrupted data or records malformed beyond a programmer's imagination, causing errors that no one would think to check for or trap. The MapReduce execution framework must thrive in this hostile environment.

What happens if there is a failure? 

Nothing, hopefully. The error handling and fault tolerance of most MapReduce engines are excellent. In a MapReduce cluster with many nodes and many components per node, something will eventually break, and the engine needs to figure out what it is and fix it. If some of the map tasks do not report back as complete, the engine may assign those jobs to a different node to finish them. The engine is programmed to recognize when a task is incomplete and immediately transfer it to another node.
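A minimal sketch of this reassignment behavior, with a simulated node failure (the node names, task name, and failure condition are all invented):

```python
# Toy retry loop: a task that fails on one node is reassigned to
# another node until it completes, mirroring how the engine reruns
# incomplete tasks elsewhere.
def run_with_retry(task, nodes, attempt_on):
    for node in nodes:
        try:
            return attempt_on(node, task)  # success: the result comes back
        except RuntimeError:
            continue                       # node failed: reassign to the next
    raise RuntimeError(f"{task} failed on every node")

def flaky(node, task):
    if node == "bad-node":
        raise RuntimeError("disk failure")  # simulated hardware fault
    return f"{task} done on {node}"

print(run_with_retry("map-007", ["bad-node", "good-node"], flaky))
# map-007 done on good-node
```

Real engines add timeouts (a hung node never raises an exception, so the master must declare it dead) and speculative execution, but the core idea is the same: a failed attempt is simply rerun elsewhere.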

FAQs

Why is MapReduce used so often?

MapReduce is one of the most straightforward programming models for creating Big Data applications. MapReduce is an analytical framework for evaluating large amounts of detailed data with ease, which is why it is often used.

What should we do first, mapping or reducing?

Mapping must be completed first; reducing begins only after all of the map tasks have finished.

Are there always Errors/Faults in the MapReduce framework?

There is a strong possibility that you’ll find errors while working on the framework, but the error/fault handling technique is excellent in the MapReduce framework. 

Conclusion

In this article, we have extensively discussed the foundational behaviors of MapReduce. We hope this blog has helped you enhance your knowledge of the aspects to keep in mind while working with the MapReduce framework; here is an article on HBase features that carries MapReduce jobs. If you would like to learn more, check out our articles here. You can also check the introduction to Hadoop and its ecosystem here, and the difference between Spark and Hadoop here. If you want to explore and learn big data, make sure to check out this. Do upvote our blog to help other ninjas grow.

Learning never stops, and to feed your quest to learn and become more skilled, head over to our practice platform Coding Ninjas Studio to practice top problems. You can check SQL problems here, attempt mock tests, read interview experiences, and check our guided path for the coding interview and much more!

Happy Learning!
