Table of contents
1. Introduction
2. The Execution Framework
2.1. Scheduling
2.2. Synchronization
2.3. Code/data colocation
2.4. Fault/error handling
3. Frequently Asked Questions
3.1. What is shuffle and sort in MapReduce?
3.2. What is a scheduler in big data?
3.3. What is the need for MapReduce?
3.4. Which of the following phases co-occur in MapReduce?
4. Conclusion
Last Updated: Mar 27, 2024

Synchronization of Tasks in MapReduce

Author: Palak Mishra

Introduction

MapReduce is built on a master-slave architecture. A task assigned to the slave nodes is processed in two phases: map and reduce. The map phase generates intermediate results, which are fed into the reduce phase as input.

Reduce sorts the intermediate results by key before merging them into a single final output. Between the map and reduce operations, there is a synchronization step.

The synchronization phase is a data communication step between the mapper and reducer nodes to start the reduction process. 
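
To make these phases concrete, here is a minimal word-count sketch using Hadoop's Java MapReduce API (the classic introductory example, shown purely as an illustration): the map phase emits an intermediate (word, 1) pair for every token, and the reduce phase, which runs after the synchronization step, sums the values grouped under each key.

```java
// Minimal word count: the map phase emits intermediate (word, 1) pairs,
// and the reduce phase sums the values grouped under each key after the
// shuffle-and-sort (synchronization) step.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE); // emit an intermediate key-value pair
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) { // all values for this key, post-shuffle
        sum += count.get();
      }
      result.set(sum);
      context.write(word, result); // final output for this key
    }
  }
}
```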

The Execution Framework

As we know, MapReduce can run on multiple machines in a network and produce the same results as if all of the work were done on a single machine. It can also pull data from a variety of internal and external sources. MapReduce keeps track of its work by generating a unique key that ensures all processing is related to the same problem. 

This key is also used to bring all the output from the distributed tasks together at the end.

When the map and reduce functions are combined, they operate as a single job within the cluster. The MapReduce engine's execution framework handles all of the dividing and conquering, and all of the work is distributed to one or more nodes in the network. To better understand why things work the way they do, you must first understand some characteristics of the execution framework.

This can aid in developing better applications and optimizing their execution for performance and efficiency. MapReduce's foundational behaviors are as follows:

Scheduling

Each MapReduce job is broken down into smaller chunks known as tasks. A map task, for example, might be in charge of processing a specific block of input key-value pairs (known as an input split in Hadoop), while a reduce task might be in charge of a portion of the intermediate keyspace. It's not uncommon for MapReduce jobs to have tens of thousands of individual tasks that must be assigned to cluster nodes. 

On large jobs, the total number of tasks may exceed the number that can run on the cluster at the same time. The scheduler must therefore maintain a task queue and track the progress of running tasks so that waiting tasks can be assigned to nodes as they become available.

Another aspect of scheduling is coordinating tasks that belong to different jobs (e.g., jobs submitted by other users). Speculative execution (also known as "backup tasks") is an optimization implemented by both Hadoop and Google's MapReduce implementation. Because of the barrier between the map and reduce phases, the map phase of a job is only as fast as its slowest map task. Similarly, the time it takes to complete a job is limited by the time it takes to complete the slowest reduce task.

As a result, stragglers, or tasks that take an unusually long time to complete, can limit the speed of a MapReduce job. Flaky hardware is one source of stragglers: for example, a machine suffering recoverable errors may become significantly slower. With speculative execution, an identical copy of a straggling task is run on a different machine, and the framework uses the result of whichever attempt finishes first.

Zaharia et al. have discussed various speculative execution strategies, and Google claims that speculative execution can reduce job running times by 44 percent.

Although both map and reduce tasks can be executed speculatively in Hadoop, the consensus is that the technique is more beneficial for map tasks than for reduce tasks, because each speculative copy of a reduce task must pull its input data over the network again. However, another common source of stragglers is skew in the distribution of values associated with intermediate keys, and speculative execution cannot adequately address it: re-running the straggling task on another machine does not make its workload any smaller.
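
In Hadoop, speculative execution can be toggled per job. A minimal sketch, assuming the Hadoop 2+ property names (older releases used different ones): map-side speculation is left on while reduce-side speculation is disabled, reflecting the trade-off described above.

```java
// Toggling speculative execution per job (Hadoop 2+ property names).
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationConfig {
  public static Job newJob() throws IOException {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.speculative", true);     // allow backup map tasks
    conf.setBoolean("mapreduce.reduce.speculative", false); // avoid re-pulling shuffle data
    return Job.getInstance(conf, "speculation-demo");
  }
}
```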

Zipfian distributions are common in text processing, which means that the task or tasks responsible for processing the few most common elements will take much longer than the typical task. One solution to this problem is better local aggregation, as the sketch below shows.
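
One way to get that local aggregation in Hadoop is a combiner: a reducer-like function applied to map output on each map node before the shuffle, so the few very common keys travel as partial sums rather than as long lists of 1s. A sketch, reusing the illustrative WordCount classes from the earlier example (a combiner must be associative and commutative for this to be safe):

```java
// Registering a combiner so partial sums are computed map-side,
// shrinking the network traffic for heavily skewed (e.g., Zipfian) keys.
import org.apache.hadoop.mapreduce.Job;

public class LocalAggregation {
  public static void enableCombiner(Job job) {
    // The word-count reducer doubles as a combiner because integer
    // addition is associative and commutative.
    job.setCombinerClass(WordCount.IntSumReducer.class);
  }
}
```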


Synchronization

Synchronization, in general, refers to the mechanisms that allow multiple concurrently running processes to "join up," for example, to share intermediate results or exchange state information. In MapReduce, synchronization is achieved by a barrier between the map and reduce processing phases.

Intermediate key-value pairs must be grouped by key, which is done using a large distributed sort involving all nodes that perform map tasks and all of the nodes that will perform reduce tasks. Because this necessitates copying intermediate data over the network, the process is commonly referred to as "shuffle and sort."

Because each mapper may have intermediate output destined for every reducer, a MapReduce job with m mappers and r reducers can involve up to m × r distinct copy operations.
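
Which of those copy operations a given intermediate pair takes part in is decided by the partitioner, which maps each key to one of the r reducers. A minimal sketch that mirrors the behavior of Hadoop's default HashPartitioner (the class name here is illustrative):

```java
// Route each intermediate key to one of the r reducers, as Hadoop's
// default HashPartitioner does.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask the sign bit so the partition index is always non-negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```

A job would install it with job.setPartitionerClass(WordPartitioner.class).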

The execution framework cannot guarantee that all values associated with the same key have been gathered until all mappers have finished emitting key-value pairs and all intermediate key-value pairs have been shuffled and sorted. The reduce computation therefore cannot begin before that point.

The aggregation function g in a fold operation is a function of the intermediate value and the next item in the list, which means values can be generated lazily, and aggregation can start as soon as values are available. In contrast, the reducer in MapReduce receives all values associated with the same key at once. However, as soon as each mapper completes, its intermediate key-value pairs can be copied over the network to the nodes running the reducers; this is a standard optimization that Hadoop implements.
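
In Hadoop this early copying is tunable. A sketch, assuming the mapreduce.job.reduce.slowstart.completedmaps property from Hadoop 2+, which controls what fraction of map tasks must finish before reducers are launched to start fetching output (they still cannot run the reduce function until every map has completed):

```java
// Let reducers start shuffling once 80% of the map tasks have finished.
import org.apache.hadoop.conf.Configuration;

public class SlowstartConfig {
  public static void configure(Configuration conf) {
    conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f);
  }
}
```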
                                           


Code/data colocation

The term "data distribution" is misleading because one of the main goals of MapReduce is to move code rather than data. However, the more significant point remains: for computation to take place, data must be fed into the code somehow. This problem is inextricably linked to scheduling in MapReduce, and it is heavily reliant on the design of the underlying distributed file system. The scheduler achieves data locality by starting tasks on the node that has a specific block of data (i.e., on its local drive) that the job requires. This results in the code being moved to the data. 

If this isn't possible (for example, because a node is already overloaded), new tasks will be started elsewhere, and the required data will be streamed over the network. Because inter-rack bandwidth is significantly lower than intra-rack bandwidth, the scheduler then prefers nodes on the same rack of the data center as the node holding the relevant data block.
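
The locality information the scheduler works from comes from the input splits themselves: in Hadoop's Java API, every InputSplit reports the hosts that hold its data. A small illustrative sketch (the helper class is hypothetical; getLocations() is the real API):

```java
// Each InputSplit advertises the nodes holding its block; the scheduler
// tries to run the corresponding map task on (or near) one of them.
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;

public class LocalityInspector {
  public static void printPreferredHosts(List<InputSplit> splits)
      throws IOException, InterruptedException {
    for (InputSplit split : splits) {
      for (String host : split.getLocations()) {
        System.out.println(split + " prefers host " + host);
      }
    }
  }
}
```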

Fault/error handling

The MapReduce execution framework must accomplish all of the tasks above in an environment where errors and faults are the norm, not the exception. Because MapReduce was designed around low-cost commodity servers, the runtime must be highly resilient. In large clusters, disk failures are common, and RAM experiences more errors than one might expect. Datacenters suffer both planned outages (for example, system maintenance and hardware upgrades) and unplanned outages (e.g., power failure or connectivity loss).

And that's just the hardware. No software is bug-free: exceptions must be caught, logged, and recovered from. Large-data problems have a habit of revealing obscure corner cases in code that is otherwise believed to be correct.

Furthermore, any sufficiently large dataset will contain corrupted data or records beyond a programmer's imagination, producing errors one would never think to check for or trap. The MapReduce execution framework must thrive in this hostile environment.
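
Hadoop's answer to this failure model is bounded retry: a failed task attempt is re-run, usually on a different node, up to a configurable limit before the whole job is declared failed. A sketch, assuming the Hadoop 2+ property names:

```java
// Retry failed task attempts a bounded number of times (default is 4).
import org.apache.hadoop.conf.Configuration;

public class FaultToleranceConfig {
  public static void configure(Configuration conf) {
    conf.setInt("mapreduce.map.maxattempts", 4);    // attempts per map task
    conf.setInt("mapreduce.reduce.maxattempts", 4); // attempts per reduce task
  }
}
```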

Click on the following link to read further: Multitasking Operating System

Frequently Asked Questions

What is shuffle and sort in MapReduce?

The shuffle phase in Hadoop transfers the map output from the Mapper to a Reducer. The sort phase covers the merging and sorting of the map outputs. Data from the Mapper is grouped by key, partitioned among the reducers, and sorted by key, so that every reducer obtains all values associated with the same key.

What is a scheduler in big data?

Hadoop is a general-purpose system that enables high-performance processing of data over a set of distributed nodes. It is also a multitasking system that simultaneously processes multiple data sets for multiple jobs from multiple users. A scheduler is the component that decides which task runs where and when, allocating the cluster's resources among these competing jobs.

What is the need for MapReduce?

MapReduce is a method for quickly and efficiently processing large amounts of data, which requires sophisticated processing techniques. Google originally developed MapReduce to index its web pages, where it replaced the earlier indexing algorithms.

Which of the following phases co-occur in MapReduce?

While map outputs are being fetched, the shuffle and sort phases co-occur.

Conclusion

In this blog, we walked through how MapReduce works: the basic concept and the underlying execution framework are tightly integrated components of the MapReduce environment, and we discussed their benefits. We hope this blog has helped you enhance your knowledge of the synchronization of tasks in MapReduce. Learning never stops, so have a look at more related articles: Data Warehouse, MongoDB, AWS, and many more. To learn more, see Operating System, Unix File System, File System Routing, and File Input/Output.

Refer to our Guided Path on Coding Ninjas Studio to upskill yourself in Data Structures and Algorithms, Competitive Programming, JavaScript, System Design, and much more! If you want to test your competency in coding, you may check out the mock test series and participate in the contests hosted on Coding Ninjas Studio! But if you have just started your learning process and are looking for questions asked by tech giants like Amazon, Microsoft, Uber, etc., you must look at the problems, interview experiences, and interview bundle for placement preparations.

Do upvote our blogs if you find them helpful and engaging!

Happy coding!

Thank you for reading. 

 
