Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1. Introduction
2. YARN
  2.1. Architecture of YARN
  2.2. Application Running Process in YARN
  2.3. Advantages of YARN
  2.4. Limitations of YARN
3. MapReduce
  3.1. Architecture of MapReduce
    3.1.1. Map Phase
    3.1.2. Reduce Phase
  3.2. Advantages of MapReduce
  3.3. Limitations of MapReduce
4. YARN vs MapReduce
5. Which one to use?
6. Frequently Asked Questions
  6.1. What does YARN stand for?
  6.2. What is the purpose of the YARN container?
  6.3. What is the default scheduler used by YARN?
  6.4. What is the maximum number of reducers to use in a MapReduce job?
  6.5. What is a YARN node manager?
7. Conclusion
Last Updated: Mar 27, 2024

YARN vs MapReduce

Author Rishabh

Introduction

YARN is a resource management framework in Hadoop that supports various workloads. MapReduce is a framework for processing large datasets in parallel across multiple nodes of a Hadoop cluster. In this article, we will learn about both frameworks in detail and understand the difference between YARN and MapReduce.

YARN

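YARN (Yet Another Resource Negotiator) is a resource management framework introduced in Hadoop 2.0 to take over the resource management and job scheduling duties of the MapReduce JobTracker. YARN does not store data itself. We use it to schedule and run the applications that process large amounts of data in the cluster.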

YARN is a platform for managing and allocating resources such as CPU and memory across multiple applications on a Hadoop cluster. 

It shares resources across multiple users and applications. It supports various workloads, such as batch and interactive processing. YARN consists of the ResourceManager, the NodeManagers, and an ApplicationMaster for each distributed application.

Architecture of YARN

Figure: YARN architecture
  • YARN, which stands for Yet Another Resource Negotiator, is the resource management and scheduling framework in Hadoop.
     
  • It has a ResourceManager, a NodeManager on each node, and an ApplicationMaster for each application. The ResourceManager acts as the central coordinator for managing and allocating resources in the cluster.
     
  • NodeManagers manage the resources on individual nodes in the cluster. They monitor resource usage, such as CPU and memory, on their node.
     
  • The ApplicationMaster coordinates the execution of tasks within a single application.
     
  • YARN supports various types of workloads, making it a key component in large-scale distributed data processing systems.
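
The division of roles described above can be sketched as a toy model. This is only an illustration under simplifying assumptions: the class names mirror the YARN components, but the fields, the method names (`allocate`, `request_containers`), and the first-fit placement rule are hypothetical and are not YARN's actual API or scheduler.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NodeManager:
    """Tracks the free resources of one worker node (toy model)."""
    name: str
    free_memory_mb: int
    free_vcores: int


@dataclass
class Container:
    """A slice of one node's resources granted to an application."""
    node: str
    memory_mb: int
    vcores: int


@dataclass
class ResourceManager:
    """Central coordinator: knows every node and hands out containers."""
    nodes: List[NodeManager] = field(default_factory=list)

    def allocate(self, memory_mb: int, vcores: int) -> Optional[Container]:
        # First-fit placement is an assumption of this sketch, not YARN's scheduler.
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                node.free_memory_mb -= memory_mb
                node.free_vcores -= vcores
                return Container(node.name, memory_mb, vcores)
        return None  # no node can currently satisfy the request


@dataclass
class ApplicationMaster:
    """Per-application coordinator that asks the ResourceManager for containers."""
    app_id: str
    containers: List[Container] = field(default_factory=list)

    def request_containers(self, rm: ResourceManager, count: int,
                           memory_mb: int, vcores: int) -> int:
        for _ in range(count):
            container = rm.allocate(memory_mb, vcores)
            if container is None:
                break  # cluster is full; stop asking
            self.containers.append(container)
        return len(self.containers)


rm = ResourceManager([NodeManager("node-1", 4096, 4), NodeManager("node-2", 2048, 2)])
am = ApplicationMaster("app-0001")
granted = am.request_containers(rm, count=4, memory_mb=2048, vcores=2)
print(granted, [c.node for c in am.containers])
```

Here the application asks for four containers but the two nodes only have room for three, so the fourth request is not granted.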

Application Running Process in YARN 

  • The client submits the application to YARN.
     
  • YARN allocates resources from the Hadoop cluster, such as CPU and memory, to the application.
     
  • YARN launches containers on the selected cluster nodes. The application runs within these containers.
     
  • YARN ensures that the application has the required resources and monitors the application's progress.
     
  • After the application finishes, YARN releases the resources the application used.
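
The steps above can be traced with a small simulation. It is a minimal sketch, assuming a cluster described only by free memory per node; the function name `run_application` and the event strings are hypothetical, and a real YARN cluster would typically keep an application waiting rather than reject it when resources are short.

```python
def run_application(free_memory_mb, app_name, containers_needed, memory_per_container_mb):
    """Walk one application through submit, allocate, launch, monitor, clean up.

    free_memory_mb maps node name -> free memory in MB and is updated in place.
    Returns the list of lifecycle events in order.
    """
    events = [f"submitted {app_name}"]

    # Allocate: reserve memory on nodes that still have room.
    placed = []
    for node in sorted(free_memory_mb):
        while (len(placed) < containers_needed
               and free_memory_mb[node] >= memory_per_container_mb):
            free_memory_mb[node] -= memory_per_container_mb
            placed.append(node)

    if len(placed) < containers_needed:
        # Simplification: give back what was reserved and stop.
        for node in placed:
            free_memory_mb[node] += memory_per_container_mb
        events.append("not started: insufficient resources")
        return events
    events.append(f"allocated {len(placed)} containers")

    # Launch and monitor: in this sketch every container just reports in.
    for index, node in enumerate(placed):
        events.append(f"container {index} running on {node}")
    events.append("all containers finished")

    # Clean up: return the memory to the cluster.
    for node in placed:
        free_memory_mb[node] += memory_per_container_mb
    events.append("resources released")
    return events


cluster = {"node-1": 4096, "node-2": 2048}
log = run_application(cluster, "wordcount", containers_needed=3,
                      memory_per_container_mb=2048)
for event in log:
    print(event)
```

After the run, `cluster` holds the same free memory as before, which is the clean-up step in the list above.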

Advantages of YARN 

  • YARN provides a highly scalable framework for managing resources in the Hadoop cluster. It efficiently handles large-scale data workloads.
     
  • It supports various data processing frameworks that allow users to choose the most suitable framework according to their application requirements.
     
  • It provides a resource manager that ensures fair sharing of cluster resources among different workloads.
     
  • It provides a platform on which new processing technologies can be introduced without affecting the stability of the cluster.
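
To make the idea of fair sharing concrete, here is a minimal sketch of max-min fairness, a generic way of splitting a fixed capacity among competing demands. It is an illustration only and is not the implementation of YARN's Capacity Scheduler or Fair Scheduler; the function name `fair_shares` and the application names are hypothetical.

```python
def fair_shares(capacity, demands):
    """Max-min fair split of `capacity` among applications.

    demands maps application name -> requested amount (for example, containers).
    No application receives more than it asked for, and leftover capacity is
    split evenly among the applications that still want more.
    """
    shares = {app: 0.0 for app in demands}
    unsatisfied = set(demands)
    remaining = float(capacity)

    while unsatisfied and remaining > 1e-9:
        equal_slice = remaining / len(unsatisfied)
        for app in sorted(unsatisfied):
            grant = min(equal_slice, demands[app] - shares[app])
            shares[app] += grant
            remaining -= grant
        # Drop applications whose demand is now fully met.
        unsatisfied = {a for a in unsatisfied if demands[a] - shares[a] > 1e-9}
    return shares


shares = fair_shares(12, {"etl": 2, "reports": 5, "ml": 10})
print(shares)
```

With 12 units of capacity, the small `etl` request is met in full, and the remaining 10 units are divided evenly between `reports` and `ml`.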

Limitations of YARN

  • YARN runs additional daemons, the ResourceManager and the NodeManagers, which themselves consume memory.
     
  • It adds complexity compared to the original MapReduce framework, which can be challenging for first-time users.
     
  • The additional components of YARN require more computational resources, such as CPU, which can increase hardware and infrastructure costs.

MapReduce

MapReduce is a programming model and processing framework that has been part of Hadoop since Hadoop 1.0. It was the primary processing framework in Hadoop until the introduction of YARN in Hadoop 2.0. Since then, MapReduce runs as one of the applications on top of YARN.

We use MapReduce to process large datasets across multiple nodes in a Hadoop cluster. MapReduce breaks down a large dataset into smaller chunks, then processes them independently.

The “Map” phase involves processing each chunk of data independently to generate key-value pairs. The “Reduce” phase aggregates the key-value pairs to generate the final output.

Architecture of MapReduce

MapReduce simplifies large computations by splitting them into two steps: the map phase and the reduce phase.

Figure: MapReduce architecture

Map Phase 

  • In the map phase, we divide the data into smaller chunks and process them independently.
     
  • Each map task applies a specific computation or transformation to its chunk.
     
  • The output of the map phase is a collection of key-value pairs.
     
  • Once the map phase is over, the key-value pairs are grouped by key (the shuffle and sort step) to prepare for the reduce phase.
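
A word count is the usual way to illustrate the map phase. The sketch below is plain Python running in one process, not Hadoop code; the function names `map_chunk` and `group_by_key` and the sample chunks are assumptions made for the example.

```python
from collections import defaultdict


def map_chunk(chunk):
    """Map task: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def group_by_key(pairs):
    """Group values by key, as happens between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# Each string stands in for one chunk of a much larger dataset.
chunks = ["YARN runs MapReduce", "MapReduce runs map tasks"]

# Every chunk is mapped independently of the others.
pairs = [pair for chunk in chunks for pair in map_chunk(chunk)]
grouped = group_by_key(pairs)
print(grouped)
```

Because each call to `map_chunk` looks only at its own chunk, the calls could run on different machines without coordinating with each other.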
     

Reduce Phase

  • In the reduce phase, the grouped key-value pairs are processed by the reduce tasks in parallel.
     
  • Each reduce task aggregates the values that belong to a key, for example by summing them.
     
  • The output of the reduce phase is the final set of key-value pairs.
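
Continuing the word-count illustration, the reduce phase can be sketched as follows. This is a single-machine approximation that uses a thread pool to stand in for parallel reduce tasks; the function name `reduce_task` and the sample input are assumptions, and the input is written out by hand so the example stands on its own.

```python
from concurrent.futures import ThreadPoolExecutor


def reduce_task(item):
    """Reduce task: collapse all values of one key into a single count."""
    key, values = item
    return key, sum(values)


# Key -> list of values, as produced by grouping the map output by key.
grouped = {"yarn": [1], "runs": [1, 1], "mapreduce": [1, 1], "map": [1], "tasks": [1]}

# Each key is reduced independently, so the work can be spread over workers.
with ThreadPoolExecutor(max_workers=2) as pool:
    counts = dict(pool.map(reduce_task, grouped.items()))

print(counts)
```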

Advantages of MapReduce

  • MapReduce enables large-scale data processing by distributing work across multiple machines in a cluster.
     
  • It tolerates node failures, and it is applicable to a wide range of data processing tasks, including batch processing.
     
  • It provides cost-efficient distributed processing without the need for specialized infrastructure.

Limitations of MapReduce  

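  • MapReduce is not suitable for processing real-time data. We generally use it for batch processing of large datasets.
     
  • MapReduce can be complex and difficult to program for workloads that do not fit the map and reduce model, such as iterative algorithms.
     
  • MapReduce has a long job completion time for small and medium-sized datasets, because the fixed cost of starting and scheduling a job is high relative to the work done.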
     

YARN vs MapReduce

Now we will look at the difference between YARN and MapReduce in more detail.

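YARN | MapReduce
YARN is a resource management framework in Hadoop. | MapReduce is used for processing large datasets in parallel across multiple nodes.
It supports various types of workloads. | It is primarily for batch processing of large datasets.
It is more flexible, since different processing frameworks can run on it. | It is restricted to the map and reduce model, which can be complex and difficult to program.
It is highly scalable. | In Hadoop 1.0, its scalability was limited because a single JobTracker handled both resource management and job tracking.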

Which one to use?

  • YARN is used when we need a resource management framework that can handle a wide range of applications. It can run different workloads, such as batch processing and real-time stream processing.
     
  • YARN offers advanced features such as dynamic resource allocation. It can share resources fairly between applications.
     
  • MapReduce is particularly useful for large-scale batch processing tasks, where massive amounts of data are processed and low latency is not required.
     
  • MapReduce manages task failures by reassigning tasks to other available nodes, minimizing the impact of failures.
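
The reassignment of failed tasks can be sketched as a simple retry loop. This is an illustration under stated assumptions, not Hadoop's scheduler: the function name `run_with_retries`, the `failing_nodes` set that simulates failures, and the attempt limit of 4 are all choices made for this example.

```python
def run_with_retries(task_id, nodes, failing_nodes, max_attempts=4):
    """Try a task on successive nodes until one succeeds.

    nodes is the ordered list of candidate nodes; failing_nodes simulates
    nodes on which the task attempt fails.
    Returns (node_that_succeeded, attempts_used).
    """
    attempts = 0
    for node in nodes:
        if attempts == max_attempts:
            break
        attempts += 1
        if node not in failing_nodes:
            return node, attempts  # this attempt succeeded
        # Otherwise the attempt failed; fall through to the next node.
    raise RuntimeError(f"task {task_id} failed after {attempts} attempts")


node, attempts = run_with_retries(
    "map-0", ["node-1", "node-2", "node-3"], failing_nodes={"node-1"}
)
print(node, attempts)
```

The first attempt lands on the failing node, so the task is rerun on the next node and the job as a whole still completes.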

Frequently Asked Questions 

What does YARN stand for?

YARN stands for Yet Another Resource Negotiator.

What is the purpose of the YARN container?

A YARN container is a set of resources, such as memory and CPU, on a single node, allocated for running a task of a specific application.

What is the default scheduler used by YARN?

The default scheduler used by YARN is the Capacity Scheduler.

What is the maximum number of reducers to use in a MapReduce job?

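There is no fixed maximum. The number of reducers is configured per job, for example through the mapreduce.job.reduces property, and in practice it is chosen relative to the number of cluster nodes and the number of containers each node can run.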
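
As a rough sizing aid, the Hadoop MapReduce tutorial suggests 0.95 or 1.75 multiplied by (number of nodes × maximum containers per node). The helper below just evaluates that rule of thumb; the function name `suggested_reducers` and the example cluster size are hypothetical, and the result is a starting point rather than a limit.

```python
def suggested_reducers(nodes, max_containers_per_node, factor=0.95):
    """Rule-of-thumb reducer count: factor * nodes * containers per node.

    factor=0.95 lets all reducers start at once as maps finish;
    factor=1.75 gives faster nodes a second wave of reducers.
    """
    return round(factor * (nodes * max_containers_per_node))


single_wave = suggested_reducers(nodes=10, max_containers_per_node=8)
two_waves = suggested_reducers(nodes=10, max_containers_per_node=8, factor=1.75)
print(single_wave, two_waves)
```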

What is a YARN node manager?

The YARN NodeManager is the agent that runs on each node. It launches containers and manages and monitors resources such as memory and CPU on that node.

Conclusion

In this article, we discussed the difference between YARN and MapReduce. You can also read the article on the difference between NPM and YARN.

To learn more about DSA, competitive coding, and many more knowledgeable topics, please look into the guided paths on Coding Ninjas Studio. Also, you can enroll in our courses and check out the mock test and problems available. Please check out our interview experiences and interview bundle for placement preparations.

Happy Coding!