Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1. Introduction
2. What is YARN Architecture?
3. Hadoop YARN Architecture and Components
3.1. Resource Manager
3.2. Node Manager
3.3. Containers
3.4. Application Master
3.5. Resource Manager
4. Key Functions of the Resource Manager
5. Node Manager
5.1. Key Responsibilities of the Node Manager
6. Containers
6.1. Key Aspects of Containers in YARN
7. Application Master
8. Main Functions of the Application Master
9. YARN Architecture Features
9.1. Scalability
9.2. Multi-tenancy
9.3. Resource Allocation
9.4. Fault Tolerance
9.5. Flexibility
9.6. Security
9.7. Monitoring
10. Application Workflow in Hadoop YARN
10.1. Application Submission
10.2. Application Acceptance
10.3. Resource Negotiation
10.4. Container Allocation
10.5. Task Execution
10.6. Task Monitoring
10.7. Application Completion
11. Frequently Asked Questions
11.1. What is the role of the Resource Manager in YARN?
11.2. How does YARN improve resource utilization in Hadoop?
11.3. Can existing MapReduce jobs run on YARN?
12. Conclusion
Last Updated: May 6, 2024

YARN Architecture

Author Ravi Khorwal

Introduction

YARN (Yet Another Resource Negotiator) is the resource management layer of Apache Hadoop. It manages and schedules resources across the cluster. YARN allows multiple data processing engines, such as batch processing and real-time streaming, to work on data stored in a single platform, which makes Hadoop more efficient and scalable.

In this article, we will explore the various components of YARN architecture, understand its operation, and discuss its features and application workflow.

What is YARN Architecture?

YARN (Yet Another Resource Negotiator) is a key component of Apache Hadoop, which is designed to manage computing resources in clusters and utilize them effectively. YARN provides a platform to execute and manage processing activities on large data sets. It consists of a master daemon known as the Resource Manager, node-specific agents called Node Managers, and containers where specific tasks are executed.

YARN improves upon the original Hadoop design, in which a single JobTracker handled both resource management and job scheduling and monitoring, by splitting these roles into separate components. This separation increases efficiency and allows for more flexible data processing operations. By handling cluster resource management, YARN enables Hadoop to support more varied processing approaches and a broader range of applications, making it a versatile tool for big data solutions.

The architecture of YARN enhances scalability and cluster utilization by allowing multiple applications to run simultaneously on Hadoop. Each application has its own application master which negotiates resources from the Resource Manager and works in conjunction with the Node Manager to execute and monitor the tasks.

Hadoop YARN Architecture and Components

YARN architecture is built to improve the resource management capabilities of Hadoop by introducing several key components. Each plays a specific role in managing the cluster resources and processing tasks efficiently. Here are the main components of YARN:

Resource Manager

This is the heart of the YARN architecture. It manages the use of resources across the cluster. The Resource Manager has two main components:

  • Scheduler: The Scheduler is responsible for allocating resources to the running applications based on their resource requirements, expressed through the abstract notion of a container that bundles resources such as memory and CPU. It doesn't track or monitor the status of applications, which keeps it simple and scalable.
     
  • Application Manager: This component (named ApplicationsManager in the Hadoop documentation) manages the application lifecycle from start to finish. It accepts job submissions, negotiates the first container for executing the application-specific Application Master, and restarts the Application Master container on failure.

Node Manager

A Node Manager is a per-machine agent responsible for the containers on its node. It manages the container life cycle, monitors container resource usage (CPU, memory, disk, network), and reports this usage to the Resource Manager.

Containers

Containers execute specific tasks. When an application is launched, the Application Master negotiates container resources with the Resource Manager and then asks the Node Manager on the chosen node to launch the containers; the Node Manager monitors them while they run. Containers are the execution unit of YARN, holding resources such as memory, CPU, and disk that are necessary to execute a task.

Application Master

Each application has its own instance of an Application Master. It is responsible for negotiating appropriate resource containers from the Scheduler, tracking their status, and monitoring their progress. Application Masters have the task of coordinating with the Node Manager to execute and monitor tasks.

Resource Manager

The Resource Manager is a fundamental component of YARN that acts as the central authority managing resources and scheduling tasks across the cluster. Its primary function is to manage the computing resources in various nodes and allocate them to different running applications according to need.

Key Functions of the Resource Manager

  • Resource Allocation: It assesses the demands of various applications and allocates resources accordingly. This involves deciding which application gets what amount of resources (memory, CPUs) based on the application’s priority and other factors such as queue configuration.
     
  • Cluster Management: It keeps track of the health of the nodes in the cluster through the heartbeats that the Node Managers send. By monitoring the nodes, it knows which resources are available for applications and can react to node failures by no longer scheduling containers on the affected nodes.
     
  • Job Scheduling: The Resource Manager has a pluggable scheduler component, which means the algorithm that decides how resources are distributed among the various applications can be swapped out depending on the requirements of the cluster. Commonly used schedulers include the Capacity Scheduler and the Fair Scheduler, which ensure that resources are shared fairly among all the applications or according to specific rules.
     

The Resource Manager is designed for scalability and can manage a large number of resources spread across many nodes in a Hadoop cluster. It optimizes resource utilization and ensures that user applications run efficiently. It can also handle the loss of a Node Manager: the containers on that node are reported as lost, and the affected Application Masters can request replacement containers on other nodes.
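
To make the idea of a pluggable scheduler more concrete, here is a minimal Python sketch. It is a toy model and not YARN's scheduler code: the names `fifo_policy`, `fair_policy`, and `allocate` are hypothetical, only memory is considered, and the real Capacity Scheduler and Fair Scheduler also take queues, weights, and data locality into account. The sketch only shows that the allocation policy is a swappable function behind a single interface.

```python
def fifo_policy(capacity_mb, demands_mb):
    """Serve applications in submission order until memory runs out."""
    allocation = {}
    remaining = capacity_mb
    for app, demand in demands_mb.items():
        grant = min(demand, remaining)
        allocation[app] = grant
        remaining -= grant
    return allocation


def fair_policy(capacity_mb, demands_mb):
    """Max-min fairness: split memory evenly, never exceeding an app's demand."""
    allocation = {app: 0 for app in demands_mb}
    remaining = capacity_mb
    unsatisfied = {app for app, demand in demands_mb.items() if demand > 0}
    while unsatisfied and remaining > 0:
        share = remaining // len(unsatisfied)
        if share == 0:
            break
        # sorted() copies the set, so removing satisfied apps inside the loop is safe
        for app in sorted(unsatisfied):
            grant = min(share, demands_mb[app] - allocation[app])
            allocation[app] += grant
            remaining -= grant
            if allocation[app] == demands_mb[app]:
                unsatisfied.discard(app)
    return allocation


def allocate(policy, capacity_mb, demands_mb):
    """The 'pluggable' part: the caller chooses which policy function to use."""
    return policy(capacity_mb, demands_mb)


# Hypothetical applications and their memory demands in MB
demands = {"etl": 2000, "report": 6000, "ml": 8000}
print(allocate(fifo_policy, 12000, demands))
print(allocate(fair_policy, 12000, demands))
```

In a real cluster the scheduler is chosen through configuration (the `yarn.resourcemanager.scheduler.class` property in yarn-site.xml) rather than by passing a function.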

Node Manager

The Node Manager is the YARN component that operates at the node level in a Hadoop cluster. It is responsible for managing the application's execution environment, monitoring resource usage (CPU, memory, disk, network) on each node, and reporting this information back to the Resource Manager.

Key Responsibilities of the Node Manager

  • Container Management: One of the primary roles of the Node Manager is to manage containers. Containers are the execution units in YARN, and the Node Manager is in charge of starting, stopping, and managing them. Start requests come from Application Masters for containers that the Resource Manager has allocated.
     
  • Resource Monitoring: It continuously monitors the resource usage of each container and reports this data to the Resource Manager. This monitoring helps in maintaining an efficient allocation of resources and ensures that no single application over-utilizes resources to the detriment of others in the cluster.
     
  • Node Health Checks: The Node Manager also performs health checks on the node to ensure it is functioning properly. It checks for problems such as disk failures and, through an administrator-supplied health script, other system-level issues. If a problem is detected, the Node Manager reports the node as unhealthy to the Resource Manager, which stops assigning new containers to that node until it recovers.
     
  • Log Management: Handling logs generated by applications and system components is another critical function. The Node Manager collects and manages these logs, which are crucial for debugging and understanding application behavior.
     

The Node Manager's ability to manage tasks and resources effectively makes it a vital component of the YARN architecture, enabling Hadoop to run applications smoothly across all nodes in the cluster.
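
The responsibilities above can be sketched with a small Python model. This is an illustration only, not the Node Manager implementation: the class `ToyNodeManager`, its methods, and the fields of the heartbeat dictionary are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNodeManager:
    node_id: str
    total_memory_mb: int
    total_vcores: int
    # container_id -> (memory_mb, vcores); hypothetical bookkeeping structure
    containers: dict = field(default_factory=dict)

    def used(self):
        """Sum the resources held by all running containers."""
        memory = sum(m for m, _ in self.containers.values())
        vcores = sum(v for _, v in self.containers.values())
        return memory, vcores

    def start_container(self, container_id, memory_mb, vcores):
        """Start a container only if the node still has room for it."""
        used_memory, used_vcores = self.used()
        if (used_memory + memory_mb > self.total_memory_mb
                or used_vcores + vcores > self.total_vcores):
            return False
        self.containers[container_id] = (memory_mb, vcores)
        return True

    def stop_container(self, container_id):
        """Release the resources of a finished or killed container."""
        self.containers.pop(container_id, None)

    def heartbeat(self, healthy=True):
        """Build the status report that would be sent to the Resource Manager."""
        used_memory, used_vcores = self.used()
        return {
            "node_id": self.node_id,
            "healthy": healthy,
            "available_memory_mb": self.total_memory_mb - used_memory,
            "available_vcores": self.total_vcores - used_vcores,
            "running_containers": sorted(self.containers),
        }


nm = ToyNodeManager("node-1", total_memory_mb=8192, total_vcores=4)
nm.start_container("c1", 4096, 2)
nm.start_container("c2", 4096, 2)
print(nm.heartbeat())
```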

Containers

In the context of YARN (Yet Another Resource Negotiator), containers play a critical role in managing the execution of tasks. A container in YARN represents a collection of physical resources, such as memory, CPU, and disks, on a single node.

Key Aspects of Containers in YARN

  • Resource Isolation: Containers provide isolated environments for tasks. Each container is granted a fixed amount of resources. The Node Manager monitors memory usage and can kill a container that exceeds its allocation, while stricter CPU isolation depends on how the cluster is configured (for example, with Linux cgroups). This keeps one task from starving others and ensures fair resource usage across all tasks.
     
  • Task Execution: Each container runs specific tasks assigned to it by the Application Master. These tasks are pieces of a larger application, and their successful execution contributes to the application’s overall completion. A container belongs to a single application, but one node can host containers from different applications at the same time, which uses the node’s resources efficiently.
     
  • Flexibility and Scalability: Since containers abstract the resource usage details from the tasks, they allow applications to scale dynamically. More containers can be added or removed as needed without disrupting the overall process. This flexibility is crucial for handling varying workloads efficiently.
     
  • Lifecycle Management: The Node Manager handles the lifecycle of each container, from starting it with the specified resources to stopping it once the task is complete. This management includes monitoring the container’s resource usage and ensuring it does not exceed its allocation.
     

Containers are foundational to YARN’s ability to manage cluster resources dynamically and efficiently. They allow the system to maintain high levels of utilization and provide the flexibility needed to handle large-scale data processing tasks.
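
As a rough illustration of these aspects, the following Python sketch models a container as a record of resources on one node. The names `ToyContainer`, `containers_over_limit`, and `apps_on_node` are hypothetical, and the observed memory figures are made up for the example. In this sketch each container belongs to exactly one application, while a node can host containers from several applications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyContainer:
    container_id: str
    app_id: str       # the single application this container belongs to
    node: str         # the node that hosts the container
    memory_mb: int    # allocated memory
    vcores: int       # allocated virtual cores


def containers_over_limit(containers, observed_memory_mb):
    """Return the ids of containers whose observed memory exceeds their allocation."""
    return [c.container_id for c in containers
            if observed_memory_mb.get(c.container_id, 0) > c.memory_mb]


def apps_on_node(containers, node):
    """List the distinct applications that have containers on a node."""
    return sorted({c.app_id for c in containers if c.node == node})


containers = [
    ToyContainer("c1", "app_1", "node-1", memory_mb=2048, vcores=1),
    ToyContainer("c2", "app_2", "node-1", memory_mb=1024, vcores=1),
    ToyContainer("c3", "app_1", "node-2", memory_mb=1024, vcores=1),
]
observed = {"c1": 1500, "c2": 1300, "c3": 1024}  # made-up measurements in MB

print(containers_over_limit(containers, observed))
print(apps_on_node(containers, "node-1"))
```

A monitoring loop could use a check like `containers_over_limit` to decide which containers to stop, which is the lifecycle role described above.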

Application Master

The Application Master is a unique and pivotal component in the YARN architecture, tasked with managing the lifecycle of applications within the Hadoop ecosystem. It negotiates resources with the Resource Manager, monitors their use, and tracks the progress of application execution.

Main Functions of the Application Master

  • Resource Negotiation: The Application Master starts by requesting the necessary resources from the Resource Manager to execute the application tasks. It specifies the type and quantity of resources needed (such as memory and CPU cores) and continues to request additional resources as the application runs, based on its needs.
     
  • Task Monitoring: Once the containers are allocated and tasks are running, the Application Master monitors the progress of each task within these containers. It handles any failures by requesting new containers from the Resource Manager and restarting tasks as necessary.
     
  • Coordination of Task Execution: The Application Master coordinates the execution order of tasks, ensuring that they are carried out efficiently and correctly. This involves scheduling tasks based on their dependencies and the availability of data.
     
  • Shutdown and Cleanup: After all the tasks have been successfully completed, the Application Master shuts down, releasing all the resources it had been using. It also handles any necessary cleanup operations to ensure that the system is ready for the next application.
     

The Application Master thus serves as the brain of the application execution process, making dynamic decisions about resource allocation and task management. Its role is critical for the efficient operation and scaling of applications in a Hadoop environment.
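
The task monitoring and retry behaviour described above can be sketched as a simple loop. This is a toy model under stated assumptions: `run_application`, `request_container`, and `make_task` are hypothetical names, a failure is simulated by raising `RuntimeError`, and the limit of three attempts is an arbitrary choice for the example.

```python
import itertools


def run_application(tasks, request_container, max_attempts=3):
    """Run each task in a fresh container, retrying failed tasks in new containers."""
    results = {}
    attempts_used = {}
    for name, task in tasks.items():
        for attempt in range(1, max_attempts + 1):
            container_id = request_container()  # stands in for asking the Resource Manager
            try:
                results[name] = task(container_id)
                attempts_used[name] = attempt
                break
            except RuntimeError:
                continue  # container failed: loop again with a new container
        else:
            raise RuntimeError(f"task {name} failed after {max_attempts} attempts")
    return results, attempts_used


# Hypothetical helpers for the demonstration
container_ids = itertools.count(1)
pending_failures = {"map-1": 1}  # map-1 fails once before succeeding


def request_container():
    return f"container_{next(container_ids)}"


def make_task(name):
    def task(container_id):
        if pending_failures.get(name, 0) > 0:
            pending_failures[name] -= 1
            raise RuntimeError("simulated container failure")
        return f"{name} ok on {container_id}"
    return task


tasks = {"map-0": make_task("map-0"), "map-1": make_task("map-1")}
results, attempts_used = run_application(tasks, request_container)
print(results)
print(attempts_used)
```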

YARN Architecture Features

The YARN architecture has several key features that make it well-suited for managing resources in a Hadoop cluster. Some of the main features of YARN include:

Scalability

YARN is designed to scale to large clusters with thousands of nodes and tens of thousands of tasks. It achieves this scalability by separating cluster-wide resource management, handled by the Resource Manager, from per-application scheduling and monitoring, handled by each Application Master, so that each part can scale independently.

Multi-tenancy

YARN allows multiple applications to run simultaneously on the same Hadoop cluster, with each application having its own Application Master and set of containers. This multi-tenancy enables better utilization of cluster resources and allows different teams or users to share the same cluster.

Resource Allocation

YARN provides a flexible and dynamic resource allocation model that allows applications to request and release resources as needed. The Resource Manager allocates resources to applications based on their requirements and the available resources in the cluster, ensuring that each application has access to the resources it needs.

Fault Tolerance

YARN is designed to be fault-tolerant and can handle failures at various levels of the architecture. If a task fails, the Application Master can request a new container from the Resource Manager and restart the task. If an Application Master fails, the Resource Manager can restart it, possibly on a different node, up to a configurable number of attempts.

Flexibility

YARN provides a flexible platform for running different types of applications on Hadoop. It supports batch processing, interactive querying, real-time streaming, and iterative algorithms, among others. This flexibility enables organizations to use Hadoop for a wide range of use cases.

Security

YARN provides several security features to ensure that applications and data are protected. It integrates with Kerberos for authentication and supports access control lists (ACLs) for authorization. Hadoop can also encrypt data in transit, while encryption of data at rest is provided by HDFS rather than by YARN itself.

Monitoring

YARN provides a web-based UI for monitoring the status of applications, containers, and nodes in the cluster. It also provides REST APIs for programmatic access to monitoring data, enabling integration with external tools and systems.
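
As a small example of programmatic access, the Python sketch below builds a request URL for the Resource Manager's cluster applications endpoint (`/ws/v1/cluster/apps`) and summarizes a response. The host `rm.example.com:8088` is a placeholder, the helper names are hypothetical, and `SAMPLE_RESPONSE` is a hand-written sample containing only a few of the fields a real response carries. No network call is made here.

```python
import json
from urllib.parse import urlencode


def cluster_apps_url(rm_address, states=None):
    """Build the URL for listing applications, optionally filtered by state."""
    base = f"http://{rm_address}/ws/v1/cluster/apps"
    if states:
        return base + "?" + urlencode({"states": ",".join(states)})
    return base


def summarize_apps(response_text):
    """Extract (application id, state) pairs from a JSON response body."""
    payload = json.loads(response_text)
    apps = (payload.get("apps") or {}).get("app", [])
    return [(app["id"], app["state"]) for app in apps]


# Hand-written sample; a real response includes many more fields per application
SAMPLE_RESPONSE = json.dumps({
    "apps": {
        "app": [
            {"id": "application_1_0001", "name": "wordcount", "state": "RUNNING"},
            {"id": "application_1_0002", "name": "etl", "state": "FINISHED"},
        ]
    }
})

print(cluster_apps_url("rm.example.com:8088", ["RUNNING", "ACCEPTED"]))
print(summarize_apps(SAMPLE_RESPONSE))
```

To query a live cluster, the URL could be fetched with `urllib.request.urlopen` and the response body passed to `summarize_apps`.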

Application Workflow in Hadoop YARN

Now that we have explored the architecture and features of YARN, let's take a look at how an application is executed in a YARN environment. The application workflow in YARN involves the following steps:

Application Submission

The client submits an application to the YARN Resource Manager. The application submission includes the application jar file, the resource requirements for the application, and other configuration parameters.

Application Acceptance

The Resource Manager accepts the application submission, allocates a first container, and has a Node Manager launch the Application Master (AM) in it. The AM is responsible for negotiating resources from the Resource Manager and managing the application's execution.

Resource Negotiation

The Application Master negotiates with the Resource Manager for the resources required to run the application. The AM requests containers from the Resource Manager, specifying the resource requirements for each container (e.g. memory, CPU).
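
A container request is usually adjusted by the scheduler before it is granted. The Python sketch below shows one common form of this, rounding a memory request up to a multiple of a minimum allocation and rejecting requests above a maximum. The function name is hypothetical, and the values 1024 MB and 8192 MB are assumptions that mirror the commonly documented defaults of `yarn.scheduler.minimum-allocation-mb` and `yarn.scheduler.maximum-allocation-mb`; the exact rounding rules depend on the scheduler and Hadoop version in use.

```python
import math


def normalize_memory_request(requested_mb, minimum_mb=1024, maximum_mb=8192):
    """Round a memory request up to a multiple of the minimum allocation."""
    if requested_mb > maximum_mb:
        raise ValueError(
            f"requested {requested_mb} MB exceeds the maximum of {maximum_mb} MB"
        )
    # Every container gets at least one unit of the minimum allocation
    units = max(1, math.ceil(requested_mb / minimum_mb))
    return units * minimum_mb


for request in (100, 1024, 1500, 4097):
    print(request, "->", normalize_memory_request(request))
```

One practical consequence is that a request slightly above a multiple of the minimum can reserve noticeably more memory than was asked for.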

Container Allocation

The Resource Manager allocates containers to the Application Master based on the resource requirements and the available resources in the cluster. The containers are allocated on specific nodes in the cluster and are managed by the Node Managers on those nodes.

Task Execution

Once the containers are allocated, the Application Master asks the Node Managers to launch the application's tasks in the containers. The tasks are executed on the nodes where the containers are located and are managed by the Node Managers on those nodes.

Task Monitoring

The Application Master monitors the progress of the tasks and the status of the containers. If a task fails, the AM can request a new container from the Resource Manager and restart the task.

Application Completion

Once all the tasks are completed, the Application Master notifies the Resource Manager that the application has finished. The Resource Manager then releases the resources allocated to the application and removes the AM from the system.

Here's a simple diagram that illustrates the application workflow in YARN:

Client -> Resource Manager -> Node Manager
  |              |                 |
  |              V                 V
  |      Application Master -> Containers
  |                                |
  |                                V
  |                              Tasks
  V
Application Completion

This workflow enables YARN to efficiently manage resources and execute tasks in a Hadoop cluster. The separation of concerns between the Resource Manager, Application Master, and Node Managers allows YARN to scale to large clusters and handle a wide range of applications.
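
The steps of this workflow can be pictured as a small state machine. The Python sketch below is a simplified model, not YARN's internal implementation: `ToyApplication` and `run_workflow` are hypothetical names, and although the state names resemble the application states shown by YARN, the real state machine has more states and transitions than are modeled here.

```python
# Simplified transition table; the real system tracks additional intermediate states
ALLOWED_TRANSITIONS = {
    "NEW": {"SUBMITTED"},
    "SUBMITTED": {"ACCEPTED", "FAILED", "KILLED"},
    "ACCEPTED": {"RUNNING", "FAILED", "KILLED"},
    "RUNNING": {"FINISHED", "FAILED", "KILLED"},
    "FINISHED": set(),
    "FAILED": set(),
    "KILLED": set(),
}


class ToyApplication:
    def __init__(self, app_id):
        self.app_id = app_id
        self.state = "NEW"
        self.history = ["NEW"]

    def transition(self, new_state):
        """Move to a new state, rejecting transitions the table does not allow."""
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"{self.state} -> {new_state} is not allowed")
        self.state = new_state
        self.history.append(new_state)


def run_workflow(app, task_results):
    """Walk an application through the workflow; task_results holds one bool per task."""
    app.transition("SUBMITTED")  # the client submits the application
    app.transition("ACCEPTED")   # the Resource Manager accepts it
    app.transition("RUNNING")    # the Application Master runs tasks in containers
    app.transition("FINISHED" if all(task_results) else "FAILED")
    return app.history


print(run_workflow(ToyApplication("app_1"), [True, True, True]))
print(run_workflow(ToyApplication("app_2"), [True, False]))
```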

Frequently Asked Questions

What is the role of the Resource Manager in YARN?

The Resource Manager allocates computing resources in the cluster, accepts and tracks applications, and schedules containers on the available nodes. Scheduling the individual tasks of an application is left to that application's Application Master.

How does YARN improve resource utilization in Hadoop?

YARN enhances resource utilization by dynamically allocating resources based on real-time demands of applications, ensuring optimal use of the cluster and reducing resource wastage.

Can existing MapReduce jobs run on YARN?

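Yes, YARN is designed to be backward compatible with MapReduce, so existing MapReduce applications can generally run on YARN without code changes. In some cases, depending on which MapReduce API the application uses, it may need to be recompiled against the newer Hadoop libraries.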

Conclusion

In this article, we have learned about YARN architecture, its components, and the workflow of applications in a Hadoop environment. We discussed the roles of the Resource Manager, Node Manager, Application Master, and containers, and also talked about how YARN optimizes resource utilization and supports multiple processing frameworks. This understanding of YARN will help you efficiently manage and scale Hadoop applications in your computing environment.