Table of contents
1. Introduction
2. What is Hadoop YARN?
3. Why is YARN in Hadoop Used?
4. YARN Architecture and its Components
  4.1. Client
  4.2. Resource Manager
  4.3. Node Manager
  4.4. Containers
  4.5. Application Master
5. YARN Architecture Features
  5.1. Scalability
  5.2. Multi-tenancy
  5.3. Resource Allocation
  5.4. Fault Tolerance
  5.5. Flexibility
  5.6. Security
  5.7. Monitoring
6. Application Workflow in Hadoop YARN
  6.1. Application Submission
  6.2. Application Acceptance
  6.3. Resource Negotiation
  6.4. Container Allocation
  6.5. Task Execution
  6.6. Task Monitoring
  6.7. Application Completion
7. Advantages of YARN Architecture
8. Disadvantages of YARN Architecture
9. Frequently Asked Questions
  9.1. Why is YARN better than MapReduce?
  9.2. What is the difference between YARN and HDFS?
  9.3. How does YARN improve resource utilization in Hadoop?
  9.4. Can existing MapReduce jobs run on YARN?
10. Conclusion
Last Updated: Oct 7, 2024
Hadoop YARN Architecture

Author Ravi Khorwal
Introduction

YARN (Yet Another Resource Negotiator) is the resource management layer of Apache Hadoop. It manages and schedules resources across a cluster. YARN lets multiple data processing engines, such as batch processing and real-time streaming engines, work on data stored in a single platform, which makes Hadoop more efficient and scalable.

In this article, we will explore the various components of YARN architecture, understand its operation, and discuss its features and application workflow.

What is Hadoop YARN?

YARN (Yet Another Resource Negotiator) is a key component of Apache Hadoop, which is designed to manage computing resources in clusters and utilize them effectively. YARN provides a platform to execute and manage processing activities on large data sets. It consists of a master daemon known as the Resource Manager, node-specific agents called Node Managers, and containers where specific tasks are executed.

YARN improves upon Hadoop's original data processing method by separating the roles of job scheduling and resource management into different components. This separation increases efficiency and allows for more flexible data processing operations. By handling cluster resource management, YARN enables Hadoop to support more varied processing approaches and a broader range of applications, making it a versatile tool for big data solutions.

The architecture of YARN enhances scalability and cluster utilization by allowing multiple applications to run simultaneously on Hadoop. Each application has its own application master which negotiates resources from the Resource Manager and works in conjunction with the Node Manager to execute and monitor the tasks.

Why is YARN in Hadoop Used?

YARN (Yet Another Resource Negotiator) in Hadoop is used to manage resources and schedule tasks efficiently across a cluster. It enhances Hadoop's scalability and resource utilization by separating resource management from job scheduling, enabling multiple data processing engines to run concurrently. This flexibility and improved resource management make YARN crucial for modern big data processing.

YARN Architecture and its Components

YARN architecture is built to improve the resource management capabilities of Hadoop by introducing several key components. Each plays a specific role in managing the cluster resources and processing tasks efficiently. Here are the main components of YARN:

1. Client

The YARN Client is the interface through which users submit jobs to the cluster. It interacts with the Resource Manager to initiate the Application Master, monitors job progress, and retrieves job status and output.

2. Resource Manager

This is the heart of the YARN architecture. It manages the use of resources across the cluster. The Resource Manager has two main components:

  • Scheduler: The Scheduler allocates resources to the running applications. It schedules on the abstract notion of a resource container, which bundles resources such as memory and CPU. It does not monitor or track the status of applications, which keeps it simple and scalable.
  • Application Manager: This component (called the ApplicationsManager in the Hadoop documentation) manages the application lifecycle at the cluster level. It accepts job submissions, negotiates the first container for executing the application-specific Application Master, and restarts the Application Master container on failure.
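
To make the Scheduler's role concrete, here is a minimal Python sketch of first-fit allocation over memory and CPU. It is a toy model, not YARN's scheduling code: the Node class, the allocate function, and the node names and sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of one machine's free resources."""
    name: str
    free_mb: int
    free_vcores: int


def allocate(nodes, need_mb, need_vcores):
    """First-fit: reserve the request on the first node that can hold it.

    Returns the chosen node name, or None if no node has enough room.
    """
    for node in nodes:
        if node.free_mb >= need_mb and node.free_vcores >= need_vcores:
            node.free_mb -= need_mb
            node.free_vcores -= need_vcores
            return node.name
    return None


nodes = [Node("node-1", 4096, 2), Node("node-2", 8192, 4)]
first = allocate(nodes, 3072, 2)   # fits on node-1
second = allocate(nodes, 3072, 2)  # node-1 is now too full, so node-2 is used
third = allocate(nodes, 8192, 1)   # no node has 8192 MB free any more
print(first, second, third)
```

Note that the function only hands out resources. Like the Scheduler described above, it does not track what the application does with them.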

3. Node Manager

A Node Manager is the per-machine agent responsible for containers. It monitors their resource usage (CPU, memory, disk, network) and reports it to the Resource Manager. It also manages the container life cycle on its node, starting and stopping containers as requested.
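
The kind of per-node report described above can be sketched in a few lines of Python. This is only an illustration: build_node_report, the dictionary keys, and the container IDs are made up for this example and do not reflect YARN's actual heartbeat format.

```python
def build_node_report(node_id, capacity_mb, containers):
    """Summarize container memory usage on one node (toy report format)."""
    used_mb = sum(c["used_mb"] for c in containers)
    return {
        "node": node_id,
        "containers": len(containers),
        "used_mb": used_mb,
        "free_mb": capacity_mb - used_mb,
        # Containers using more memory than they were granted.
        "over_limit": [c["id"] for c in containers if c["used_mb"] > c["limit_mb"]],
    }


containers = [
    {"id": "container_01", "used_mb": 900, "limit_mb": 1024},
    {"id": "container_02", "used_mb": 2300, "limit_mb": 2048},
]
report = build_node_report("node-1", 8192, containers)
print(report)
```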

4. Containers

Containers execute specific tasks. A container is the execution unit of YARN: a bundle of resources such as memory, CPU, and disk on a single node that a task needs in order to run. When an application is launched, the Application Master negotiates containers with the Resource Manager and then asks the Node Manager on the relevant node to launch them. The Node Manager monitors the containers while they run.

5. Application Master

Each application has its own instance of an Application Master. It is responsible for negotiating appropriate resource containers from the Scheduler, tracking their status, and monitoring their progress. Application Masters have the task of coordinating with the Node Manager to execute and monitor tasks.
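
As a rough illustration of the tracking role, the sketch below keeps a status per task and derives overall progress from it. ToyApplicationMaster, its methods, and the task names are hypothetical and are not part of any YARN API.

```python
from collections import Counter


class ToyApplicationMaster:
    """Hypothetical per-application tracker; not a YARN class."""

    def __init__(self, task_ids):
        # Every task starts out waiting for a container.
        self.status = {task_id: "PENDING" for task_id in task_ids}

    def update(self, task_id, state):
        """Record the latest state reported for one task."""
        if task_id not in self.status:
            raise KeyError(task_id)
        self.status[task_id] = state

    def progress(self):
        """Fraction of tasks that have completed."""
        done = sum(1 for state in self.status.values() if state == "DONE")
        return done / len(self.status)

    def summary(self):
        """Count of tasks in each state."""
        return dict(Counter(self.status.values()))


am = ToyApplicationMaster(["map-0", "map-1", "map-2", "reduce-0"])
am.update("map-0", "DONE")
am.update("map-1", "RUNNING")
print(am.progress(), am.summary())
```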

YARN Architecture Features

The YARN architecture has several key features that make it well-suited for managing resources in a Hadoop cluster. Some of the main features of YARN include:

1. Scalability

YARN is designed to scale to large clusters with thousands of nodes and tens of thousands of tasks. It achieves this scalability by separating resource management from job scheduling and execution, allowing each component to scale independently.

2. Multi-tenancy

YARN allows multiple applications to run simultaneously on the same Hadoop cluster, with each application having its own Application Master and set of containers. This multi-tenancy enables better utilization of cluster resources and allows different teams or users to share the same cluster.

3. Resource Allocation

YARN provides a flexible and dynamic resource allocation model that allows applications to request and release resources as needed. The Resource Manager allocates resources to applications based on their requirements and the available resources in the cluster, so that each application can obtain the resources it needs.
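
The request-and-release model can be pictured with a small Python sketch. ClusterPool and the application names are invented for this example. The point is only that resources released by one application become available to another.

```python
class ClusterPool:
    """Toy pool of cluster memory shared by several applications."""

    def __init__(self, total_mb):
        self.total_mb = total_mb
        self.held = {}  # application id -> MB currently held

    @property
    def free_mb(self):
        return self.total_mb - sum(self.held.values())

    def request(self, app_id, mb):
        """Grant the request only if enough memory is free right now."""
        if mb > self.free_mb:
            return False
        self.held[app_id] = self.held.get(app_id, 0) + mb
        return True

    def release(self, app_id, mb):
        """Return up to mb of the application's memory to the pool."""
        current = self.held.get(app_id, 0)
        freed = min(mb, current)
        self.held[app_id] = current - freed
        return freed


pool = ClusterPool(total_mb=8192)
a_ok = pool.request("app-A", 6144)      # granted, 2048 MB left
b_first = pool.request("app-B", 4096)   # refused, not enough free memory
pool.release("app-A", 4096)             # app-A hands back part of its share
b_second = pool.request("app-B", 4096)  # granted after the release
print(a_ok, b_first, b_second, pool.free_mb)
```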

4. Fault Tolerance

YARN is designed to be fault-tolerant and can handle failures at various levels of the architecture. If a task fails, the Application Master can request a new container from the Resource Manager and restart the task. If an Application Master fails, the Resource Manager can restart it on a different node.
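
The task-level retry behaviour described above can be sketched as a simple loop. This is a simplified illustration, not how YARN implements retries: run_with_retries, the flaky task, and the limit of three attempts are assumptions made for the example.

```python
def run_with_retries(task, max_attempts=3):
    """Run task(attempt); after a failure, try again as if in a new container.

    Returns (result, attempts_used, error_messages).
    """
    errors = []
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt), attempt, errors
        except RuntimeError as exc:
            errors.append(str(exc))
    raise RuntimeError(f"task failed after {max_attempts} attempts: {errors}")


def flaky(attempt):
    """Hypothetical task that fails on its first two attempts."""
    if attempt < 3:
        raise RuntimeError(f"container lost on attempt {attempt}")
    return "ok"


result, attempts, errors = run_with_retries(flaky)
print(result, attempts, errors)
```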

5. Flexibility

YARN provides a flexible platform for running different types of applications on Hadoop. It supports batch processing, interactive querying, real-time streaming, and iterative algorithms, among others. This flexibility enables organizations to use Hadoop for a wide range of use cases.

6. Security

YARN provides several security features to ensure that applications and data are protected. It integrates with Kerberos for authentication and supports access control lists (ACLs) for authorization. Hadoop can also encrypt data in transit, while encryption of data at rest is provided by HDFS rather than by YARN itself.

7. Monitoring

YARN provides a web-based UI for monitoring the status of applications, containers, and nodes in the cluster. It also provides REST APIs for programmatic access to monitoring data, enabling integration with external tools and systems.
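
As a small illustration of programmatic monitoring, the sketch below builds a URL for the Resource Manager's cluster applications REST endpoint and summarizes a response. The host name rm.example.com and the sample payload are made up, and the port 8088 is an assumption that depends on your cluster configuration. The code parses a local sample string and makes no network call.

```python
import json


def apps_url(rm_host, port=8088, states=None):
    """Build the URL of the Resource Manager's application list endpoint."""
    url = f"http://{rm_host}:{port}/ws/v1/cluster/apps"
    if states:
        url += "?states=" + ",".join(states)
    return url


def summarize_apps(payload):
    """Extract (id, state, progress) for each application in a JSON response."""
    body = json.loads(payload)
    # With no applications, the "apps" field may be null.
    apps = (body.get("apps") or {}).get("app", [])
    return [(a["id"], a["state"], a.get("progress", 0.0)) for a in apps]


# Hand-written sample in the shape of a response; not real cluster output.
sample = json.dumps({
    "apps": {
        "app": [
            {"id": "application_1700000000000_0001", "name": "wordcount",
             "state": "RUNNING", "progress": 42.5},
        ]
    }
})

print(apps_url("rm.example.com", states=["RUNNING"]))
print(summarize_apps(sample))
```

In a real setup the payload would come from an HTTP GET against that URL, with whatever authentication the cluster requires.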

Application Workflow in Hadoop YARN

 

Now that we have explored the architecture and features of YARN, let's take a look at how an application is executed in a YARN environment. The application workflow in YARN involves the following steps:

1. Application Submission

The client submits an application to the YARN Resource Manager. The application submission includes the application jar file, the resource requirements for the application, and other configuration parameters.

2. Application Acceptance

The Resource Manager accepts the application submission and allocates a first container in which the Application Master (AM) for the application is launched. The AM is responsible for negotiating resources from the Resource Manager and managing the application's execution.

3. Resource Negotiation

The Application Master negotiates with the Resource Manager for the resources required to run the application. The AM requests containers from the Resource Manager, specifying the resource requirements for each container (e.g. memory, CPU).

4. Container Allocation

The Resource Manager allocates containers to the Application Master based on the resource requirements and the available resources in the cluster. The containers are allocated on specific nodes in the cluster and are managed by the Node Managers on those nodes.

5. Task Execution

Once the containers are allocated, the Application Master launches the application's tasks in the containers. The tasks are executed on the nodes where the containers are located and are managed by the Node Managers on those nodes.

6. Task Monitoring

The Application Master monitors the progress of the tasks and the status of the containers. If a task fails, the AM can request a new container from the Resource Manager and restart the task.

7. Application Completion

Once all the tasks are completed, the Application Master notifies the Resource Manager that the application has finished. The Resource Manager then releases the resources allocated to the application and removes the AM from the system.

Here's a simple diagram that illustrates the application workflow in YARN:

Client -> Resource Manager -> Node Manager
  |              |                 |
  |              V                 V
  |     Application Master -> Containers
  |                                |
  |                                V
  |                              Tasks
  V
Application Completion

This workflow enables YARN to efficiently manage resources and execute tasks in a Hadoop cluster. The separation of concerns between the Resource Manager, Application Master, and Node Managers allows YARN to scale to large clusters and handle a wide range of applications.
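
The steps above can also be traced as a small state machine. The sketch below is a simplification: the state names are borrowed from YARN's application states, but the transition table, the advance and run_workflow functions, and the event descriptions are illustrative assumptions that cover only the happy path plus failure.

```python
# Simplified, hypothetical transition table. Real YARN has more states
# and transitions than are modelled here.
TRANSITIONS = {
    "NEW": {"SUBMITTED"},
    "SUBMITTED": {"ACCEPTED"},
    "ACCEPTED": {"RUNNING", "FAILED"},
    "RUNNING": {"FINISHED", "FAILED"},
}


def advance(state, target):
    """Move to target if the toy transition table allows it."""
    if target not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {target}")
    return target


def run_workflow():
    """Walk one application through the happy path and log each step."""
    steps = [
        ("SUBMITTED", "client submits the application to the Resource Manager"),
        ("ACCEPTED", "Resource Manager accepts it and starts the Application Master"),
        ("RUNNING", "Application Master negotiates containers and runs tasks"),
        ("FINISHED", "Application Master reports completion and resources are released"),
    ]
    state, log = "NEW", []
    for target, event in steps:
        state = advance(state, target)
        log.append((state, event))
    return state, log


final_state, log = run_workflow()
for state, event in log:
    print(f"{state:<10} {event}")
```

Resource negotiation, container allocation, task execution, and task monitoring all happen while the application is in the RUNNING state, which is why the seven steps collapse into four transitions here.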

Advantages of YARN Architecture

  • Resource Utilization: YARN efficiently allocates resources among multiple applications, leading to better utilization of cluster resources and improved performance.
  • Scalability: The separation of resource management and job scheduling allows YARN to scale more effectively, handling thousands of nodes and applications.
  • Flexibility: YARN supports multiple data processing engines like MapReduce, Spark, and Tez, enabling diverse workloads to run concurrently on the same cluster.
  • Fault Tolerance: YARN can restart a failed Application Master, and the Application Master can rerun failed tasks in new containers, improving availability and reliability.
  • Dynamic Resource Allocation: YARN can dynamically adjust resource allocation based on the current workload, optimizing resource usage and reducing wait times.
  • Improved Performance: By managing resources more efficiently and allowing concurrent execution of multiple frameworks, YARN improves overall job execution performance.

Disadvantages of YARN Architecture

  • Complexity: The architecture introduces additional components and interactions, making the system more complex to manage and troubleshoot.
  • Overhead: Resource management and scheduling can introduce overhead, potentially affecting performance for smaller jobs or clusters.
  • Compatibility Issues: Some older Hadoop applications and tools may not be fully compatible with YARN, requiring modifications or replacements.
  • Resource Contention: Concurrent execution of multiple frameworks can lead to resource contention and potential performance degradation if not managed properly.
  • Security Concerns: The multi-tenant nature of YARN can introduce security risks, requiring robust security policies and configurations to protect data and resources.
  • Learning Curve: Administrators and developers may face a steep learning curve when transitioning from traditional Hadoop to YARN, necessitating training and adaptation.

Frequently Asked Questions

Why is YARN better than MapReduce?

YARN separates resource management from job scheduling, allowing multiple data processing frameworks (beyond MapReduce) to run concurrently, improving resource utilization, scalability, flexibility, and overall cluster performance.

What is the difference between YARN and HDFS?

YARN manages cluster resources and schedules jobs, while HDFS (Hadoop Distributed File System) provides a distributed storage system for storing large data sets across multiple nodes.

How does YARN improve resource utilization in Hadoop?

YARN enhances resource utilization by dynamically allocating resources based on real-time demands of applications, ensuring optimal use of the cluster and reducing resource wastage.

Can existing MapReduce jobs run on YARN?

Yes. YARN is designed to be backward compatible with MapReduce, so most existing MapReduce applications run on YARN without code changes, although some may need to be recompiled against the newer libraries.

Conclusion

In this article, we have learned about YARN architecture, its components, and the workflow of applications in a Hadoop environment. We discussed the roles of the Resource Manager, Node Manager, Application Master, and containers, and also talked about how YARN optimises resource utilization and supports multiple processing frameworks. This understanding of YARN will help you efficiently manage and scale Hadoop applications in your computing environment.
