Code360 powered by Coding Ninjas X Naukri.com. Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1.
Introduction
2.
Hadoop Ecosystem
2.1.
Role of HDFS
3.
The Hadoop Distributed File System
3.1.
Name Node
3.2.
Data Node
3.3.
Rack Aware Placement Policy
3.3.1.
Example
4.
Goals of HDFS
5.
Pros and Cons
5.1.
Pros
5.2.
Cons
6.
Frequently Asked Questions
6.1.
What is the block size in HDFS?
6.2.
In HDFS, what is a heartbeat?
6.3.
What's the best way to figure out how many blocks We have in HDFS?
6.4.
What happens if Name Node fails during the task execution?
6.5.
In HDFS, what is a checkpoint?
7.
Conclusion
Last Updated: Mar 27, 2024

Hadoop Distributed File System(HDFS)

Introduction

Unsurprisingly, the amount of data produced is increasing. However, you might be surprised at how much it costs. Google is reported to process 63,000 search requests every second, equating to 5.6 billion daily searches and nearly 2 trillion global searches per year.

Every 1.2 years, the industry's amount of data captured and kept doubles.

We could stack all of the data created in a single day onto DVDs and go to the moon twice

Big Data management is the organisation, administration, and control of enormous volumes of structured and unstructured data. Hadoop is a Java-based open-source platform for storing and analysing large amounts of data. HDFS is where Hadoop reads and writes data.

In this article, we will explore the world of Hadoop and understand what HDFS is.

Hadoop Ecosystem

The Apache Hadoop ecosystem refers to all software components that make up the Apache Hadoop software library. It comprises both open-source projects and a wide range of additional technologies.  Some of the most popular  Hadoop ecosystem tools include HDFS, Hive, Pig, YARN, MapReduce, Spark, HBase, Zookeeper, etc.

Must Read Apache Server

Role of HDFS

HDFS is one of the most popular components of the Hadoop ecosystem. HDFS stands for Hadoop Distributed File System. It is one of the most important Apache projects and Hadoop's principal storage system. A NameNode and DataNode architecture are used in HDFS. It's a large-file-storage distributed file system that runs on commodity hardware in a cluster.

The Hadoop Distributed File System

The Hadoop Distributed File System is a data storage cluster that simplifies the management of related files across servers and is reliable, high-bandwidth, and low-cost. 

For files, HDFS is not the final destination. It's a data service that provides a unique set of required characteristics when data quantities and velocity are high.

HDFS is ideal for supporting extensive data analysis since the data is written once and then read numerous times, rather than the frequent read-writes of conventional file systems.

HDFS cluster consists of:

  • “Name Node.”
  • Multiple “Data Nodes.”

The NameNode is in charge of keeping track of where data is stored physically.

                                                             

                                                                                      Source

Name Node

HDFS works by breaking extensive data into abstract data blocks. The file metadata for these blocks is saved in a Fsimage on the NameNode local memory and includes the file name, file permissions, IDs, locations, and the number of replicas.

HDFS would be unable to locate any of the data sets dispersed among the Data Nodes if a Name Node failed. As a result, the Name Node becomes the cluster's single point of failure. We can use the Secondary Name Node or a Standby Name Node to address this problem.

Data Node

Data nodes are not intelligent, but they are flexible. Every data block is stored three times on independent Data Nodes within the HDFS cluster. The Name Node uses a rack-aware placement policy to track data nodes. 

A Data Node talks with the Name Node and accepts its instructions almost twenty times every minute. It also sends an hourly report on the condition and health of the data blocks on that node. The Name Node can ask the Data Node to generate more replicas, remove them, or reduce the number of data blocks on the node based on the information provided.

Rack Aware Placement Policy

High availability and replication are two of the primary goals of a distributed storage system like HDFS. As a result, data blocks must be dispersed among different Data Nodes and across server racks.

This prevents all data replicas from being terminated if an entire rack fails. The HDFS NameNode maintains a default rack-aware replica placement policy:

  • The first data block replica is created on the same node as the client.
  • The second replica is placed on a DataNode on a different rack at random.
  • On the same rack as the second replica, the third replica is housed in a different DataNode.
  • Any other replicas are stored on DataNodes spread around the cluster at random.

Example

                                                                 

                                                                                                     Source

"File.txt" is separated into three sections: A, B, and C. HDFS build block replicas to provide fault tolerance. The first replica of each block is placed on the nearest DataNode, the second replica on a different DataNode on the same rack, and the third replica on a separate rack.

Goals of HDFS

The Goals of the Hadoop Distributed File System are as follows:

  • Quick Recovery from Hardware Faults: Because HDFS uses a tremendous amount of commodity hardware, component failure is expected. Therefore HDFS was designed to detect and automatically recover from errors.
  • Huge Data Accommodation: HDFS is designed for applications with data sets ranging from gigabytes to terabytes in size. HDFS can scale to hundreds of nodes in a single cluster and provides high aggregate data bandwidth.
  • Hardware across Data: A specified task can be completed quickly when the calculation occurs close to the data. It minimises network traffic and boosts throughput, especially when large datasets are involved.
  • Portability: HDFS is meant to be portable across many hardware platforms and compatible with several underlying operating systems to help with adoption.

Pros and Cons

HDFS provides many features. Even though HDFS has much functionality, it falls short in a few instances. Here we will discuss some of the Pros and Cons of HDFS. 

Pros

  • Support the storing of large amounts of data.
  • Detect and respond to hardware issues rapidly.
  • Data access in real-time.
  • The consistency model is simplified.
  • Exceptional fault tolerance.
  • Hardware that is available for purchase.

Cons

  • Low-latency data access is not possible.
  • It is not recommended for storing a vast number of tiny files.
  • Modification of files is not permitted (HDFS 2.x supports the appending of the files).
  • Users can't write in parallel.

                                          

                                                                        Source

Frequently Asked Questions

What is the block size in HDFS?

The block size utilised by HDFS is typically 128 MB. As a result, an HDFS file is divided into 128 MB chunks, with each chunk residing on a distinct Data Node if possible.

 

In HDFS, what is a heartbeat?

A heartbeat is a signal sent from a Data node to a Name node to indicate that the Data node is still alive. The absence of a heartbeat in HDFS implies a problem, and the Name node and Data node cannot execute any computations.

 

What's the best way to figure out how many blocks We have in HDFS?

The fsck Hadoop command can also check the number of data blocks for a file or a block location.

 

What happens if Name Node fails during the task execution?

If the Name node goes down, the Hadoop cluster becomes unreachable and is declared dead. The data node stores actual data and follows the Name node's instructions. There can be numerous data nodes in a Hadoop file system, but only one active Name node.

 

In HDFS, what is a checkpoint?

Checkpointing is a process that requires merging the fsimage with the most recent edit log and creating a new fsimage for the name node to have the most up-to-date HDFS namespace metadata.

Conclusion

This article extensively discussed HDFS, its architecture, goals, advantages, and disadvantages.

We hope this blog has helped you enhance your HDFS and Big Data knowledge. You can learn more about Big DataBig Data vs. Data Science, and Big Data Engineers. If you liked this article, check out these fantastic articles also. 

Check out this article - File System Vs DBMS

Upvote our blog to help other ninjas grow.

Head over to our practice platform Coding Ninjas Studio to practice top problems, attempt mock tests, read interview experiences, and much more!!

 

We wish you Good Luck! Keep coding and keep reading Ninja!!

Live masterclass