Table of contents
1. Introduction
2. Hadoop Ecosystem Components
3. Hadoop's Architecture
4. Hadoop's Characteristics
5. Network Topology In Hadoop
6. Frequently Asked Questions
   6.1. Which Topology is used in Hadoop?
   6.2. Why is Hadoop using MapReduce?
   6.3. In Hadoop, how is network distance calculated?
7. Conclusion
Last Updated: Mar 27, 2024

Hadoop: Features, Components, Clusters & Topology

Author Palak Mishra

Introduction

Apache Hadoop is a framework for building data processing applications that run on a distributed computing system. Just as data on a personal computer lives on a local file system, Hadoop stores its data in a distributed file system known as HDFS (the Hadoop Distributed File System).


Hadoop is a free and open-source software framework. Large data sets are distributed across clusters of commodity computers, and applications built with Hadoop run on those clusters.
Commodity computers are inexpensive and widely available; they are used primarily to add computational power at low cost.


Hadoop Ecosystem Components

 

Hadoop Distributed File System

HDFS is the most crucial component of the Hadoop ecosystem and serves as Hadoop's primary storage system. The Hadoop Distributed File System (HDFS) is a Java-based file system for Big Data storage that is scalable, fault-tolerant, reliable, and cost-effective.
Components of HDFS: NameNode and DataNode are the two main components of Hadoop HDFS.
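As a rough illustration of how an application talks to HDFS, the sketch below uses Hadoop's Java FileSystem API to write a small file and read it back. The NameNode address and file path are placeholders, not values from this article.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        // Normally fs.defaultFS comes from core-site.xml on the classpath;
        // the NameNode address here is only a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // Write a small file into HDFS (its blocks are replicated across DataNodes).
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back and copy the contents to stdout.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```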

 

YARN
One of the essential components of the Hadoop ecosystem is Hadoop YARN (Yet Another Resource Negotiator). YARN acts as Hadoop's operating system and is in charge of managing and monitoring workloads. It enables multiple data processing engines, such as real-time streaming and batch processing, to handle data stored on a single platform.
The following are the main characteristics of YARN:
→ Flexibility
→ Efficiency
→ Shared


MapReduce
Hadoop MapReduce is the data processing component of the Hadoop ecosystem. MapReduce is a programming framework that makes it simple to create applications that process the massive amounts of unstructured and structured data stored in the Hadoop Distributed File System. This parallel processing improves the speed and reliability of the cluster.
MapReduce, a Hadoop ecosystem component, divides processing into two stages (a minimal word-count sketch in Java follows the list below):

  • The Map phase
  • The Reduce phase
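Here is the classic word-count example, sketched with Hadoop's Java MapReduce API: the map phase emits (word, 1) pairs for every word in the input, and the reduce phase sums the counts per word. The job driver (input/output paths, job submission) is omitted for brevity.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```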


Pig
Pig is a high-level language platform for querying and analyzing massive HDFS datasets. As a Hadoop ecosystem component, Pig uses the Pig Latin language, which has a lot in common with SQL. It loads the data, applies the necessary filters, and dumps the data in the specified format. Pig requires a Java runtime environment to execute programs (a small embedded example follows the feature list below). Apache Pig has the following features:

  • Extensibility
  • Opportunities for optimization
  • Handles a wide range of data
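Pig Latin scripts usually run from the Grunt shell or as script files, but Pig can also be embedded in Java via its PigServer class. A minimal sketch, assuming local mode and a hypothetical tab-separated input.txt with (name, value) fields:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbedded {
    public static void main(String[] args) throws Exception {
        // Run Pig in local mode; use ExecType.MAPREDUCE against a real cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin: load tab-separated lines, keep rows with value > 10, store the result.
        // "input.txt", the field names, and "output_dir" are placeholders for illustration.
        pig.registerQuery("records = LOAD 'input.txt' AS (name:chararray, value:int);");
        pig.registerQuery("big = FILTER records BY value > 10;");
        pig.store("big", "output_dir");

        pig.shutdown();
    }
}
```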
     

Hive

The Hadoop ecosystem component Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files. Hive performs three main functions: data summarization, query, and analysis.
Hive uses HQL (HiveQL), a SQL-like language. HiveQL converts SQL-like queries into MapReduce jobs, which are then executed on Hadoop (a minimal JDBC example follows the component list below).
The following are the main components of Hive:

  • Metastore – This is where the metadata is kept.
  • Driver – Manages the lifecycle of a HiveQL statement.
  • Query Compiler – Compiles HiveQL into a Directed Acyclic Graph (DAG).
  • Hive Server – Provides a user-friendly interface as well as a JDBC/ODBC server.
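Because HiveServer2 exposes a JDBC interface, a Java client can submit HiveQL that Hive then compiles and runs as jobs on the cluster. A minimal sketch, assuming a HiveServer2 instance at a placeholder host, a hypothetical employees table, and the hive-jdbc driver on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port, database, and credentials are placeholders.
        String url = "jdbc:hive2://hiveserver-host:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL is compiled into a DAG of jobs and executed on Hadoop.
            ResultSet rs = stmt.executeQuery(
                "SELECT department, COUNT(*) FROM employees GROUP BY department");

            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```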

 

HCatalog

HCatalog is a table and storage management layer for Hadoop. It lets Hadoop ecosystem components such as MapReduce, Hive, and Pig read and write data from the cluster. As a component of Hive, HCatalog allows users to store data in any format and structure. RCFile, JSON, CSV, SequenceFile, and ORC file formats are supported by default.

HCatalog's Advantages:

  • Provides notifications when data becomes available.
  • With the table abstraction, HCatalog frees the user from the overhead of data storage.
  • Provides visibility for data cleaning and archiving tools.


HBase

Apache HBase is a Hadoop ecosystem component that is a distributed database for storing structured data in tables with billions of rows and millions of columns. HBase is a distributed, scalable, NoSQL database built on top of HDFS. HBase allows you to read and write data in HDFS in real-time.
HBase's components are as follows:

  • HBase Master
  • RegionServer
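For a feel of HBase's real-time read/write API, the Java sketch below writes one cell and reads it back. The table name users and the column family info are assumptions for illustration (e.g. created beforehand in the HBase shell).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads the ZooKeeper quorum etc. from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "user1", column info:name.
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the same cell back in (near) real time.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```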


Thrift

Thrift is a software framework for creating scalable cross-language services and an interface definition language for RPC (Remote Procedure Call) communication. Because Hadoop does a great deal of RPC communication, Apache Thrift, a Hadoop ecosystem component, can be used for performance or other reasons.

 

Avro

Avro is a popular data serialization system and part of the Hadoop ecosystem. An open-source project, Avro provides data serialization and data exchange services for Hadoop. These services can be used together or separately. Avro allows Big Data programs written in different languages to exchange data.
Avro provides the following features:

  • Rich data structures.
  • Remote procedure call (RPC).
  • A compact, fast binary data format.
  • A container file for storing persistent data.
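A short Java sketch of Avro's container-file serialization using the generic API. The simple User schema defined inline here is a hypothetical example; real schemas usually live in .avsc files.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // A minimal record schema with two fields.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        // Write the record to a compact binary container file.
        File file = new File("users.avro");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Read it back; the schema travels with the file.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}
```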


Apache Mahout

Mahout is a free and open-source framework for building scalable machine learning algorithms and data mining libraries. Once data is stored in Hadoop HDFS, Mahout provides data science tools to find meaningful patterns in those large data sets.
Mahout's algorithms are as follows:

  • Clustering
  • Collaborative filtering
  • Classification
  • Frequent pattern mining


Apache Drill

Apache Drill's primary goal is to process structured and semi-structured data at very large scale. It is a low-latency distributed query engine that can scale to tens of thousands of nodes and query petabytes of data. Drill is the world's first schema-free distributed SQL query engine.
Apache Drill has the following features:

  • Extensibility
  • Flexibility
  • Schema discovery in real-time
  • Decentralized metadata


Apache Flume

Flume efficiently collects, aggregates, and moves large amounts of data from its source to HDFS. It is a fault-tolerant and reliable system. This Hadoop ecosystem component allows data to flow from a source into the Hadoop environment. It uses a simple and extensible data model that enables online analytics, and with Flume we can quickly get data from multiple servers into Hadoop.


Apache Sqoop
Sqoop is a Hadoop ecosystem component that imports data from external sources into HDFS, HBase, and Hive. It also exports data from Hadoop to external destinations. Sqoop works with relational database management systems such as Teradata, Netezza, Oracle, and MySQL.
Apache Sqoop has the following features:

  • Import of mainframe sequential datasets
  • Direct import to ORC files
  • Parallel data transfer
  • Fast, efficient data analysis
  • Fast data copies


Zookeeper
Apache ZooKeeper is a Hadoop ecosystem component that maintains configuration information, provides naming, distributed synchronization, and group services. ZooKeeper manages and coordinates large clusters of machines.
Zookeeper's characteristics include:

  • Fast
  • Ordered
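As a small illustration of the coordination primitives ZooKeeper exposes, the Java sketch below creates a znode holding a piece of configuration and reads it back. The connection string and znode path are placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to the ensemble; "zk-host:2181" is a placeholder address.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a small piece of configuration in a persistent znode
        // (throws NodeExistsException if the path already exists).
        zk.create("/app-config", "replication=3".getBytes(StandardCharsets.UTF_8),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read it back; every client sees the same ordered view of the data.
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}
```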


Ambari
Ambari, part of the Hadoop ecosystem, is a management platform for provisioning, managing, monitoring, and securing Apache Hadoop clusters. It provides a consistent, secure platform for operational control, making Hadoop management easier.
Ambari's characteristics include:

  • Installation, configuration, and management are all made easier.
  • Setup of centralized security
  • Highly customizable and extensible
  • Complete visibility into the health of the cluster


Oozie

Oozie combines multiple jobs into a single unit of work that runs in a logical order. It supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop and is fully integrated with the Apache Hadoop stack, using YARN as its architectural center. Oozie is also highly adaptable: jobs can be quickly started, stopped, suspended, and rerun, and it is even possible to skip or rerun a specific failed node.
Oozie jobs are divided into two categories:
→ Oozie workflow – It is used to store and run Hadoop-based workflows.
→ Oozie Coordinator – This program runs workflow jobs according to predefined schedules and data availability.

 


Hadoop’s Architecture

Hadoop stores data with HDFS and processes it with MapReduce in a master-slave architecture.

  • NameNode: The NameNode keeps track of every file and directory in the namespace.
  • DataNode: A DataNode manages the state of an HDFS node and lets you interact with its blocks.
  • Master node: The master node lets you use Hadoop MapReduce to perform parallel data processing.
  • Slave nodes: The slave nodes are the additional machines in a Hadoop cluster that store data and perform complex calculations. Each slave node runs a Task Tracker and a DataNode, which keep its processes synchronized with the Job Tracker and the NameNode.
     

Hadoop's Characteristics

  • It is a good fit for Big Data analysis. Because Big Data is typically distributed and unstructured, Hadoop clusters are best suited to analyzing it. Less network bandwidth is consumed because the processing logic (rather than the actual data) flows to the computing nodes. This is known as the data locality concept, and it improves the efficiency of Hadoop-based applications.
     
  • Scalability – Hadoop clusters can easily be scaled to any size by adding more cluster nodes, allowing for growth in Big Data. Scaling does not require any changes to the application logic.

  • Fault tolerance – The Hadoop ecosystem replicates input data onto other cluster nodes. If a cluster node fails, data processing can continue using the data stored on another node.
     
  • Data locality
  • Low cost
  • Simple to use
  • Open source
  • Distributed processing
  • High availability

Network Topology In Hadoop

 


As a Hadoop cluster grows, the topology (arrangement) of the network has an impact on performance. Aside from performance, high availability and failure handling are also essential considerations. Hadoop uses network topology during cluster formation to address these concerns.

In most cases, network bandwidth is an important consideration when forming a network. However, because measuring bandwidth is difficult, Hadoop represents the network as a tree. The distance between nodes in the tree (the number of hops) is a critical factor when forming a Hadoop cluster: the distance between two nodes equals the sum of their distances to their closest common ancestor.

A Hadoop cluster is made up of data centres, racks, and the nodes that execute jobs. A data centre is made up of racks, and each rack is made up of nodes. The amount of network bandwidth available to processes depends on their location: available bandwidth decreases as we move from

  • processes on the same node, to
  • different nodes on the same rack, to
  • nodes on different racks in the same data centre, to
  • nodes in different data centres.

Frequently Asked Questions

Which Topology is used in Hadoop?

Hadoop represents the network as a tree of data centres, racks, and nodes, and it uses this topology when forming a cluster. Network bandwidth is an important consideration, but because bandwidth is hard to measure, the hop distance in this tree is used instead.

Why is Hadoop using MapReduce?

Hadoop's MapReduce framework is used to create applications that can process large amounts of data on large clusters. It is also a programming model for processing large datasets across multiple clusters of computers, operating on data that is stored in a distributed manner.

In Hadoop, how is network distance calculated?

distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data centre)
distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centres)
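A tiny, hypothetical Java helper that mirrors this calculation for paths of the form /datacentre/rack/node (not Hadoop's internal API, just the same tree-distance idea):

```java
public class NetworkDistance {

    // Distance = steps from each node up to their closest common ancestor.
    // Paths look like "/d1/r1/n1" (data centre / rack / node).
    static int distance(String a, String b) {
        String[] pa = a.split("/");
        String[] pb = b.split("/");
        int common = 0;
        while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2: same rack
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4: same data centre
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6: different data centres
    }
}
```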

Conclusion

Finally, keep in mind that Hadoop will not be able to figure out your network topology for you. It does, however, assume that the network is flat—that is, that all nodes are on a single rack in a single data center—by default. This may be the case for small clusters, in which case no additional configuration is necessary.

We hope this blog has helped you enhance your knowledge of Hadoop and how it can be used to process large amounts of data. If you would like to learn more, check out our other articles on MapReduce. Still, the knowledge never stops, so have a look at more related articles: Hadoop, Data Warehouse, MongoDB, AWS, and many more.

A ninja never stops learning, so to feed your quest to learn and become more advanced and skilled, head over to our practice platform Coding Ninjas Studio to practice advanced-level problems. Attempt 100 SQL problems, read interview experiences, and much more!

Thank you for reading. 

Happy Coding!
