Introduction
Have you ever wondered how the massive amounts of data we generate every day are processed effectively? The answer is data processing engines, and Apache Spark is one of them. Spark is an open-source distributed data processing engine used by developers working in the field of big data.
In this article, we will discuss the architecture of Apache Spark and take a detailed look at how each component works. So buckle up, and let's start learning.
What is Apache Spark?
Apache Spark is an open-source distributed data processing framework for big data analytics, supporting both batch and near real-time (streaming) workloads. It is a highly sought-after tool among developers and data scientists working with big data, and it can process large amounts of structured, semi-structured, and unstructured data.
Spark is commonly used as an alternative to Hadoop MapReduce. It uses in-memory cluster computing to speed up processing, which makes it considerably faster than MapReduce for many workloads, and it offers APIs in several languages, including Python, R, Java, and Scala.
The architecture of Spark is based on two main abstraction layers, which are given below.
Resilient Distributed Dataset (RDD)
Directed Acyclic Graph (DAG)
Let us discuss these two layers before starting to learn about Spark architecture.
Resilient Distributed Dataset (RDD)
RDD is the fundamental building block of Spark. It is an immutable, partitioned collection of records and the core data abstraction that Spark computations operate on. Because each RDD remembers how it was derived, lost data can be recomputed in case of failure. An RDD can contain any type of Python, Java, or Scala object.
Each dataset in RDD is divided into logical partitions. These partitions are stored and processed on various machines of a cluster.
The idea behind RDDs is to make MapReduce-style computation faster and more efficient. Using RDDs, you can perform two types of operations (a short sketch follows this list):
Transformations: These operations create a new RDD from one or more existing RDDs, for example map or filter. Transformations are lazy: they only describe the computation and do not execute it immediately.
Actions: Actions are the operations that trigger a Spark job to actually perform the computation on the dataset. Once the computation finishes, the result is returned to the Spark driver program (or written to storage).
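To make the difference concrete, here is a minimal PySpark sketch; the application name and the local master URL are illustrative placeholders:

```python
from pyspark import SparkConf, SparkContext

# Illustrative local setup; on a real cluster the master URL would differ.
conf = SparkConf().setAppName("rdd-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Create an RDD from an in-memory collection.
numbers = sc.parallelize(range(1, 11))

# Transformations: lazily describe new RDDs, nothing runs yet.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions: trigger a job and return results to the driver.
print(evens.collect())   # [4, 16, 36, 64, 100]
print(evens.count())     # 5
```

Until collect() or count() is called, Spark only records the transformations; no data is actually processed.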
Directed Acyclic Graph (DAG)
In Spark, a DAG is the logical plan of operations that must be applied to the data to produce the desired result. The term can be broken down as follows:
Directed: The edges have a direction, pointing from one node to the next operation.
Acyclic: There are no cycles or loops; once a transformation has been applied, the computation never returns to an earlier state.
Graph: A graph is a combination of vertices and edges. Here the vertices represent RDDs, and the edges represent the operations applied to them.
The DAG helps Spark optimize execution, achieve parallelism, and provide fault tolerance, since a lost partition can be recomputed by replaying the relevant part of the graph. The sketch below shows how a chain of transformations builds up such a lineage graph, which is only executed when an action is called.
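Continuing the earlier sketch (and reusing the SparkContext `sc` created there), this hedged example inspects the lineage Spark has recorded for an RDD before any job runs:

```python
# Reusing `sc` from the earlier sketch.
words = sc.parallelize(["spark", "dag", "rdd", "spark", "dag", "spark"])

# Each transformation adds a vertex and an edge to the lineage graph; nothing runs yet.
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# toDebugString() shows the recorded lineage (the logical DAG) for this RDD.
print(counts.toDebugString().decode("utf-8"))

# Only this action makes Spark turn the DAG into stages and tasks and execute them.
print(counts.collect())   # e.g. [('spark', 3), ('dag', 2), ('rdd', 1)]
```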
What is Apache Spark Architecture?
Apache Spark has a well-defined, layered architecture that integrates with various libraries and extensions. Spark follows a master-slave architecture in which all components and layers are loosely coupled. A Spark cluster consists of a single master and multiple slaves (workers).
Master Node: The master node hosts the driver program that runs our application. When the driver program starts, it executes the application's main program and creates a SparkContext. The SparkContext provides all of Spark's basic functionality; you can think of it as the gateway to everything Spark can do. Every operation you perform in Spark goes through the SparkContext.
Cluster Manager: The cluster manager works with the SparkContext and manages the execution of jobs inside the cluster. It is the resource manager for Spark applications: its job is to allocate resources and schedule tasks across the worker nodes. It can be Spark's standalone cluster manager or a third-party cluster manager such as Apache Mesos, Hadoop YARN, or Kubernetes.
Each job is broken down into smaller tasks, which are distributed across the worker nodes. The worker nodes process the partitions of the RDDs created through the SparkContext, and intermediate results can be cached.
Worker Node: The worker nodes act as slaves and execute the tasks assigned to them by the master node. The SparkContext splits a job into multiple tasks and distributes them to the worker nodes. The tasks operate on the partitioned RDDs, and the results of the various operations are collected and returned to the SparkContext. On each worker node, an executor runs the tasks and stores data during execution; cached data lives in the executor's memory.
Spark uses a combination of on-heap and, optionally, off-heap memory for caching data. On-heap memory is part of the JVM's heap space and is used to store objects, data structures, and cached data. Off-heap memory lives outside the JVM heap and is managed by Spark directly; it is disabled by default but can be enabled to reduce pressure on the JVM heap (and on garbage collection) when dealing with large amounts of data.
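As a rough illustration, these memory regions can be configured when the application is created. This is a minimal sketch: the configuration keys are standard Spark properties, but the sizes shown are placeholder values that depend entirely on your cluster and workload.

```python
from pyspark.sql import SparkSession

# Illustrative values only; appropriate sizes depend on your cluster and workload.
spark = (
    SparkSession.builder
    .appName("memory-config-demo")
    .config("spark.executor.memory", "4g")           # on-heap (JVM heap) memory per executor
    .config("spark.memory.offHeap.enabled", "true")  # allow Spark to use off-heap memory
    .config("spark.memory.offHeap.size", "2g")       # off-heap memory per executor
    .getOrCreate()
)
```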
We can improve performance by adding more worker nodes, so that the work can be divided further and executed faster.
What is the Spark Driver?
As the name suggests, the Spark driver drives the execution of a Spark application and maintains the state of the Spark cluster. The driver coordinates all the worker nodes and oversees the tasks assigned to them. It creates the SparkContext, which is the gateway to Spark's functionality, and the SparkContext works with the cluster manager to monitor the jobs running in the cluster.
The Spark driver must also interface with the cluster manager to obtain the physical resources needed for its tasks. In short, the Spark driver maintains the state of the applications running on the cluster.
What is the Spark Executor?
The Spark executor executes the tasks assigned to it by the Spark driver. The executor's job is straightforward: run the tasks it is given and, once execution completes, report the state (success or failure) and the results back to the driver. Every Spark application has its own executor processes, which can also cache data for the application.
What is the Cluster Manager?
Every Spark application runs on a cluster of machines. These machines are managed by Spark's cluster manager. In short, the cluster manager is a platform used to run Spark applications. The cluster manager also acts as a resource allocator for all the worker nodes. It schedules the task and allocates the resources to the worker nodes as per the need. It also interacts with Spark Context to manage various jobs in the cluster.
Spark supports the cluster managers listed below; a sketch of the corresponding master URLs follows the list.
Apache Mesos: Mesos is one of the first open-source cluster managers capable of handling workloads in a distributed environment. It provides efficient dynamic resource sharing and isolation.
Standalone: Spark's built-in standalone cluster manager is simple and resilient to worker failures. It makes setting up a Spark cluster easy and runs on Windows, Linux, and macOS.
Hadoop YARN: YARN is sometimes referred to as MapReduce 2.0. It is the resource manager at the heart of the Hadoop ecosystem and handles cluster resources efficiently.
Kubernetes: Kubernetes is an open-source system for deploying, scaling, and managing containerized applications.
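As a hedged sketch, the choice of cluster manager usually shows up as the master URL passed to the application or to spark-submit. The URL formats below are standard, but the hostnames and ports are placeholders:

```python
from pyspark.sql import SparkSession

# Placeholder hostnames and ports; in practice the master is usually supplied
# via spark-submit rather than hard-coded in the application.
builder = SparkSession.builder.appName("cluster-manager-demo")

# Standalone cluster manager
# builder = builder.master("spark://master-host:7077")

# Hadoop YARN (the resource manager address comes from the Hadoop configuration)
# builder = builder.master("yarn")

# Kubernetes
# builder = builder.master("k8s://https://k8s-apiserver-host:443")

# Apache Mesos
# builder = builder.master("mesos://mesos-master-host:5050")

# Local mode, useful for development
spark = builder.master("local[*]").getOrCreate()
```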
Spark Architecture Applications
Data Processing: Spark architecture is widely used for processing large-scale datasets efficiently, leveraging distributed computing across a cluster of nodes.
Real-Time Analytics: It enables real-time analytics on streaming data, supporting applications such as fraud detection, recommendation systems, and monitoring.
Machine Learning: Spark provides robust libraries (e.g., MLlib) for scalable machine learning, empowering applications in predictive modeling, classification, clustering, and recommendation.
Graph Processing: Spark GraphX facilitates graph processing tasks, enabling applications in social network analysis, network security, and recommendation systems.
Batch Processing: It supports fault-tolerant batch processing of data, making it suitable for ETL (Extract, Transform, Load) operations, data warehousing, and batch analytics, as sketched in the example below.
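For illustration, here is a minimal sketch of the batch ETL pattern mentioned above, using the DataFrame API; the file paths, column names, and status value are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-demo").getOrCreate()

# Extract: read raw data from a hypothetical CSV file.
orders = spark.read.option("header", True).csv("/data/raw/orders.csv")

# Transform: keep completed orders and aggregate revenue per customer.
revenue = (
    orders
    .filter(F.col("status") == "COMPLETED")
    .groupBy("customer_id")
    .agg(F.sum(F.col("amount").cast("double")).alias("total_revenue"))
)

# Load: write the result as Parquet for downstream analytics.
revenue.write.mode("overwrite").parquet("/data/curated/customer_revenue")

spark.stop()
```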
Different Modes of Execution in Spark
In Spark, the execution mode determines where your application's resources (in particular, the driver and the executors) are physically located when the application runs. You can choose from the following three modes:
Cluster Mode
Client Mode
Local Mode
Let us now discuss each one of them in detail.
Cluster Mode
In cluster mode, the Spark application runs on a cluster of machines, each with multiple cores. The user submits a pre-compiled JAR, Python script, or R script to the cluster manager, which then launches the driver process on one of the worker nodes in the cluster. This mode enables distributed processing and scalability for large datasets.
Client Mode
Client mode is almost the same as cluster mode, except that the Spark driver stays on the client machine that submitted the application rather than running on one of the cluster machines. This makes it easier to manage and debug Spark applications interactively.
Local Mode
Local mode is quite different from the previous two modes: it runs the entire Spark application, driver and executors alike, inside a single JVM on a single machine.
It is convenient for development, testing, and debugging, but it is not suitable for running production applications.
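A minimal local-mode sketch (the application name and the computation are purely illustrative):

```python
from pyspark.sql import SparkSession

# Local mode: driver and executors all run inside a single JVM on this machine.
# "local[*]" uses as many worker threads as there are cores; "local[2]" would use two.
spark = (
    SparkSession.builder
    .appName("local-mode-demo")
    .master("local[*]")
    .getOrCreate()
)

df = spark.range(1_000_000)                 # DataFrame of numbers 0 .. 999999
print(df.selectExpr("sum(id)").first()[0])  # 499999500000, computed entirely on this machine

spark.stop()
```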
Frequently Asked Questions
Which are some popular libraries and extensions available for Apache Spark?
Apache Spark includes MLlib for machine learning, Spark Streaming for real-time streaming, GraphX for graph processing, and Spark SQL for structured data queries.
Which language is used in the development of Spark?
The primary language used in the development of Apache Spark is Scala, although Spark also provides APIs in other languages such as Java, Python, and R to facilitate broader adoption and integration with different programming ecosystems.
What is a task in Spark architecture?
In Spark architecture, a task refers to a unit of work that is executed on a partition of data within a Spark job. Tasks are the smallest executable units within a Spark application, responsible for processing data and producing results.
What is DStream?
In Spark, a DStream (Discretized Stream) is a continuous stream of data represented as a sequence of Resilient Distributed Datasets (RDDs). It is the basic abstraction of the older Spark Streaming API.
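As a hedged sketch, a simple DStream word count might look like the following; the host and port are placeholders for a socket source that must already be running, and newer applications typically use Structured Streaming instead:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-demo")   # at least 2 threads: one for receiving, one for processing
ssc = StreamingContext(sc, 1)                   # 1-second batch interval; each batch becomes one RDD

# Placeholder host/port; requires a socket server streaming text lines.
lines = ssc.socketTextStream("localhost", 9999)
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)
counts.pprint()     # print a sample of each batch's results

ssc.start()
ssc.awaitTermination()
```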
Conclusion
This article discusses the architecture of Apache Spark with a detailed explanation of its components. We discussed the working of Apache Spark and its various execution modes. We hope this blog has helped you enhance your knowledge about the architecture of Apache Spark. If you want to learn more, then check out our articles.
If you have just started your learning journey and are looking for questions asked by tech giants like Amazon, Microsoft, and Uber, you should check out the problems, interview experiences, and interview bundles for placement preparation.
However, you may consider our paid courses to give your career an edge over others!
Happy Learning!