Table of contents
1.
Introduction
2.
Spark Interview Questions for Freshers
2.1.
1. Compare Apache Spark and MapReduce.
2.2.
2. Explain how Spark runs applications with the help of its architecture.
2.3.
3. Briefly explain the different cluster managers available in Apache Spark.
2.4.
4. Mention the steps of how you can connect Spark to Apache Mesos.
2.5.
5. Explain coalescing with an example.
2.6.
6. How will you calculate the executor memory?
2.7.
7. Explain types of operations supported by RDD, in detail.
2.8.
8. Does Apache Spark provide checkpoints? What are they?
2.9.
9. What are the two types of data for which we use checkpointing?
2.10.
10. What is the importance of Sliding Window operation?
2.11.
11. Say something on different levels of persistence in Spark.
2.12.
12. How will you compute the total count of unique words in Spark?
2.13.
13. What are accumulators? Why do we use them?
2.14.
14. Explain different MLlib tools available in Spark.
2.15.
15. Do you know what are the different data types supported by Spark MLlib?
2.16.
16. Describe the two components that allow model creation in MLlib.
2.17.
17. What can be done in case of complex data transformations?
2.18.
18. Compare Repartition and Coalesce.
2.19.
19. Define PageRank in Spark with an example.
2.20.
20. Explain the working of DAG in Spark.
2.21.
21. Write a Spark program to check whether a given keyword is present in a huge text or not.
2.22.
22. Mention some features of Spark Datasets.
2.23.
23. Can you explain how to minimize data transfers while working with Spark?
2.24.
24. What are the demerits of using Spark?
2.25.
25. What is the function of filter()?
3.
Apache Spark Interview Questions for Experienced
3.1.
26. Define piping.
3.2.
27. List down some limitations of using Apache Spark.
3.3.
28. What is the difference between reduce() and take() function?
3.4.
29. Differentiate between Spark Datasets, Spark DataFrames, and RDDs
3.5.
30. Can you explain Schema RDD?
3.6.
31: What is Resilient Distributed Dataset (RDD) in Apache Spark?
3.7.
32: Can you explain the different types of transformations in Spark?
3.8.
33: What is the role of Spark Driver?
3.9.
34: Explain Spark Streaming and how it works.
3.10.
35: How does Spark achieve fault tolerance?
3.11.
36: What is a SparkSession?
3.12.
37: Explain the concept of Lazy Evaluation in Spark.
3.13.
38: What are the advantages of using Spark over Hadoop MapReduce?
3.14.
39: Can you explain the concept of RDD persistence in Spark?
3.15.
40: What is a Parquet file in Spark?
3.16.
41: How can you minimize data serialization in Spark?
3.17.
42: What is the significance of an RDD’s partition in Spark?
3.18.
43: Explain the concept of a Spark executor.
3.19.
44: What is a stage in Spark?
3.20.
45: How does Spark handle data skew?
3.21.
46: What is the role of a SparkContext?
3.22.
47: Can you explain the functionality of coalesce in Spark?
3.23.
48: What is Spark MLlib?
3.24.
49: How does Spark use Akka?
3.25.
50: Can you explain the concept of a DataFrame in Spark?
4.
Frequently Asked Questions
4.1.
How do I prepare for a Spark interview?
4.2.
How do you answer Spark interview questions?
4.3.
How do you explain the Spark project in an interview?
4.4.
Can you pause a spark hire interview?
5.
Conclusion
Last Updated: Jun 14, 2024

Top Spark Interview Questions and Answers (2024)

Author Rupal Saluja

Introduction

In recent years, we have seen Big Data and Analytics make commendable progress. Data-driven decision-making has gained so much importance that it influences vital decisions worldwide.

If you are preparing for a Spark interview in the future and are looking for a quick guide before your interview, then you have come to the right place.


This article covers Spark Interview Questions and answers ranging from basic to advanced levels based on the Spark concepts.

 

Spark Interview Questions for Freshers

 

1. Compare Apache Spark and MapReduce.

| CRITERIA | APACHE SPARK | MAPREDUCE |
| --- | --- | --- |
| Data Processing | Processes data in batches as well as in real time | Processes data only in batches |
| Speed | Almost 100 times faster than MapReduce | Becomes slower when processing large volumes of data |
| Storage | Stores data in RAM, that is, in-memory storage | Stores data in HDFS |
| Duration | Retrieving data is easier and faster | Takes a long time to retrieve data |
| Caching | Provides caching and in-memory data storage | Data storage is highly disk-dependent |

2. Explain how Spark runs applications with the help of its architecture.

[Diagram: how Spark runs applications]

The diagram above will provide you with a better understanding of how Spark runs applications.

Spark applications run as independent sets of processes, coordinated by the SparkSession object in the driver program. The cluster manager (or resource manager) assigns tasks to the worker nodes, one task per partition.

Each worker node has two parts, the executor and the disk. The executor runs the tasks assigned to it, and the results are either sent back to the driver program or saved to disk.

3. Briefly explain the different cluster managers available in Apache Spark.

  • Standalone Mode: By default, applications are submitted to the standalone cluster and run in FIFO order, with each application trying to use all available nodes. A standalone cluster can be launched manually or by using the provided launch scripts.

  • Apache Mesos: An open-source project that can also run Hadoop applications. Its advantages are dynamic partitioning between Spark and other frameworks, and scalable partitioning between multiple instances of Spark.

  • Hadoop YARN: Spark also runs on YARN, the resource manager of Hadoop.

  • Kubernetes: It automates the deployment, scaling, and management of containerized applications.

4. Mention the steps of how you can connect Spark to Apache Mesos.

To form a connection between Spark and Apache Mesos, follow the steps below (a minimal configuration sketch follows the list).

  • First, configure the Spark driver program so that it connects to Apache Mesos.
  • Put the Spark binary package in a location that is accessible by Mesos.
  • Install Spark in the same location as Apache Mesos.
  • Configure the ‘spark.mesos.executor.home’ property to point to the location where Spark is installed.
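
Below is a minimal PySpark sketch of that configuration. The Mesos master URL and the install path are hypothetical placeholders; substitute the values for your own cluster.

from pyspark import SparkConf, SparkContext

# Hypothetical values: the Mesos master URL and install path depend on your cluster
conf = (SparkConf()
        .setAppName("mesos-example")
        .setMaster("mesos://mesos-master.example.com:5050")
        .set("spark.mesos.executor.home", "/opt/spark"))
sc = SparkContext(conf=conf)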

5. Explain coalescing with an example.

The coalesce() method is used to reduce the number of partitions in an RDD or DataFrame.

Let us understand this with the help of an example.

Consider an RDD having four partitions-

Partition A: 11,12

Partition B: 30, 40, 50

Partition C: 6, 7

Partition D: 9, 10

Suppose a filter operation removes all the multiples of 10:

Partition A: 11,12

Partition B: -

Partition C: 6, 7

Partition D: 9

Now the RDD has an empty partition and some sparsely filled ones, so it makes sense to reduce the number of partitions:

Partition A: 11,12

Partition C: 6, 7, 9

This is what the final RDD looks like after coalesce has been applied.
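
A minimal PySpark sketch of the same idea, assuming an existing SparkContext named sc:

rdd = sc.parallelize([11, 12, 30, 40, 50, 6, 7, 9, 10], 4)   # 4 partitions
filtered = rdd.filter(lambda x: x % 10 != 0)                 # drop the multiples of 10
coalesced = filtered.coalesce(2)                             # shrink to 2 partitions
print(coalesced.getNumPartitions())                          # 2
print(coalesced.collect())                                   # [11, 12, 6, 7, 9]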

6. How will you calculate the executor memory?

Consider the following cluster information:

Nodes = 10

Cores per node = 16 (1 reserved for the OS, so 15 usable)

RAM per node = 61 GB (1 GB reserved for the OS, so 60 GB usable)

Next, we identify the number of cores per executor, that is, the number of concurrent tasks an executor can run in parallel. The general rule of thumb for the optimal value is 5.

Executors per node = usable cores / cores per executor = 15 / 5 = 3

Total executors for the Spark job = number of nodes * executors per node = 10 * 3 = 30

Executor memory = usable RAM per node / executors per node = 60 GB / 3 = 20 GB per executor (a portion of this is normally reserved as memory overhead, so the value actually passed to --executor-memory is somewhat lower).

7. Explain types of operations supported by RDD, in detail.

RDD, in Spark, supports two types of operations (a short sketch follows).

  1. Transformations: Functions that generate one or more new RDDs from existing ones. An existing RDD is taken as input, and a new RDD is produced as output.

  2. Actions: Operations performed on the actual dataset that return non-RDD values. No new RDD is generated when an action is triggered.
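
A minimal sketch, assuming an existing SparkContext named sc:

nums = sc.parallelize([1, 2, 3, 4, 5])          # source RDD
squares = nums.map(lambda x: x * x)             # transformation: builds a new RDD, nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)    # another transformation
print(evens.collect())                          # action: triggers execution and returns [4, 16]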

8. Does Apache Spark provide checkpoints? What are they?

Yes, Apache Spark provides an API that facilitates the addition and management of checkpoints.

Checkpointing is the process of making streaming applications resilient to failures. The data and metadata are saved into a checkpointing directory, and if there is a failure, Spark can recover this data and start again from wherever it stopped.
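
A minimal sketch of RDD checkpointing, assuming a SparkContext named sc; the checkpoint directory is a hypothetical path:

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
rdd = sc.parallelize(range(100)).map(lambda x: x * 2)
rdd.checkpoint()      # mark the RDD so its data is saved to the checkpoint directory
rdd.count()           # the first action triggers the checkpoint to be written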

9. What are the two types of data for which we use checkpointing?

The two types of data for which checkpointing is used are-

  • Data Checkpointing: Some stateful transformations need the RDDs of previous batches, so the RDD is saved to reliable storage because upcoming RDDs depend on it.
  • Metadata Checkpointing: Metadata means data about data. It includes configurations, DStream operations, and incomplete batches, and it is stored in fault-tolerant storage like HDFS.

10. What is the importance of Sliding Window operation?


In Spark Streaming, the sliding window operation controls how incoming data is grouped for processing: instead of acting on each batch in isolation, transformations are applied over a window of data that slides forward by a specified interval.

The Spark Streaming library provides these windowed computations, in which transformations on RDDs are applied over a sliding window of data, as in the sketch below.
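
A sketch of a windowed word count, assuming an existing SparkContext sc and a hypothetical text source on localhost:9999. The window length is 30 seconds and the slide interval 10 seconds; both must be multiples of the batch interval.

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)                       # 10-second batch interval
lines = ssc.socketTextStream("localhost", 9999)      # hypothetical live source
counts = (lines.flatMap(lambda l: l.split(" "))
               .map(lambda w: (w, 1))
               .reduceByKeyAndWindow(lambda a, b: a + b, None, 30, 10))  # 30 s window, 10 s slide
counts.pprint()
ssc.start()
ssc.awaitTermination()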

11. Say something on different levels of persistence in Spark.

  • DISK_ONLY: RDD partitions are stored only on the disk.
  • MEMORY_ONLY_SER: RDD is stored as serialized Java objects, with one byte array per partition.
  • MEMORY_ONLY: RDD is stored as deserialized Java objects in the JVM.
  • OFF_HEAP: It stores the data in off-heap memory and otherwise works like MEMORY_ONLY_SER.
  • MEMORY_AND_DISK: It stores RDD as deserialized Java objects in the JVM and spills partitions that do not fit in memory to disk.
  • MEMORY_AND_DISK_SER: It works like MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk instead of being recomputed.
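
A minimal sketch of choosing a persistence level, assuming an existing SparkContext named sc:

from pyspark import StorageLevel

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)
rdd.persist(StorageLevel.MEMORY_AND_DISK)   # keep in memory, spill to disk when it does not fit
rdd.count()        # the first action computes and caches the RDD
rdd.count()        # later actions reuse the cached partitions
rdd.unpersist()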

12. How will you compute the total count of unique words in Spark?

To compute the total count of unique words in Spark, follow the steps mentioned below.

  • Firstly, load the text file as an RDD.
  • Use a function that breaks each line into words.
  • Run that function (for example, a ‘toWords’ function) on each element of the RDD with flatMap.
  • Convert each word into a (key, value) pair.
  • Then apply reduceByKey() to aggregate the counts.
  • Finally, count the resulting (word, count) pairs and print the total, as in the sketch below.
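
A minimal sketch of these steps, assuming a SparkContext named sc and a hypothetical input file:

lines = sc.textFile("test_file.txt")
words = lines.flatMap(lambda line: line.split(" "))     # break each line into words
pairs = words.map(lambda word: (word, 1))               # convert each word into a (key, value) pair
counts = pairs.reduceByKey(lambda a, b: a + b)          # aggregate the counts per word
print(counts.count())                                   # total number of unique words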

13. What are accumulators? Why do we use them?


To aggregate information across the executors, we use shared variables known as accumulators.

This information can be anything, typically diagnostic data, such as how many times an API has been called or how many records are corrupted.
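
A sketch of an accumulator counting corrupt records, assuming a SparkContext named sc:

bad_records = sc.accumulator(0)

def parse(line):
    try:
        return int(line)
    except ValueError:
        bad_records.add(1)    # tasks running on the executors add to the accumulator
        return 0

sc.parallelize(["1", "2", "oops", "4"]).map(parse).collect()
print(bad_records.value)      # the driver reads the aggregated value: 1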

14. Explain different MLlib tools available in Spark.

The different MLlib tools available are:

| Tool | What it provides |
| --- | --- |
| ML Algorithms | Classification, regression, clustering, and collaborative filtering |
| Featurization | Feature extraction, transformation, dimensionality reduction, and selection |
| Pipelines | Tools for constructing, evaluating, and tuning ML pipelines |
| Persistence | Saving and loading algorithms, models, and pipelines |
| Utilities | Linear algebra, statistics, and data handling |

15. Do you know what are the different data types supported by Spark MLlib?

  • Local vector- dense or sparse
  • Labeled point
  • Local Matrix
  • Distributed Matrix

16. Describe the two components that allow model creation in MLlib.

The two components of MLlib that allow model creation are:

  • Transformer: It reads a DataFrame and returns a new DataFrame with the required transformation applied.
  • Estimator: It takes a DataFrame, trains a model on it, and returns the model as a transformer.

17. What can be done in case of complex data transformations?

[Diagram: building a pipeline for complex data transformations]

Spark MLlib allows you to combine multiple transformations into a pipeline, which helps with complex data transformations.

The diagram above shows how such a pipeline is created. The model it produces can then be applied to live data, as in the sketch below.
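
A minimal sketch of chaining transformations in an MLlib Pipeline, assuming a SparkSession named spark; the tiny training DataFrame is made up for illustration:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

training = spark.createDataFrame(
    [("spark is great", 1.0), ("hadoop map reduce", 0.0)], ["text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])   # transformers followed by an estimator
model = pipeline.fit(training)                            # the fitted model can now score live data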

18. Compare Repartition and Coalesce.

| S.No | Criteria | Repartition | Coalesce |
| --- | --- | --- | --- |
| 1. | Change in number of partitions | Can increase or decrease the number of partitions | Can only decrease the number of partitions |
| 2. | Creation of partitions | New partitions are created by shuffling the data evenly | Existing partitions are reused, so the data is distributed unevenly |
| 3. | Speed | Slower | Faster |

19. Define PageRank in Spark with an example.

PageRank is an algorithm in GraphX that measures the importance of each vertex in a graph. For example, if a person on Facebook, Instagram, or any other social media platform has a huge number of followers, then his or her profile will be ranked higher than the others.

20. Explain the working of DAG in Spark.

DAG stands for Directed Acyclic Graph, which has a set of vertices and edges. The vertices represent RDDs, and the edges represent the operations to be performed on them.

The internals of job execution in Spark include:

  • The driver program: It submits the code, that is, jar files and configured dependencies, to the executors for execution, and it generates requests to the cluster manager for worker nodes or executors.
  • Cluster Manager: It allocates resources and instructs the workers to execute the job. It also tracks the submitted jobs and reports back their status.
  • Worker Nodes: Worker nodes contain the executors and the tasks associated with each executor. These nodes work in coordination with the cluster manager, which is responsible for managing them.
  • SparkContext: This further contains the Job, DAG Scheduler, and Task Scheduler. Job: whenever an action is encountered, a job is created. DAG Scheduler: it splits the graph into stages of tasks and submits each stage to the task scheduler as it becomes ready.
  • Task Scheduler: It launches the tasks on the executors.

21. Write a Spark program to check whether a given keyword is present in a huge text or not.

def keywordExists(line):
    # Return 1 if the keyword appears in the line, otherwise 0
    return 1 if "my_keyword" in line else 0

lines = sparkContext.textFile("test_file.txt")
isExist = lines.map(keywordExists)
total = isExist.reduce(lambda a, b: a + b)   # count the lines that contain the keyword
print("Found" if total > 0 else "Not Found")

22. Mention some features of Spark Datasets.

  • Compile-time analysis
  • Faster Computation
  • Less Memory consumption
  • Query Optimization
  • Qualified Persistent storage
  • Single Interface for multiple languages

23. Can you explain how to minimize data transfers while working with Spark?

Several ways to minimize data transfers while working with Apache Spark are as follows (a broadcast-variable sketch follows the list).

  • Using accumulators
  • Using broadcast variables
  • Avoiding ByKey operations and repartitions that trigger shuffles
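
A sketch of a broadcast variable, assuming a SparkContext named sc; the small lookup table is hypothetical reference data:

lookup = {"IN": "India", "US": "United States"}
broadcast_lookup = sc.broadcast(lookup)       # shipped to each executor once, not with every task

codes = sc.parallelize(["IN", "US", "IN"])
names = codes.map(lambda c: broadcast_lookup.value.get(c, "Unknown"))
print(names.collect())                        # ['India', 'United States', 'India']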

24. What are the demerits of using Spark?

There are certain demerits of using Spark. Some of them are:

  • Spark's in-memory approach consumes a lot of memory, which may give rise to memory-related problems.
  • Spark does not work well in multi-user environments.
  • Work should be distributed among multiple clusters for efficient execution.
  • In-memory computation in Spark can be a hurdle to cost efficiency in the case of large-scale data processing.
  • Problems arise when a large number of small files is used.

25. What is the function of filter()?

The filter() function is used to build a new RDD by selecting the elements of an existing RDD that pass the function supplied as its argument, as in the sketch below.
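
A minimal filter() sketch, assuming a SparkContext named sc:

nums = sc.parallelize([1, 2, 3, 4, 5, 6])
evens = nums.filter(lambda x: x % 2 == 0)   # keep only the elements that pass the predicate
print(evens.collect())                      # [2, 4, 6]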


Apache Spark Interview Questions for Experienced

26. Define piping.

Apache Spark provides the pipe() method on RDDs, which lets you compose different parts of a job in any language that can read from and write to the UNIX standard streams.

Using pipe(), each element of the RDD is passed as a string to an external process, and the lines the process writes to standard output form the resulting RDD. The piped command can be manipulated as per the requirements, and the results are produced accordingly.
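
A short sketch, assuming a SparkContext named sc and a UNIX-like environment where tr is available:

rdd = sc.parallelize(["hello world", "spark pipe example"])
upper = rdd.pipe("tr '[:lower:]' '[:upper:]'")   # any command that reads stdin and writes stdout
print(upper.collect())                           # ['HELLO WORLD', 'SPARK PIPE EXAMPLE']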

27. List down some limitations of using Apache Spark.

  • Spark does not have a built-in file management system, so it needs to be integrated with other platforms like Hadoop.
  • A smaller number of algorithms is available compared to some dedicated libraries.
  • Spark Streaming does not support record-based window criteria.
  • Its micro-batch model gives it higher latency than record-at-a-time streaming engines.
  • The work needs to be distributed over multiple clusters instead of running everything on a single node.

28. What is the difference between reduce() and take() function?

take(n) is an action that returns the first n elements of an RDD to the driver (local node).

reduce() is an action that repeatedly applies a binary function to the elements of an RDD until only one value is left.
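
A short sketch contrasting the two, assuming a SparkContext named sc:

nums = sc.parallelize([5, 3, 8, 1])
print(nums.take(2))                      # [5, 3]: the first two elements, returned to the driver
print(nums.reduce(lambda a, b: a + b))   # 17: values combined repeatedly until one is left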

29. Differentiate between Spark Datasets, Spark DataFrames, and RDDs

| Criteria | Spark Datasets | Spark DataFrames | Spark RDDs |
| --- | --- | --- | --- |
| Representation of data | A combination of DataFrames and RDDs, with features like static type safety and object-oriented interfaces | A distributed collection of data organized into named columns | A distributed collection of data without a schema |
| Optimization | Use the Catalyst optimizer | Also use the Catalyst optimizer | No built-in optimization engine |
| Schema projection | Schema is inferred automatically using the SQL engine | Schema is also inferred automatically | Schema must be defined manually |
| Aggregation speed | Faster than RDDs but slower than DataFrames | Fastest, thanks to easy and powerful APIs | Slower than both DataFrames and Datasets, even for simple operations like grouping |

30. Can you explain Schema RDD?

A Schema RDD is an RDD of row objects that carry schema information about the data type of each column. It is designed to make life easier for developers while debugging and running test cases; the row objects are essentially wrappers around arrays of basic types such as integers or strings. A Schema RDD presents the structure of the RDD, similar to the schema of a relational database, and it has the basic functionality of a normal RDD along with the query interfaces of SparkSQL.

31: What is Resilient Distributed Dataset (RDD) in Apache Spark?


RDD is the fundamental data structure of Apache Spark, representing a fault-tolerant collection of elements that can be processed in parallel. RDDs can be created from data in storage or by transforming other RDDs. They offer rich functionalities like map, filter, and reduce, and they automatically recover from node failures.

32: Can you explain the different types of transformations in Spark?


In Spark, transformations are operations that produce a new RDD from an existing one. They are lazy, meaning they are not executed until an action is called. There are two types of transformations: narrow transformations, where each partition of the child RDD depends on a single partition of the parent RDD (e.g., map, filter); and wide transformations, where computing a partition of the child RDD might require data from multiple partitions of the parent (e.g., groupByKey, reduceByKey).
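
A short sketch of one of each, assuming a SparkContext named sc:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
scaled = pairs.mapValues(lambda v: v * 10)         # narrow: no data moves between partitions
summed = pairs.reduceByKey(lambda a, b: a + b)     # wide: a shuffle moves data across partitions
print(summed.collect())                            # [('a', 4), ('b', 2)] (order may vary)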

33: What is the role of Spark Driver?


The Spark Driver is the program that creates the SparkContext, connecting to a given cluster manager. It is responsible for converting the user's application into tasks and scheduling them to run on executors. The driver also keeps track of the application's status and results.

34: Explain Spark Streaming and how it works.


Spark Streaming is an extension of the core Apache Spark API that allows processing of live data streams. Data can be ingested from many sources like Kafka, Flume, and HDFS, processed using complex algorithms expressed with high-level functions like map, reduce, and window, and then pushed out to file systems, databases, and live dashboards.
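
A minimal DStream sketch, assuming an existing SparkContext sc and a hypothetical text server on localhost:9999:

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 5)                       # 5-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)     # hypothetical live source
lines.flatMap(lambda l: l.split(" ")).countByValue().pprint()
ssc.start()
ssc.awaitTermination()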

35: How does Spark achieve fault tolerance?


Spark achieves fault tolerance through the use of Resilient Distributed Datasets (RDDs). RDDs are immutable and their lineage information (the sequence of transformations used to build them) is stored, so if any partition of an RDD is lost, it can be recomputed from the original data using the lineage information.

36: What is a SparkSession?


SparkSession is the unified entry point to Spark functionality. Introduced in Spark 2.0, it subsumes the older SQLContext and HiveContext, provides the DataFrame and Dataset abstractions, and acts as a distributed SQL query engine, allowing a user to seamlessly mix SQL queries with Spark programs.
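
A minimal SparkSession sketch (Spark 2.0+); the input file and its columns are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.read.json("people.json")                           # hypothetical input
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()    # SQL mixed with DataFrame code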

37: Explain the concept of Lazy Evaluation in Spark.


Lazy evaluation in Spark means that execution does not start until an action is triggered. Transformations in Spark are lazy, meaning they do not compute their results right away. Instead, they just remember the operations to be performed and the dataset (e.g., a file) they are to be applied to.

38: What are the advantages of using Spark over Hadoop MapReduce?


Spark provides faster processing compared to Hadoop MapReduce due to in-memory processing; it stores intermediate data in memory rather than on disk. Spark is easy to use, with APIs in Java, Scala, and Python. It also provides advanced analytics capabilities like machine learning, graph processing, and streaming.

39: Can you explain the concept of RDD persistence in Spark?


RDD persistence allows an RDD to be used across multiple Spark operations. When an RDD is marked as persistent, the first time it is computed as part of an action, its data is saved in memory and reused in further actions. This can significantly improve the performance of your Spark application.

40: What is a Parquet file in Spark?


Parquet is a columnar storage file format optimized for use with big data processing frameworks like Apache Spark. It provides efficient data compression and encoding schemes with enhanced performance to handle complex nested data structures.
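
A short sketch of writing and reading Parquet, assuming a SparkSession named spark and an arbitrary output path:

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.mode("overwrite").parquet("/tmp/people.parquet")   # stored column by column, compressed
back = spark.read.parquet("/tmp/people.parquet")
back.show()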

41: How can you minimize data serialization in Spark?


Data serialization can be minimized in Spark by using broadcast variables to cache data on each worker node rather than sending a copy of it with each task. You can also set the spark.serializer property to use Kryo serialization, which is more compact and efficient than the default Java serialization.
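
A minimal sketch of enabling Kryo via SparkConf (registering your own classes with Kryo is optional but further reduces serialized size):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kryo-example")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)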

42: What is the significance of an RDD’s partition in Spark?


Partitions in RDD are basic units of parallelism in Apache Spark. Data in an RDD is divided into these partitions, which can be processed on different nodes of a cluster. This allows for parallel processing on a cluster, leading to faster execution.

43: Explain the concept of a Spark executor.


A Spark executor is a JVM process that runs tasks in a Spark application. Executors run on worker nodes in a Spark cluster, and they are responsible for executing the tasks assigned to them by the driver program, reporting the status of the task execution, and interacting with the storage systems.

44: What is a stage in Spark?


A stage in Spark is a physical unit of execution that is a result of a sequence of transformations on the data. The entire computation is divided into stages by Spark, where each stage contains tasks based on transformations that can be executed together.

45: How does Spark handle data skew?


Data skew in Spark can be handled by repartitioning the skewed data, choosing the right kind of partitioning strategy (like HashPartitioner or RangePartitioner), and by using salting techniques to distribute the skewed key more evenly across the partitions.
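
A sketch of the salting technique, assuming a SparkContext named sc; the skewed dataset is made up for illustration:

import random

pairs = sc.parallelize([("hot", 1)] * 1000 + [("cold", 1)] * 10)
salted = pairs.map(lambda kv: ((kv[0], random.randint(0, 9)), kv[1]))   # split "hot" across 10 sub-keys
partials = salted.reduceByKey(lambda a, b: a + b)                       # shuffle runs on the salted keys
totals = partials.map(lambda kv: (kv[0][0], kv[1])).reduceByKey(lambda a, b: a + b)
print(totals.collect())                                                 # [('hot', 1000), ('cold', 10)]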

46: What is the role of a SparkContext?


SparkContext is the entry point to any Spark functionality. It represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster.

47: Can you explain the functionality of coalesce in Spark?


The coalesce() transformation is used to reduce the number of partitions in an RDD. It is a narrow transformation, meaning it does not shuffle data across partitions and thus results in a more efficient execution.

48: What is Spark MLlib?


MLlib is Apache Spark’s scalable machine learning library, providing a wide array of algorithms for classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as tools for model selection and tuning.

49: How does Spark use Akka?


Earlier versions of Spark used Akka for coordinating the execution of tasks and for messaging between the driver and the executors; from Spark 1.6 onwards it was replaced by Spark's own RPC layer. Akka itself is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications.

50: Can you explain the concept of a DataFrame in Spark?


A DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database. It can be created from an existing RDD, from a Hive table, or from data sources. DataFrames allow for processing large amounts of data quickly and efficiently and support various data formats.
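
A minimal DataFrame sketch, assuming a SparkSession named spark:

df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.filter(df.age > 40).select("name").show()   # column-oriented, SQL-like operations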

 

Frequently Asked Questions

How do I prepare for a Spark interview?

To prepare for a Spark interview, you should have a solid understanding of distributed computing concepts and experience working with Spark's APIs and ecosystem. Practice coding challenges and be prepared to discuss your previous experience with Spark projects.

How do you answer Spark interview questions?

To answer Spark interview questions, it's important to have a solid understanding of the Spark framework and its components, as well as experience using it for data processing and analysis. Be prepared to provide specific examples.

How do you explain the Spark project in an interview?

For a Spark project, you can explain how Spark runs apps with the help of its architecture. Spark applications run as independent processes coordinated by the SparkSession object in the driver program.

Can you pause a spark hire interview?

Yes, you can pause a Spark Hire interview by clicking on the "Pause" button located in the bottom right corner of the screen during the interview. This will allow you to take a break and resume the interview later.

Conclusion

Through this article, we hope that you have gained some insights on Spark Interview Questions. We hope that this will help you excel in your interviews and enhance your knowledge regarding Apache Spark and related stuff. This was the advanced level of Spark Interview Questions. For basic level Spark Interview Questions, refer to Part 1, and for intermediate level Spark Interview Questions, refer to Part 2.


Those who want to learn more can refer to our guided paths on Coding Ninjas Studio to learn about DSA, Competitive Programming, JavaScript, System Design, and more. Enroll in our courses, refer to the mock tests and problems available, try the interview puzzles, and look at the interview experiences and interview bundle for placement preparation. Do upvote our blog to help other ninjas grow.

Thank you for reading.

Until then, keep learning and keep exploring.
