Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1.
Introduction
2.
Intermediate Level Spark Interview Questions
3.
Frequently Asked Questions
3.1.
What are the datatypes supported by SparkSQL and the DataFrame?
3.2.
What does the term ‘Akka’ mean?
3.3.
Why is it not necessary to install Spark on all nodes of the YARN cluster?
4.
Conclusion
Last Updated: Mar 27, 2024

Spark Interview Questions-2

Author Rupal Saluja

Introduction

Spark, which began as a subproject of Hadoop, was made open source under the BSD license in 2010. In 2013, it was donated to the Apache Software Foundation, and in 2014 it became a top-level Apache project.

If you are preparing for a Spark interview in the future and are looking for a quick guide before your interview, then you have come to the right place.

The whole series consists of 90 Spark Interview Questions and is divided into three parts. This blog is Part 2, which covers the next 30 Spark Interview questions for the Intermediate level. You can refer to Part 1 for the Beginner level and Part 3 for the Advanced level Spark Interview questions. 

Now, let us start with some important Spark interview questions at the intermediate level.

Recommended Topic: Pandas Interview Questions

Intermediate Level Spark Interview Questions

1. In what ways is Spark better than MapReduce?

  • Speed- up to 100 times faster
  • In-memory data caching
  • Excellent performance in iterative jobs
  • No dependency on Hadoop
  • Commendable ML applications

 

2. What are the three main categories of components Spark Ecosystem comprises?

The three main categories of components are:

  • Language support: Java, Python, Scala, and R
  • Core components: Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and GraphX
  • Cluster Management: Standalone Cluster, Apache Mesos, and YARN

 

3. Briefly explain the core components of the Spark Ecosystem.

  • Spark Core: It is the base engine for large data processing.
  • Spark Streaming: It processes real-time streaming data.
  • Spark SQL: It facilitates relational processing.
  • GraphX: It is used for graphs and their computation.
  • MLlib: It performs machine learning.

 

4. What are the key features of Spark?

Some most prominent features of Spark are-

  • Dynamicity in Nature
  • Lazy Evaluation
  • In-memory computing
  • Reusability
  • Real-Time Stream processing
  • Fault Tolerance

 

5. Do you have any idea about lazy evaluation?

Lazy Evaluation is a functionality of Spark whereby transformations that create a new RDD from an existing RDD or a data source are only recorded, not executed immediately. Execution is deferred until an action is called, which ensures there is no unnecessary usage of memory or CPU, especially when handling Big Data.

6. How can you handle accumulated Metadata in Spark?

One way is to set the ‘spark.cleaner.ttl’ parameter to trigger automatic cleanups.

Another way is to simply split the jobs into batches and write intermediate results to disk.

7. What do you know about repartitions?

Repartition can increase or decrease the number of data partitions. It helps in creating new data partitions and performs a full shuffle of the already distributed data. It is slower than coalesce because it internally calls coalesce with the shuffle parameter set to true.

8. Do you have knowledge of the Parquet file?

Parquet is a columnar file format and is considered to be one of the best formats for Big Data analytics. It is supported by many data processing systems besides Spark SQL.

9. What are the advantages of the Parquet file?

Some advantages are-

  • Limits I/O operations
  • Fetch specific columns as per need
  • Consumes less space
  • Better organized data

 

10. Are you familiar with the term ‘Shuffling’?

Shuffling is the process of redistributing data among partitions, which leads to data movement. It is an expensive operation because the data sometimes moves between executors and sometimes between worker nodes in a cluster.

11. When does Shuffling occur? What operations do Spark Shuffling trigger?

Shuffling occurs while joining two tables or while performing certain transformation operations.

Spark shuffling is triggered by wide transformations such as groupByKey(), reduceByKey(), join(), and groupBy().

12. Can you say something about two important compression parameters of Shuffling in Spark?

The two compression parameters Shuffling has are:

  • spark.shuffle.compress - This parameter determines whether the engine compresses shuffle outputs or not.
  • spark.shuffle.spill.compress - This parameter decides whether to compress intermediate spill files or not.

13. What file systems are supported by Apache Spark?

  • HDFS, that is, Hadoop Distributed File System
  • LFS, that is, Local File System
  • Amazon S3
  • HBase
  • Cassandra

 

14. What is the purpose of coalescing in Spark?

Coalescing can only reduce the number of data partitions. It makes use of existing partitions to reduce the amount of shuffled data. It is faster than repartition.

15. What is Spark Core? What are the various functionalities supported by it?

Spark Core is the fundamental unit of the Spark project. It can be described as an engine for parallel and distributed data processing.

Spark Core facilitates various functionalities, such as task dispatching, scheduling, I/O operations, etc.

16. Can you convert a spark RDD into a DataFrame?

There are two ways to convert a Spark RDD into a DataFrame. These are-

  • Using the helper function toDF()
  • Using SparkSession.createDataFrame

 

17. What is Lineage Graph?

We know that there will always be dependencies between an existing RDD and a new RDD derived from it. All such dependencies are recorded in a graph, called the Lineage Graph.

Spark does not replicate data in memory. So, if a partition of a persisted RDD is lost, this graph is used to recompute the lost data.

18. What is caching in Spark Streaming?

Caching, also known as Persistence, is an optimization technique for Spark computations. In DStreams, stream data is allowed to be cached in memory. If data is to be computed multiple times, caching becomes very useful. The ‘persist()’ method is used to implement caching in Spark Streaming.

19. Do you know what broadcast variables are?

Broadcast variables are read-only variables cached in memory on every machine. They can be used to give every node a copy of a large input dataset. This eliminates the need to ship a copy of the variable with each task and makes data processing faster.

20. What do you know about Spark SQL?

A module of Spark that facilitates structured data processing is known as Spark SQL. It integrates relational processing with Spark’s functional programming, making it possible to encapsulate SQL queries with code transformations. Also, it provides programming abstraction in the form of DataFrames.

21. Mention some features of Spark SQL.

  • Facilitates a high level of integration between SQL and any other language code
  • Standard connectivity through JDBC and ODBC
  • Uniform Data Access from varied data sources
  • Allows full compatibility with current Hive Data, queries and UDFs
  • Cost-based optimizer and code generator

 

22. Can you name some commonly-used Spark Ecosystems.

  • Spark SQL- SQL and structured data processing
  • GraphX- Generation & Computation of graphs
  • MLlib- Machine Learning Algorithms
  • SparkR- Promotes the R Programming Language
  • Spark Streaming- Processes live data streams

 

23. What does a Spark Engine do?

A Spark engine is responsible for scheduling, distributing, and monitoring data applications across the cluster. The Spark Engine is used to run mappings in Hadoop clusters. It is suitable for wide-ranging circumstances, including SQL batch and ETL jobs in Spark, streaming data from sensors and IoT, ML, etc.

24. Briefly describe the deploy modes in Apache Spark.

The two deploy modes in Apache Spark are-

  • Client Mode: In this mode, the driver component runs on the machine from which the job is submitted.
  • Cluster Mode: In this mode, the driver component does not run on the local machine from which the job is submitted; it runs inside the cluster.

 

25. Explain the difference between Transformations in Spark and Actions in Spark.

Transformations are functions applied to RDDs that result in another RDD. They are not executed until an action occurs.

Actions, on the other hand, are operations that work with the actual data set. They are RDD operations that return non-RDD values.

26. What do you understand by Pair RDD?

Some special operations on RDDs are performed using key/value pairs. Such RDDs are referred to as Pair RDDs. The ‘reduceByKey()’ method aggregates the data for each key, and the ‘join()’ method combines different RDDs based on elements having the same key.

27. What is executor memory in a Spark application?

Each Spark application has the same fixed heap size and a fixed number of cores for every Spark executor. The heap size here is what we call executor memory. It is controlled with the spark.executor.memory property, which can also be set via the --executor-memory flag.

28. Explain in brief, Streaming implementation in Spark.

Spark Streaming is implemented to process real-time streaming data. Data from various sources, such as Flume and HDFS, is streamed and then processed and delivered to file systems, live dashboards, and databases. Thus, it is a meaningful addition to the core Spark API.

29. Do we have any API For graph implementation in Spark?

GraphX is the Spark API for graphs and their parallel computation. There is not only a set of fundamental operators, such as subgraph, joinVertices, mapReduceTriplets, but also an optimized variant of the Pregel API. In addition to this, there is a growing collection of graph algorithms and builders to simplify graph analysis.

30. How would you implement Machine Learning in Spark?

We use the ‘MLlib’ library provided by Spark to implement Machine Learning. It covers common learning algorithms and use cases, such as clustering, regression, collaborative filtering, dimensionality reduction, etc., which make the implementation of Machine Learning easy and scalable.


Frequently Asked Questions

What are the datatypes supported by SparkSQL and the DataFrame?

We can use datatypes such as ArrayType, MapType, ObjectType, and StructType.

What does the term ‘Akka’ mean?

Akka is basically used for scheduling. After workers register, they receive requests for tasks. Akka facilitates this messaging between workers and masters.

Why is it not necessary to install Spark on all nodes of the YARN cluster?

Spark can execute on top of YARN, Mesos, or any other cluster manager without requiring any change to the cluster. That is why it is not essential to install Spark on every node.

Conclusion

Through this article, we hope that you have gained some insights on Spark Interview Questions. We hope that this will help you excel in your interviews and enhance your knowledge of Apache Spark and related topics. This was the intermediate level of Spark Interview Questions. For basic-level Spark Interview Questions, refer to Part 1, and for advanced-level Spark Interview Questions, refer to Part 3.

For peeps out there who want to learn more about Data Structures, Algorithms, Power programming, JavaScript, or any other upskilling, please refer to guided paths on Coding Ninjas Studio. Enroll in our courses, take mock tests, solve the available problems, and try the interview puzzles. Also, check out the interview experiences and the interview bundle for placement preparations. Do upvote our blog to help other ninjas grow.

Happy Coding!
