Introduction
Spark is an open-source, distributed computing system designed for fast and flexible data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Key features include:
- Speed
- Ease of Use
- Versatility
- Unified Engine
Spark's flexibility and performance make it a popular choice for big data applications and analytics.
If you are preparing for a Spark interview and looking for a quick guide to review beforehand, you have come to the right place.
This blog consists of 30 Spark Interview Questions and is divided into two parts.
- Spark Interview Questions for Freshers
- Spark Interview Questions for Experienced
Now, let us start with some important Spark Interview Questions for Freshers.
Spark Interview Questions for Freshers
1. In what ways is Spark better than MapReduce?
- Speed: up to 100 times faster for in-memory workloads
- In-memory data caching (see the sketch after this list)
- Excellent performance on iterative jobs
- No hard dependency on Hadoop; Spark can run standalone or on other cluster managers
- Strong machine learning support through MLlib
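A minimal sketch of in-memory caching, one of the points above: after cache(), repeated passes over the data are served from memory instead of re-reading the file. The input path is hypothetical, and a spark-shell session where `sc` is predefined is assumed.

```scala
val lines = sc.textFile("hdfs:///data/app.log")             // hypothetical input path
lines.cache()                                               // keep the RDD in memory once materialized
val errors   = lines.filter(_.contains("ERROR")).count()    // first action materializes the cache
val warnings = lines.filter(_.contains("WARN")).count()     // second action is served from memory
```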
2. What are the three main categories of components Spark Ecosystem comprises?
The three main categories of components are:
- Language support: Java, Python, Scala, and R
- Core components: Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and GraphX
- Cluster Management: Standalone Cluster, Apache Mesos, and YARN
3. Briefly explain the core components of the Spark Ecosystem.
- Spark Core: It is the base engine for large-scale parallel and distributed data processing.
- Spark Streaming: It processes real-time streaming data.
- Spark SQL: It facilitates relational processing of structured data.
- GraphX: It is used for graph processing and graph-parallel computation.
- MLlib: It provides machine learning algorithms and utilities.
4. What are the key features of Spark?
Some of the most prominent features of Spark are:
- Dynamic in nature
- Lazy Evaluation
- In-memory computing
- Reusability
- Real-Time Stream processing
- Fault Tolerance
5. Do you have any idea about lazy evaluation?
Lazy evaluation means that Spark does not execute transformations, such as creating an RDD out of an existing RDD or a data source, as soon as they are declared; it only records them and computes the result when an action is called. This ensures that no unnecessary memory or CPU usage occurs, especially when handling Big Data.
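A small sketch of this behaviour, assuming a spark-shell session where `sc` is predefined: the transformations below only record the lineage, and nothing is computed until the action runs.

```scala
val numbers = sc.parallelize(1 to 1000000)
val evens   = numbers.filter(_ % 2 == 0)   // transformation: not executed yet
val doubled = evens.map(_ * 2)             // transformation: not executed yet
val total   = doubled.count()              // action: triggers the actual computation
```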
6. How can you handle accumulated Metadata in Spark?
One way is to set the ‘spark.cleaner.ttl’ parameter to trigger automatic cleanups.
Another way is to simply split the jobs into batches and write intermediate results to disk.
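A sketch of the first approach, assuming an older Spark release where the spark.cleaner.ttl setting is still honoured; the application name is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MetadataCleanup")          // hypothetical application name
  .set("spark.cleaner.ttl", "3600")       // periodically clean metadata older than 3600 seconds
val sc = new SparkContext(conf)
```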
7. What do you know about repartitions?
Repartition can increase or decrease the number of data partitions. It creates new partitions and performs a full shuffle of the already distributed data. It is slower than coalesce because it internally calls coalesce with the shuffle parameter set to true.
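A quick sketch, assuming a spark-shell session where `sc` is predefined: repartition can move in either direction, at the cost of a full shuffle.

```scala
val rdd    = sc.parallelize(1 to 100, 4)   // start with 4 partitions
val more   = rdd.repartition(8)            // increase to 8 partitions (full shuffle)
val fewer  = rdd.repartition(2)            // decrease to 2 partitions (still shuffles)
println(more.getNumPartitions)             // 8
println(fewer.getNumPartitions)            // 2
```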
8. Do you have knowledge of the Parquet file?
Parquet is a columnar file format and is considered one of the best formats for Big Data analytics. Besides Spark SQL, it is supported by many other data processing systems.
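A minimal sketch of writing and reading Parquet with Spark SQL, assuming a spark-shell session where `spark` is predefined; the /tmp path and sample data are hypothetical.

```scala
import spark.implicits._

val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
people.write.parquet("/tmp/people.parquet")            // columnar files written to disk
val loaded = spark.read.parquet("/tmp/people.parquet")
loaded.select("name").show()                           // only the 'name' column needs to be read
```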
9. What are the advantages of the Parquet file?
Some advantages are:
- Limits I/O operations
- Fetches only the specific columns that are needed
- Consumes less storage space
- Keeps data better organized
10. Are you familiar with the term ‘Shuffling’?
Shuffling is the process of redistributing data across partitions, and it leads to data movement. It is an expensive operation because the data sometimes moves between executors and sometimes between worker nodes in a cluster.
11. When does Shuffling occur? What operations do Spark Shuffling trigger?
Shuffling occurs while joining two tables or while performing certain transformation operations.
Spark shuffling is triggered by wide transformations such as groupByKey(), reduceByKey(), join(), and groupBy().
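A short sketch, assuming a spark-shell session where `sc` is predefined: both operations below force rows with the same key onto the same partition, which is what triggers the shuffle.

```scala
val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))
val summed  = pairs.reduceByKey(_ + _)     // shuffle: combines values per key
val grouped = pairs.groupByKey()           // shuffle: collects all values per key
summed.collect().foreach(println)          // prints (a,4) and (b,6)
```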
12. Can you say something about two important compression parameters of Shuffling in Spark?
The two shuffle compression parameters are:
- spark.shuffle.compress - This parameter determines whether the engine compresses shuffle outputs or not.
- spark.shuffle.spill.compress - This parameter decides whether to compress intermediate shuffle spill files or not.
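A sketch of supplying both settings through SparkConf; the application name is hypothetical, and both parameters default to true in current Spark releases.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ShuffleCompression")               // hypothetical application name
  .set("spark.shuffle.compress", "true")          // compress map output files
  .set("spark.shuffle.spill.compress", "true")    // compress data spilled to disk during shuffles
```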
13. What file systems are supported by Apache Spark?
- HDFS (Hadoop Distributed File System)
- Local File System (LFS)
- Amazon S3
- HBase
- Cassandra
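For the file-system-backed sources above, the same read API is used and only the URI scheme changes, as in the sketch below (hypothetical paths, spark-shell session where `spark` is predefined; HBase and Cassandra are accessed through their own connectors instead).

```scala
val localDf = spark.read.text("file:///tmp/input.txt")         // local file system
val hdfsDf  = spark.read.text("hdfs://namenode:9000/data/in")  // HDFS
val s3Df    = spark.read.text("s3a://my-bucket/data/in")       // Amazon S3 (via the hadoop-aws connector)
```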
14. What is the purpose of coalescing in Spark?
Coalesce can only reduce the number of data partitions. It merges existing partitions, which minimizes the amount of data that is shuffled, making it faster than repartition.
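A quick sketch, assuming a spark-shell session where `sc` is predefined:

```scala
val rdd    = sc.parallelize(1 to 100, 8)   // start with 8 partitions
val merged = rdd.coalesce(2)               // merge down to 2 partitions without a full shuffle
println(merged.getNumPartitions)           // 2
```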
15. What is Spark Core? What are the various functionalities supported by it?
Spark Core is the fundamental unit of the Spark project. It serves as the engine for parallel and distributed processing of large datasets.
Spark Core provides various functionalities, such as task dispatching, scheduling, and basic I/O operations.
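A minimal sketch of a standalone job built directly on Spark Core, which exercises those responsibilities: creating a context, distributing data, scheduling tasks, and writing results. The input and output paths are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc   = new SparkContext(conf)
    val counts = sc.textFile("hdfs:///data/input.txt")    // hypothetical input path
      .flatMap(_.split("\\s+"))                           // split lines into words
      .map(word => (word, 1))                             // pair each word with a count of 1
      .reduceByKey(_ + _)                                 // sum counts per word
    counts.saveAsTextFile("hdfs:///data/output")          // hypothetical output path
    sc.stop()
  }
}
```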