Last Updated: Mar 27, 2024
Difficulty: Medium

PySpark Tutorial


Introduction

Data processing is the manipulation and transformation of data, typically using computer software, to extract meaningful insights and information. For data of enormous size and variety, many platforms and tools are available, and PySpark is one of them.


In this PySpark tutorial, we will understand the basics of PySpark and how to use it to process and analyze big data.

What is Apache Spark?

Apache Spark is an open-source, distributed computing system used for processing data sets at a large scale.

It allows writing programs in languages like Java, Scala, Python, and R. It also provides a range of APIs for working with structured and unstructured data, including RDDs, DataFrames, Datasets, and SQL.

Spark is highly scalable and fault-tolerant, making it possible to process large data sets across a distributed cluster of machines quickly and efficiently. It also includes several libraries for machine learning, graph processing, and stream processing.


What is PySpark?

PySpark is the Python API for Apache Spark and a powerful tool for data analytics and machine learning. It is an integrated environment for working with large-scale datasets built on Apache Spark.

PySpark exposes Spark's rich APIs in Python, making it easy for users familiar with the language to get started with big data analytics immediately.

In addition, it lets you write standard SQL queries through Spark SQL, so you don't have to learn new syntax when switching from toolsets like Hive or Pig.

What is PySpark used for?

PySpark is used for the following:

  • Distributed data processing and analytics in Python
  • Performing operations like data manipulation, machine learning, and graph processing on large datasets
  • Integrating with Apache Spark, a leading open-source distributed computing system
  • Giving Python developers a user-friendly API to leverage Spark's capabilities
  • Tasks that require efficient processing and analysis of vast amounts of data
  • Data science, data engineering, and other domains dealing with large-scale datasets

PySpark Architecture

PySpark's architecture is designed for distributed data processing. It revolves around key components like the Driver Program, which manages the execution of tasks, and the SparkContext, serving as a connection to the cluster. The Cluster Manager allocates resources to applications, while Executors run computations and store data. 

Tasks process data in partitions, and Resilient Distributed Datasets (RDDs) serve as the primary data abstraction. The DAG Scheduler converts logical plans into physical ones, and the Task Scheduler assigns tasks to available executors. Additionally, caching and persistence enhance performance, while shared variables like Broadcast Variables and Accumulators aid in parallel operations. 

This architecture ensures efficient parallel processing and fault tolerance, making PySpark ideal for big data applications.
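
To make the shared variables concrete, here is a minimal sketch of a broadcast variable and an accumulator, assuming a SparkContext named sc has already been created (as shown in the installation section below):

# Broadcast variable: read-only lookup data shipped once to every executor
lookup = sc.broadcast({"a": 1, "b": 2})

# Accumulator: a counter that tasks on the executors can add to in parallel
bad_records = sc.accumulator(0)

def score(key):
    if key not in lookup.value:
        bad_records.add(1)  # count keys missing from the lookup table
        return 0
    return lookup.value[key]

print(sc.parallelize(["a", "b", "c"]).map(score).collect())  # [1, 2, 0]
print(bad_records.value)  # 1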

Setting up PySpark

Before we dive into PySpark, let’s start our PySpark tutorial by setting up our environment. Here are the steps to follow: 

Step 1: Install Java

PySpark requires Java to be installed on your system. You can download it from the official website and follow the installation instructions. 

Step 2: Install Python

PySpark requires Python to be installed on your system. You can download it from the official website and follow the installation instructions. 

Step 3: Install Spark

You can download Spark from the official website and follow the installation instructions. 

Step 4: Set up PySpark

Once you have installed Spark, you can install PySpark using pip or another package manager. Make sure to set the SPARK_HOME environment variable to the location of your Spark installation.
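
If you installed a standalone Spark distribution rather than the pip package, one optional way to make it visible to Python is the third-party findspark package. The snippet below is a sketch that assumes you have run pip install findspark, and /opt/spark is only an example path:

import findspark
findspark.init("/opt/spark")  # point PySpark at your SPARK_HOME

import pyspark
print(pyspark.__version__)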

PySpark Installation on Windows

Follow the steps below to install PySpark on Windows:

Step 1: Install Java: Download and install the latest version of Java Development Kit (JDK) from the official Oracle website.

Step 2: Set JAVA_HOME Environment Variable: Set the JAVA_HOME environment variable to the JDK installation path.

Step 3: Install PySpark: Open a command prompt or terminal. Use the following command to install PySpark using pip:

pip install pyspark

 

Step 4: Verify Installation: Open a Python shell or create a Python script.

from pyspark import SparkContext, SparkConf

# Configure a local Spark application and start a SparkContext
conf = SparkConf().setAppName("MyApp").setMaster("local")
sc = SparkContext(conf=conf)

 

If the above code executes without errors, PySpark is successfully installed.
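
As a further sanity check, you can print the Spark version and stop the context when you are done:

print(sc.version)  # the Spark version behind your PySpark installation
sc.stop()          # release local resources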

Key features of PySpark

Some key features of PySpark include:

  • Distributed Data Processing: PySpark enables processing of large datasets by distributing the workload across a cluster of machines
     
  • Resilient Distributed Datasets (RDDs): RDDs are the fundamental data abstraction in PySpark, providing fault-tolerant distributed collections of objects
     
  • Fault Tolerance: PySpark handles failures gracefully, ensuring that tasks are re-executed on other nodes in case of failures
     
  • In-Memory Processing: It leverages in-memory caching to speed up iterative algorithms and interactive data analysis (a short caching sketch follows this list)
     
  • Wide Range of Libraries: PySpark integrates with a variety of libraries and tools for tasks like machine learning (MLlib), graph processing (GraphFrames), and SQL queries (Spark SQL)
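
To illustrate the in-memory processing feature, here is a minimal sketch, assuming an existing SparkContext named sc: after cache(), the second action reuses the data held in executor memory instead of recomputing it.

rdd = sc.parallelize(range(1_000_000)).map(lambda x: x * x)
rdd.cache()         # keep the computed partitions in memory after the first action
print(rdd.count())  # first action: computes the squares and caches them
print(rdd.count())  # second action: served from the in-memory cache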

PySpark Basics

Now that we have set up our environment, let's move on in this PySpark tutorial and explore some of the basics of PySpark.

1. RDDs

RDD stands for Resilient Distributed Dataset. It is the fundamental data structure in Spark: an immutable, distributed collection of objects. RDDs can be created from data stored in the Hadoop Distributed File System (HDFS), the local file system, or any other storage system that Hadoop supports.

Given below is an example showing how to create an RDD:

Code in Python

data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

 

This creates an RDD from a Python list and distributes it across the Spark cluster.
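
You can also check how the list was split across the cluster (here, across local worker threads):

print(rdd.getNumPartitions())  # number of partitions the data was split into
print(rdd.glom().collect())    # the elements grouped per partition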

2. PySpark APIs

SparkContext

The SparkContext can be described as the entry point to the PySpark API. It represents the connection to a Spark cluster and is used to create RDDs (Resilient Distributed Datasets). Once you have an RDD, you can use it to perform operations on your data, such as filtering or aggregating (summing). 

Here is an example of how to create a SparkContext.

Code in Python

from pyspark import SparkContext
sc = SparkContext("local", "PySpark Tutorial")


This creates a SparkContext that runs locally on our machine.
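
With that context in hand, the filtering and aggregating mentioned above look like this (a small sketch on a toy dataset):

numbers = sc.parallelize(range(1, 11))  # the numbers 1 to 10
evens_sum = numbers.filter(lambda x: x % 2 == 0).sum()
print(evens_sum)  # 30, i.e. 2 + 4 + 6 + 8 + 10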

SQLContext

SQLContext is a class in Apache Spark used to establish a connection to data sources and perform SQL queries on distributed datasets. 

It provides a programming interface for working with structured or semi-structured data, for example, CSV, JSON, and Parquet files, and data stored in external databases like MySQL, Oracle, and PostgreSQL. 

With SQLContext, you can easily read and write data to and from these sources, perform data preprocessing and cleaning, and analyze and visualize data using SQL queries and data frames.
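
Here is a minimal sketch of reading a file and querying it through SQLContext; people.json is a placeholder for your own data and is assumed to contain name and age fields. (In recent Spark versions, SparkSession offers the same functionality and is generally preferred.)

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)               # wraps an existing SparkContext
df = sqlContext.read.json("people.json")  # placeholder path
df.createOrReplaceTempView("people")
sqlContext.sql("SELECT name, age FROM people WHERE age >= 18").show()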

3. Transformation

Transformations are operations performed to create a new RDD from an existing one. They are lazy, meaning they don't compute the result immediately but instead create a DAG (Directed Acyclic Graph) representing the computation. Spark evaluates the DAG only when an action is called.

Here are some examples of transformations.

Code in Python

rdd = sc.parallelize([1,2,3,4,5])
a = rdd.filter(lambda x: x % 2 == 0)
b = rdd.map(lambda x: x * 2)
print(a.take(5))
print(b.take(5))

 

Output

[2, 4]
[2, 4, 6, 8, 10]

4. Actions

Actions trigger the computation of the DAG and return a result to the driver program or write data to an external storage system. Here are some examples of actions.

Code in Python

rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.count())    # action: returns the number of elements
print(rdd.collect())  # action: returns all elements to the driver

 

Output

5
[1, 2, 3, 4, 5]
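
Other common actions include reduce, first, and saveAsTextFile. For instance, reusing the same RDD:

print(rdd.reduce(lambda a, b: a + b))  # 15: sums all elements
print(rdd.first())                     # 1: returns the first element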

5. PySpark Libraries

Two of the most commonly used PySpark libraries are PySpark ML and PySpark SQL.

PySpark ML

PySpark ML (the pyspark.ml package) is Spark's DataFrame-based machine learning library. It provides an API for building and training models with popular algorithms such as Random Forest, Gradient Boosted Trees, and K-Means clustering, and for chaining preprocessing and training steps into pipelines. You can also move between Spark and pandas DataFrames with toPandas() and createDataFrame(), or apply pandas functions at scale with Pandas UDFs (User Defined Functions).
For example, you could drop rows with missing values using DataFrame.na.drop() and then build a pipeline that assembles the feature columns and trains a Random Forest classifier:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
assembler = VectorAssembler(inputCols=["col1", "col2"], outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[assembler, rf])
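
Fitting the pipeline is then a single call on a training DataFrame. The sketch below assumes a SparkSession named spark (see the application development section) and uses a tiny, made-up DataFrame with col1, col2, and label columns:

train_df = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (3.0, 4.0, 1.0), (5.0, 6.0, 0.0), (7.0, 8.0, 1.0)],
    ["col1", "col2", "label"],
)
model = pipeline.fit(train_df.na.drop())  # drop missing rows, then run both stages
model.transform(train_df).select("col1", "col2", "prediction").show()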


PySpark SQL

PySpark SQL is a Spark module for structured data processing. It provides a programming interface to work with structured data using Spark. It supports SQL queries, DataFrame API, and SQL functions.

Here is an example of how to create a DataFrame.

Code in Python

data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()

 

Output

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+

This creates a DataFrame from a Python list of tuples and displays it.
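
You can also register the DataFrame as a temporary view and query it with SQL, or express the same filter with the DataFrame API:

df.createOrReplaceTempView("people")
spark.sql("SELECT Name, Age FROM people WHERE Age > 28").show()

df.filter(df.Age > 28).show()  # the equivalent DataFrame API call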

6. PySpark Application Development

PySpark applications are built around a SparkSession, which wraps the lower-level SparkContext.

A SparkSession represents your connection to a cluster and is the entry point for DataFrame, SQL, and streaming functionality in your Python code.

You can create one with the SparkSession.builder API:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MyApp").master("local[*]") \
    .config("spark.driver.memory", "2g").getOrCreate()
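
Once the session exists, you can reach the underlying SparkContext and shut everything down when you are finished:

print(spark.version)     # Spark version in use
sc = spark.sparkContext  # underlying SparkContext, if you need the RDD API
spark.stop()             # release resources when the application is done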

Advantages of PySpark

Here are some of the advantages of PySpark.

  • PySpark can scale to handle large datasets and can run on a cluster of computers, which makes it ideal for processing big data.
     
  • PySpark is designed to process data in parallel, which can significantly speed up data processing. 
     
  • PySpark also uses in-memory processing, which can further improve performance. 
     
  • PySpark is built on top of Apache Spark, which is written in Scala. However, PySpark provides an easy-to-use Python API that allows Python developers to write Spark applications using familiar Python syntax. 
     
  • PySpark supports a wide range of data sources, including Hadoop Distributed File System (HDFS), Apache Cassandra, Apache HBase, and Amazon S3. PySpark can also be integrated with other big data tools, such as Apache Kafka and Apache Flume.
     
  • PySpark provides a comprehensive machine learning library called MLlib, which includes a wide range of algorithms for classification, regression, clustering, and collaborative filtering. MLlib also includes tools for feature extraction, transformation, and selection. 
     
  • PySpark supports streaming data processing through Spark Streaming, which allows you to process real-time data streams in a scalable and fault-tolerant manner (a minimal streaming sketch follows this list). 
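
As a taste of streaming, here is a minimal sketch using the newer Structured Streaming API (rather than the older DStream-based Spark Streaming); it assumes an existing SparkSession named spark and uses the built-in rate source, which generates rows for testing:

stream_df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
query = stream_df.writeStream.format("console").outputMode("append").start()
query.awaitTermination(10)  # let the stream run for about 10 seconds
query.stop()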

Frequently Asked Questions

What is the difference between a transformation and an action in PySpark?

Transformations are operations that create a new RDD from an existing one, while actions trigger the computation of the DAG and return a result to the driver program or write data to an external storage system.

What programming language is PySpark written in?

PySpark is built on top of Apache Spark, written in Scala. However, PySpark provides a Python API that allows developers to write Spark applications using Python syntax.

What is PySpark used for?

PySpark is used for big data processing and analysis. It allows developers to process large datasets in a distributed computing environment and provides various tools for data manipulation, machine learning, and real-time data processing.

Is PySpark suitable for small datasets?

PySpark is designed for big data processing and may not be the best choice for small datasets. However, PySpark can be used for small datasets if the data processing requirements are complex and require distributed computing.

Conclusion

PySpark is a powerful big data processing and analysis framework that provides a Python API for interacting with Spark. In this PySpark tutorial, we explored the basics of PySpark, including the SparkContext, RDD, transformations, actions, and PySpark SQL. 

With this knowledge, you can start building your own PySpark applications and efficiently processing large amounts of data.  Keep practicing and experimenting with different features to fully leverage the power of PySpark.


Happy Learning!!
