Table of contents
1. Introduction
2. What is Spark ML?
3. Features of Spark ML
4. Pipelines in Spark ML
4.1. Transformers
4.2. Estimators
5. Code Examples In Spark ML
6. Frequently Asked Questions
6.1. What is Clustering in ML?
6.2. What is supervised learning?
6.3. What is PySpark?
7. Conclusion
Last Updated: Mar 27, 2024

Introduction To Spark ML

Author: Sajid Khan

Introduction

This article provides an introduction to Spark ML. We will look at the features of Spark ML, understand the concept of pipelines, and end with a few code examples that show Spark ML in use.

 


What is Spark ML?

Apache Spark is a data processing framework used for processing huge datasets. With Spark, we can distribute data processing across multiple computers, either on its own or together with other distributed-computing tools. It is a very fast analytics engine for machine learning and big data processing.

PySpark is the Python API for Spark; it is what lets us use Spark and its functionality from Python.

Spark can also be used for running distributed SQL, creating data pipelines, ingesting data into a database, working with graphs, etc.
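As a quick illustration of what working with PySpark looks like, here is a minimal sketch; the column names and sample rows below are made up purely for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('intro example').getOrCreate()

# Create a small DataFrame from in-memory data (hypothetical sample rows)
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45)],
    ["name", "age"],
)

# Run a simple distributed query on it
df.filter(df.age > 40).show()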

Features of Spark ML

Let’s look at some of the features that make Apache Spark so impressive to work with.

  • Lightning-fast speed
    Big data processing means handling huge amounts of complex data, so organisations want a framework that can process large amounts of data at massive speed. Spark helps with that: applications running on Spark can be up to 100x faster in memory and up to 10x faster on disk than Hadoop MapReduce.
    Spark uses a data structure called the Resilient Distributed Dataset (RDD), which lets it transparently keep data in memory and read from or write to disk only when needed, reducing the number of disk accesses during data processing (see the caching sketch just after this list).
     
  • Easy-to-use
    Spark supports programming languages like Java, Python, Scala, and R, so it allows a programmer to write scalable applications in the language of their choice.
    It can also be used to query data interactively from the Python, Scala, R, and SQL shells.
     
  • Real-time stream processing
    Spark can handle real-time data streams and manipulate data in real time with Spark Streaming. Spark Streaming can also recover lost work and deliver exactly-once semantics, and the same code can be reused for both batch and stream processing.
     
  • Flexibility
    Spark can run independently in cluster mode, or it can run on Hadoop YARN, Kubernetes, and even in the cloud. Moreover, it can access diverse data sources.
     
  • Sophisticated Analysis
    Spark offers much more than simple “map” and “reduce” operations. It also supports SQL queries and advanced analytics, including ML and graph algorithms, through powerful libraries such as Spark SQL, MLlib, GraphX, and Spark Streaming.



Pipelines in Spark ML

It is common in machine learning to run a sequence of algorithms to process and learn from data. A Pipeline consists of a set of stages (Transformers and Estimators) that run in a specific order.

Transformers

A Transformer is an abstraction in MLlib that implements the method transform(), which converts one DataFrame into another DataFrame, generally by appending one or more columns to the input DataFrame.
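For instance, the Tokenizer feature transformer is a Transformer: transform() reads a text column and appends a new column of tokens. A minimal sketch, where the sentences are made-up sample data:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer

spark = SparkSession.builder.appName('transformer demo').getOrCreate()

# Hypothetical sample data with a text column
sentences = spark.createDataFrame(
    [(0, "spark makes big data simple"), (1, "mllib builds ml pipelines")],
    ["id", "sentence"],
)

# Tokenizer is a Transformer: transform() returns a new DataFrame
# with an extra "words" column, leaving the input columns intact
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(sentences)
tokenized.show(truncate=False)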

Estimators

An Estimator abstracts the concept of an ML algorithm that fits or trains on data. An Estimator implements a method fit() that accepts a DataFrame and produces a Model, which is itself a Transformer.


A Pipeline consists of a sequence of Transformer and Estimator stages. These stages run in order, and the input DataFrame is transformed as it passes through each stage. For a Transformer stage, the transform() method is called on the DataFrame; for an Estimator stage, the fit() method is called to produce a fitted Model, which then transforms the DataFrame.
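Putting the two together, a small Pipeline might chain a Tokenizer, a HashingTF feature transformer, and a LogisticRegression estimator. A hedged sketch, where the training sentences and labels are invented for illustration:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName('pipeline demo').getOrCreate()

# Hypothetical labelled training data
training = spark.createDataFrame(
    [(0, "spark is fast", 1.0), (1, "hadoop map reduce", 0.0)],
    ["id", "text", "label"],
)

# Three stages: two Transformers followed by an Estimator
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# fit() runs every stage in order and returns a PipelineModel,
# which can then transform new DataFrames end to end
model = pipeline.fit(training)
model.transform(training).select("text", "prediction").show()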

Code Examples In Spark ML

In Spark ML, we can build various kinds of machine learning models using spark.ml, such as logistic regression, linear regression, K-means clustering, decision tree classifiers, etc.

Let’s implement a binomial logistic regression classifier to better understand Classification models in Spark.

We can use Spark in a Jupyter notebook, Google Colab, etc. Here we are using Google Colab.

To use Spark on Google Colab, first open a new notebook on Colab and then run this command:

!pip install pyspark

Now our Google Colab notebook is all set to use Spark with Python.

To start, we will build a new Spark session.

from pyspark.sql import SparkSession  # SparkSession is the entry point for loading data into DataFrames

# Build (or reuse) a Spark session for this notebook
spark = SparkSession.builder.appName('logistic regression').getOrCreate()

 

Next, we need sample data for the logistic regression. We will use the sample_libsvm_data.txt file, a small LIBSVM-format dataset that ships with the Spark distribution under data/mllib; download it into the working directory.

Now we will load the data into this Spark session.

df = spark.read.format("libsvm").load("sample_libsvm_data.txt")  # loads a DataFrame with "label" and "features" columns

We are ready to implement the logistic regression.

from pyspark.ml.classification import LogisticRegression

# Configure the logistic regression estimator
logreg = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fitting the data in the logistic regression model
model = logreg.fit(df)

# Print the coefficients and intercept for logistic regression
print("Coefficients: " + str(model.coefficients))
print("Intercept: " + str(model.intercept))
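As an optional follow-up sketch, reusing the model and df from the code above, we can inspect the training summary or run the fitted model back over the data; areaUnderROC is available here because this is a binary classifier.

# Optional follow-up, assuming `model` and `df` from the code above
summary = model.summary
print("Objective history: ", summary.objectiveHistory)  # loss value per training iteration
print("Area under ROC: ", summary.areaUnderROC)

# The fitted model is a Transformer, so it can score a DataFrame directly
predictions = model.transform(df)
predictions.select("label", "prediction", "probability").show(5)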

 

Frequently Asked Questions

What is Clustering in ML?

Clustering is an unsupervised machine learning technique that groups data points so that points within the same group (cluster) are more similar to each other than to points in other clusters.
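In Spark ML, clustering is available through estimators such as KMeans. A minimal sketch, where the feature vectors are invented toy data:

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName('kmeans demo').getOrCreate()

# Tiny made-up dataset of 2-dimensional feature vectors
data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([0.1, 0.1]),),
     (Vectors.dense([9.0, 9.0]),), (Vectors.dense([9.1, 9.1]),)],
    ["features"],
)

# KMeans is an Estimator: fit() learns the cluster centres
kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(data)
print(model.clusterCenters())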

What is supervised learning?

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. 

What is PySpark?

PySpark is the Python interface for Apache Spark. Using PySpark, we can write Python code to analyze data in a distributed processing environment.

Conclusion

In this article, we got a brief introduction to Spark ML. We looked at the features of Spark ML and at pipelines built from Transformers and Estimators. We also saw how to use Spark ML to build a classification model with logistic regression.

Also, check out some of the Guided Paths, Contests, and Interview Experiences to gain an edge only on Coding Ninjas Studio.

 

Thank you.

 
