Features of Spark ML
Let’s look at some of the features that make Apache Spark so impressive to work with.
Lightning-fast speed
Big Data processing means handling huge amounts of complex data, so organisations want a framework that can process it at massive speed. Spark delivers exactly that: applications running on Spark can be up to 100x faster in memory and up to 10x faster on disk than Hadoop MapReduce.
Spark achieves this with an abstraction called the Resilient Distributed Dataset (RDD), which lets it transparently keep data in memory and touch the disk only when needed, greatly reducing the number of disk accesses during processing.
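As a minimal sketch of how this plays out in the DataFrame API (the data here is generated in memory purely for illustration), caching keeps a dataset resident so repeated actions avoid recomputing or re-reading it:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching demo").getOrCreate()

# A small DataFrame generated in memory; real data would come from storage
df = spark.range(0, 1000000)

df.cache()   # mark the DataFrame for in-memory storage
df.count()   # the first action materialises and caches the data
df.filter(df["id"] % 2 == 0).count()   # later actions reuse the cache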
Easy to use
Spark supports programming languages like Java, Python, Scala, and R, so programmers can write scalable applications in the language of their choice.
It can also be used to query data interactively from the Python, Scala, R, and SQL shells.
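For example, here is a minimal sketch of querying a DataFrame with SQL from Python (the tiny dataset is invented for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql demo").getOrCreate()

# Register a small DataFrame as a temporary view so it can be queried with SQL
people = spark.createDataFrame([("Alice", 34), ("Bob", 28)], ["name", "age"])
people.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 30").show()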
Real-time stream processing
Spark can handle real-time data streams and manipulate data on the fly with Spark Streaming. Spark Streaming can recover lost work and deliver exactly-once semantics out of the box, and the same code can be reused for both batch and stream processing.
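As a rough sketch of the streaming API (this uses Structured Streaming, the DataFrame-based streaming API, with the built-in "rate" source, which generates rows continuously and is handy for demos), batch-style DataFrame operations apply directly to a stream:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming demo").getOrCreate()

# The built-in "rate" source emits a timestamp and a counter value per row
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# The same filter we would write for a batch job works on the stream
evens = stream.filter(stream["value"] % 2 == 0)

# Print results to the console, run briefly, then stop the stream
query = evens.writeStream.format("console").start()
query.awaitTermination(10)
query.stop()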
Flexibility
Spark can run independently in cluster mode, or on Hadoop YARN, Kubernetes, and even in the cloud. Moreover, it can access diverse data sources.
Sophisticated Analysis
Spark offers much more than simple "map" and "reduce" operations. It also supports SQL queries and advanced analytics, including machine learning and graph algorithms, through powerful built-in libraries such as Spark SQL, MLlib, GraphX, and Spark Streaming.
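To give a flavour of analysis beyond plain map/reduce, here is a minimal sketch of a grouped aggregation with the DataFrame API (the sales data is invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("analytics demo").getOrCreate()

sales = spark.createDataFrame(
    [("north", 100.0), ("north", 250.0), ("south", 75.0)],
    ["region", "amount"])

# A grouped aggregation, expressed in one line rather than a map/reduce job
sales.groupBy("region").agg(F.sum("amount").alias("total")).show()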
Pipelines in Spark ML
It is common in machine learning to run a sequence of algorithms to process and learn from data. A Pipeline consists of a set of stages (Transformers and Estimators) that run in a specific order.
Transformers
A Transformer is an abstraction in MLlib that implements the method transform(), which converts one DataFrame into another, generally by appending one or more columns to the existing DataFrame.
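For instance, Tokenizer is a built-in Transformer whose transform() call appends a column of word tokens. A minimal, self-contained sketch:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer

spark = SparkSession.builder.appName("transformer demo").getOrCreate()

df = spark.createDataFrame([(0, "spark ml pipelines")], ["id", "text"])

# transform() returns a new DataFrame with an added "words" column
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tokenizer.transform(df).show(truncate=False)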
Estimators
An Estimator abstracts the concept of an ML algorithm that fits or trains on data. An Estimator implements the method fit(), which accepts a DataFrame and produces a Model.
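For instance, StringIndexer is a built-in Estimator: fit() learns a mapping from string labels to numeric indices and returns a Model, which is itself a Transformer. A minimal sketch:

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.appName("estimator demo").getOrCreate()

df = spark.createDataFrame([("a",), ("b",), ("a",)], ["category"])

# fit() trains on the data and produces a StringIndexerModel
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = indexer.fit(df)
model.transform(df).show()   # the fitted model then transforms data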
A Pipeline consists of a sequence of Transformer and Estimator stages that run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the Pipeline calls transform() on the DataFrame; for Estimator stages, it calls fit() to produce a Transformer, which then transforms the DataFrame.
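Putting the pieces together, here is a minimal Pipeline sketch, adapted from the text-classification example in the Spark documentation, chaining two Transformers (Tokenizer, HashingTF) and one Estimator (LogisticRegression):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

spark = SparkSession.builder.appName("pipeline demo").getOrCreate()

training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)], ["id", "text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)

# fit() runs the stages in order and returns a PipelineModel
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)
model.transform(training).select("id", "prediction").show()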
Code Examples In Spark ML
Using the spark.ml package, we can build many kinds of machine learning models, such as logistic regression, linear regression, K-means clustering, and decision tree classifiers.
Let's implement a binomial logistic regression classifier to better understand classification models in Spark.
We can run Spark in a Jupyter notebook, Google Colab, etc. Here we are using Google Colab.
To use Spark on Google Colab, first open a new notebook and run this command:
!pip install pyspark
Now our Google Colab notebook is set up to use Spark with Python.
To start, we will create a new SparkSession.
from pyspark.sql import SparkSession

# The SparkSession is the entry point for loading data into DataFrames
spark = SparkSession.builder.appName('logistic regression').getOrCreate()

Next, we need sample data for the logistic regression. We will use sample_libsvm_data.txt, which ships with Spark (under data/mllib in the Spark distribution and its GitHub repository); download the .txt file into the notebook's working directory.
Now we will load the data into this Spark session.
df = spark.read.format("libsvm").load("sample_libsvm_data.txt")
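Before training, it is worth peeking at what the libsvm reader produced; it yields a DataFrame with a "label" column and a "features" vector column:

df.printSchema()
df.show(5, truncate=False)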

We are now ready to implement logistic regression.
from pyspark.ml.classification import LogisticRegression

# Configure the model: 10 iterations with elastic-net regularisation
logreg = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the logistic regression model to the data
model = logreg.fit(df)

# Print the coefficients and intercept of the fitted model
print("Coefficients: " + str(model.coefficients))
print("Intercept: " + str(model.intercept))

Frequently Asked Questions
What is Clustering in ML?
Clustering is an unsupervised machine learning technique that groups similar data points together without using labeled data.
What is supervised learning?
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples.
What is PySpark?
PySpark is the Python interface for Apache Spark. Using PySpark, we can write Python commands to analyze data in a distributed processing environment.
Conclusion
In this article, we got a brief introduction to Spark ML. We looked at the features that make Spark fast and flexible, learned how Pipelines combine Transformers and Estimators, and implemented a binomial logistic regression classifier in PySpark.
Also, check out some of the Guided Paths, Contests, and Interview Experiences to gain an edge only on Coding Ninjas Studio.
