Table of contents
1. Introduction
2. What is Apache Spark GraphX?
3. Features of Apache Spark GraphX
4. Uses of Apache Spark GraphX
   4.1. Social Media
   4.2. Web Analytics
   4.3. Transportation
   4.4. Fraud detection
   4.5. Biological Analysis
5. Performing Data Analysis using GraphX
   5.1. Code
   5.2. Output
   5.3. Explanation
6. Frequently Asked Questions
   6.1. What types of graphs can be processed using GraphX?
   6.2. Can GraphX handle dynamic graphs that change over time?
   6.3. Is GraphX limited to Scala programming language?
   6.4. Does GraphX provide support for graph visualisation?
   6.5. Why is GraphX so efficient?
7. Conclusion
Last Updated: Mar 27, 2024

Introduction to Apache Spark GraphX


Introduction

A graph is a powerful data structure for representing complex, interconnected data, and the need to process large graphs keeps growing. Processing them becomes challenging when the data is distributed across multiple machines.

Apache Spark is a data analysis engine that efficiently processes large volumes of data of many different types. Apache Spark GraphX is the Spark library dedicated to fast graph processing.


In this article, we will explore the main features and uses of Apache Spark GraphX. We will also understand how to analyse data using GraphX with an example.

What is Apache Spark GraphX?

Apache Spark GraphX is a library built on top of Apache Spark. It extends the Spark RDD (Resilient Distributed Dataset) API to support graph computations and analysis. GraphX represents a graph as a directed multigraph whose vertices and edges carry user-defined properties. A directed multigraph is a graph that may have multiple directed edges between the same pair of nodes.

GraphX allows us to seamlessly work with graphs and collections and efficiently transform and join graphs with RDDs. 
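As a minimal sketch of joining a graph with an RDD (run in the Spark shell, where sc is predefined; the page names and visit counts below are made up purely for illustration):

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// A toy graph: two pages connected by one link (illustrative data only)
val pages: RDD[(VertexId, String)] = sc.parallelize(Seq((1L, "home"), (2L, "about")))
val links: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 1)))
val pageGraph: Graph[String, Int] = Graph(pages, links)

// An ordinary RDD of per-page visit counts
val visits: RDD[(VertexId, Int)] = sc.parallelize(Seq((1L, 120), (2L, 45)))

// outerJoinVertices merges the RDD into the graph's vertex attributes
val enriched: Graph[(String, Int), Int] =
  pageGraph.outerJoinVertices(visits) { (_, name, v) => (name, v.getOrElse(0)) }

enriched.vertices.collect().foreach(println)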

GraphX unifies various aspects of graph processing:

  1. GraphX helps perform the ETL process on graphs by allowing us to create, manipulate and join graphs with other data sources. ETL (Extract, Transform and Load) is a process of extracting data from multiple sources, converting it into a desired format and then loading it into a destination for further analysis.
     
  2. GraphX helps perform exploratory analysis by providing different methods to filter and query graph data. Exploratory analysis is data analysis that uncovers valuable insights and patterns.
     
  3. GraphX offers different methods for iterative graph computations. For example, PageRank is an algorithm in GraphX which iteratively computes a vertex's rank from the ranks of its neighbours. A short sketch of these operations follows the import statements below.

 

For using GraphX in Scala, we need to import the following packages:

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
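With these imports in place, the following is a brief, illustrative sketch of the exploratory and iterative operations listed above (run in the Spark shell; the tiny graph is invented for this example):

// Build a small illustrative graph
val users = sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val g = Graph(users, follows)

// Exploratory analysis: keep only edges whose endpoint IDs differ by one
val filtered = g.subgraph(epred = t => math.abs(t.srcId - t.dstId) == 1)
println(s"Edges after filtering: ${filtered.edges.count()}")

// Iterative computation: run PageRank until the scores converge within 0.001
val ranks = g.pageRank(0.001).vertices
ranks.collect().foreach(println)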

Features of Apache Spark GraphX

Some of the main features of Apache Spark GraphX are:

  1. GraphX is flexible, allowing us to seamlessly work with graphs and collections. It also enables us to create graphs from various data sources and seamlessly integrates with other components of the Spark ecosystem.
     
  2. GraphX is fast and performs on par with specialised graph processing systems. Using the power of Spark, it can efficiently work with graphs that have billions of nodes.
     
  3. GraphX has a rich library of graph algorithms, such as SVD++, PageRank and connected components. A short sketch of loading a graph and running one of these algorithms follows this list.
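For example, here is a small sketch of loading a graph from an edge-list file and running a built-in algorithm. The file path is a placeholder, and the snippet assumes a Spark shell with the GraphX imports shown earlier:

// Load a graph from an edge-list text file (one "srcId dstId" pair per line);
// the path below is hypothetical
val webGraph = GraphLoader.edgeListFile(sc, "data/edges.txt")

// Built-in algorithm: label each vertex with the smallest vertex ID in its connected component
val components = webGraph.connectedComponents().vertices
components.take(5).foreach(println)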

Uses of Apache Spark GraphX

Apache Spark GraphX is used wherever graph analysis is needed. Some popular uses include:

Social Media

GraphX is used to analyse social media traffic and find details like influencers, trends or recommendations. This helps businesses and individuals reach a larger audience and create more engaging content.

Web Analytics

GraphX is used in Web analytics to optimise web performance and user experience by analysing web data. This helps websites to improve their speed for better customer satisfaction.

Transportation

GraphX helps optimise transportation networks such as roads, railways and flights, since it can compute optimal routes and their durations. This helps us save time and money.
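As a rough, illustrative sketch of this idea (not a production routing system), GraphX's Pregel API can compute single-source shortest travel times over such a network. The city IDs and travel times below are invented, and sc comes from the Spark shell:

// Hypothetical road network: vertices are cities, edge attributes are travel times in hours
val cities = sc.parallelize(Seq((1L, "CityA"), (2L, "CityB"), (3L, "CityC"), (4L, "CityD")))
val roads = sc.parallelize(Seq(
  Edge(1L, 2L, 2.0), Edge(2L, 3L, 1.5), Edge(1L, 3L, 4.0), Edge(3L, 4L, 2.5)
))
val roadGraph = Graph(cities, roads)

// Single-source shortest paths from vertex 1 using the Pregel API
val sourceId: VertexId = 1L
val init = roadGraph.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)

val shortest = init.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),   // vertex program: keep the smaller distance
  triplet =>                                        // send messages only along shorter paths
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty,
  (a, b) => math.min(a, b)                          // merge messages: take the minimum
)

shortest.vertices.collect().foreach { case (id, d) => println(s"Shortest travel time to $id: $d") }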

Fraud detection

GraphX is also used in fraud detection. It can detect unusual patterns and identify fraud by conducting network-based fraud analysis. This helps banks and businesses to prevent losses and protect customers from scams.
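One simple network-based signal, sketched here for illustration only, is an account with an unusually large number of transactions. The toy transaction graph and the degree threshold are invented, and the snippet assumes the Spark shell with the GraphX imports shown earlier:

// Hypothetical transaction graph: accounts as vertices, transfers as weighted edges
val accounts = sc.parallelize((1L to 6L).map(id => (id, s"account$id")))
val transfers = sc.parallelize(Seq(
  Edge(1L, 2L, 100.0), Edge(1L, 3L, 50.0), Edge(1L, 4L, 75.0),
  Edge(1L, 5L, 20.0), Edge(2L, 6L, 10.0)
))
val txGraph = Graph(accounts, transfers)

// Flag accounts whose degree (number of transactions) exceeds an arbitrary threshold
val suspicious = txGraph.degrees.filter { case (_, degree) => degree >= 4 }
suspicious.collect().foreach { case (id, degree) => println(s"Account $id has $degree transactions") }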

Biological Analysis

GraphX is used to analyse biological networks and help in drug discovery and understanding biological pathways. This helps scientists and researchers to find new treatments and cures for diseases.

Performing Data Analysis using GraphX

To show how we can use GraphX to perform data analysis, we will use an example of a small social network. This network consists of six users and their friendship relationships. We will perform various analysis tasks in the following code to gain insights about the social network.

Code

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Create an RDD of vertices
val vertices: RDD[(VertexId, (String, Int))] = sc.parallelize(Array(
	(1L, ("A", 28)),
	(2L, ("B", 27)),
	(3L, ("C", 65)),
	(4L, ("D", 42)),
	(5L, ("E", 55)),
	(6L, ("F", 50))
))

// Create an RDD of edges
val edges: RDD[Edge[Double]] = sc.parallelize(Array(
	Edge(2L, 1L, 7.0),
	Edge(2L, 4L, 2.0),
	Edge(3L, 2L, 4.0),
	Edge(3L, 6L, 3.0),
	Edge(4L, 1L, 1.0),
	Edge(5L, 2L, 2.0),
	Edge(5L, 3L, 8.0),
	Edge(5L, 6L, 3.0)
))

// Create a graph from the vertices and edges
val graph: Graph[(String, Int), Double] = Graph(vertices, edges)

// Vertex and Edge Count
val vertexCount = graph.numVertices
val edgeCount = graph.numEdges
println(s"Number of vertices: $vertexCount")
println(s"Number of edges: $edgeCount")

// Vertex and Edge properties
val vertexNames = graph.vertices.map { case (_, (name, _)) => name }
println("Vertex Names:")
vertexNames.collect().foreach(println)

val edgeWeights = graph.edges.map(_.attr)
println("Edge Weights:")
edgeWeights.collect().foreach(println)

// Compute the PageRank score for each vertex (0.001 is the convergence tolerance)
val pageRank = graph.pageRank(0.001).vertices
println("PageRank Scores:")
pageRank.collect().foreach(println)

// Join the scores with the vertex properties and take the three highest-ranked users
val topInfluencers = pageRank.join(vertices).sortBy(_._2._1, ascending = false).take(3)
println("Top 3 Influential Users:")
topInfluencers.foreach { case (id, (score, (name, age))) =>
	println(s"$name (age: $age) has a PageRank score of $score")
}


To run this code, save it in a file, for example, code.scala. Then open the Spark Shell using the spark-shell command. 

Then load the code into the shell using the following command.

:load code.scala

Output

The program prints the number of vertices (6) and edges (8), the vertex names, the edge weights, the PageRank score of each vertex, and finally the names, ages and scores of the top three influential users.

Explanation

The code executes the following steps:

  1. It first imports the required libraries for GraphX and RDDs.
     
  2. It then creates an RDD of vertices with the users' names and ages as properties. The vertices represent the users in our social network.
     
  3. It then creates an RDD of edges with friendship scores as properties. These friendship scores are the edge weights.
     
  4. It then creates a Graph from vertices and edges RDDs using the Graph class from GraphX.
     
  5. Now, the graph is ready for different graph operations to analyse data. First, the code prints the graph's number of vertices and edges.
     
  6. After this, it prints the vertex names and the edge weights.
     
  7. After this, it uses the PageRank algorithm to calculate the importance score of each vertex. It also prints these importance scores.
     
  8. It then finds and prints the top 3 influential users based on their PageRank scores.
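The same graph can be analysed further with other GraphX operators. As an illustrative extension of the example above (reusing graph and vertices from the code), aggregateMessages can sum each user's incoming friendship scores:

// Sum each user's incoming friendship scores; only users with at least one incoming edge appear
val incomingScore = graph.aggregateMessages[Double](
	ctx => ctx.sendToDst(ctx.attr),   // send each edge weight to its destination vertex
	_ + _                             // merge incoming messages by summing them
)

println("Total incoming friendship score per user:")
incomingScore.join(vertices).collect().foreach {
	case (_, (total, (name, _))) => println(s"$name: $total")
}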

Frequently Asked Questions

What types of graphs can be processed using GraphX?

GraphX models every graph as a directed multigraph. Undirected graphs can still be processed by ignoring edge direction or by adding edges in both directions.

Can GraphX handle dynamic graphs that change over time?

GraphX does not have inbuilt support for dynamic graphs. However, it is possible to use external libraries or frameworks to handle dynamic graphs.

Is GraphX limited to Scala programming language?

GraphX's API is written in Scala and can also be used from other JVM languages such as Java. There is no native Python API for GraphX; Python users typically turn to the separate GraphFrames library instead.

Does GraphX provide support for graph visualisation?

GraphX does not directly support graph visualisation, but it allows us to export graphs to external tools for visualisation.
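One simple approach, sketched here with the example graph from earlier, is to dump the graph's triplets to a plain text file that an external tool such as Gephi or Graphviz can then import. The output path is a placeholder:

// Write each edge as "sourceName destinationName weight"; the directory path is hypothetical
graph.triplets
	.map(t => s"${t.srcAttr._1} ${t.dstAttr._1} ${t.attr}")
	.saveAsTextFile("output/social_edges")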

Why is GraphX so efficient?

GraphX is so efficient because of its integration with Apache Spark, which enables distributed and parallel processing. It also uses various optimisations like memory caching.
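For instance, caching a graph in memory before running several algorithms on it avoids recomputing it for each job. A minimal sketch reusing the example graph from earlier:

// Keep the graph in memory across multiple jobs
val cachedGraph = graph.cache()

val prCount = cachedGraph.pageRank(0.001).vertices.count()
val ccCount = cachedGraph.connectedComponents().vertices.count()
println(s"Computed PageRank for $prCount vertices and connected components for $ccCount vertices")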

Conclusion

This article provided an overview of Apache Spark GraphX, a powerful graph computation and analysis library. We discussed its main features and uses. GraphX simplifies graph processing and is an essential tool for data analysis. 

To learn more about Apache Spark, we recommend exploring our other articles on the topic.

If you liked our article, do upvote our article and help other ninjas grow. You can refer to our Guided Path on Coding Ninjas Studio to upskill yourself in Data Structures and Algorithms, Competitive Programming, System Design, and many more!

Happy Learning!
