Table of contents
1.
Introduction
2.
DBSCAN
3.
How Does DBSCAN actually find the clusters?
4.
DBSCAN Pseudocode
5.
MinPts and 𝛜
6.
Examples or Comparison of DBSCAN with Other clustering methods
7.
DBSCAN using Python
8.
Code Example 
9.
FAQs
10.
Key Takeaways
Last Updated: Mar 27, 2024
Easy

DBSCAN

Career growth poll
Do you think IIT Guwahati certified course can help you in your career?

Introduction

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise, a density-based clustering algorithm similar to K-Means (a centroid Based clustering algorithm) and Agglomerative (a Hierarchical clustering algorithm). The DBSCAN algorithm grows regions with sufficiently high Density into clusters and discovers clusters of arbitrary shape in spatial datasets with noise. It basically defines a cluster as a maximal set of density-connected points. Generally, we partition points into dense regions by not-so-dense regions in a density-based clustering algorithm.

DBSCAN

We need to learn some new definitions before diving deep into how the DBSCAN algorithm finds the clusters and how it exactly works.

  • Density at a point P:  number of points within a region of radius đť›ś (Epsilon).
  • 𝛜-neighborhood: The neighborhood within a radius đť›ś of a given object is called đť›ś-neighborhood of the object
  • Density Region: A region of radius đť›ś that contains at least MinPts (Minimum Points) points.

 

                                                                                  source

  • Characterization of Points:
    • A point is said to be a core point if it has more than a specified number of points (MinPts) within a đť›ś. A core point will belong to the interior of the dense region.
    • A point is said to be a border point if it has fewer than MinPts, but it is a neighborhood of the core point.
    • A noise point is any point that is not a core point or a border point.

                                                                              source

  • Density EdgeAn edge is a connection between two core points, p and q if they are within distance 𝛜.
  • Directly density-reachable: An object p is directly density reachable from object q if p is đť›ś-neighborhood of q and q is a core object.
  • Density reachable: An object p is density-reachable from object q if q is a core object and p and q are density-connected objects.
  • Density Connected: A point p is density connected to point q if there is a path of edges from p to q.

How Does DBSCAN actually find the clusters?

DBSCAN searches for clusters for a given dataset by checking the đť›ś-neighborhood of each point in the database. If the đť›ś-neighborhood of a point p contains more than MinPts, a new cluster with p as a core object/point is created. DBSCAN then iteratively collects directly density-reachable objects from these core objects, which may involve merging the density-reachable clusters. The process terminates when no new point can be added to any cluster.

DBSCAN Pseudocode

Input: Dataset, đť›ś, MinPts [,additional parameters]

Output: Clusters of Data Points.

Process:

  • For Each data point of the dataset, label the points as core, border, and noisy points.
  • Remove the noisy points from the game.
  • For every core point p of the dataset that has not been assigned to a cluster, 
    • Create a new cluster with core point and all the points that are density-connected to p
  • Then Assign the border points to the cluster of the closest core point.
  • This process continues until no new point cannot be added to any cluster.

MinPts and 𝛜

MinPts: 

  1. In General, the MinPts must be greater than or equal to d+1, d = dimensionality of the dataset. Or in most cases, the Minpts will be twice the dimensionality.
  2. If the dataset is too noisy, then the value of MinPts should be large.
  3. Or, based on the user/domain interest, we can use their defined MinPts.

Eps (𝛜): 

           We can use the Knee/Elbow method to find the best Eps value. You can get more details regarding this method here: Finding Best Epsilon Value.

Examples or Comparison of DBSCAN with Other clustering methods

                                                                               source

DBSCAN using Python

Python scikit-learn Library provides an in-built DBSCAN method to apply the DBSCAN algorithm to cluster the dataset.

We can import DBSCAN from sklearn as follows:

from sklearn.cluster import DBSCAN
You can also try this code with Online Python Compiler
Run Code

And then apply DBSCAN for dataset X as follow:

clustering = DBSCAN(eps=value, min_samples=value2).fit(X)
You can also try this code with Online Python Compiler
Run Code

We can view how the clusters formed using clustering.labels_

This line gives an array of values of length of the dataset containing values ranging from 0 to n and -1, where each value represents a cluster and -1 represents a noise point.

Code Example 

from sklearn.cluster import DBSCAN
import numpy as np
X = np.array([[1, 5], [2, 3], [5,10],
              [8, 5], [8, 6], [70, 8]])
clustering = DBSCAN(eps=3, min_samples=2).fit(X)
print(clustering.labels_)
You can also try this code with Online Python Compiler
Run Code


Output: array([ 0,  0, -1,  1,  1, -1], dtype=int64) 

#the values with 0 are one cluster and the values with 1 are belonging to another cluster. Here -1 represents that the point [70, 8] is a noisy point.

You can refer to more about DBSCAN here sklearn DBSCAN

FAQs

  1. What is DBSCAN stands for?
    DBSCAN, a clustering technique, stands for Density-Based Spatial Clustering of Applications with Noise. Here Density-Based refers that this technique uses Density as a concept to cluster the points on considering noise points removal.
  2. What are the parameters of the DBSCAN algorithm?
    There are majorly two parameters used in the DBSCAN algorithm: Epsilon and    MinPts, where epsilon is the radius of arbitrary shape, and MinPts is the number of minimum points that are to be considered to be in a cluster.
  3. What are the time and space complexities of DBSCAN?
    Time Complexity: O(n), can be reduced to O(n*logn) by using efficient hyperparameters such as Eps and MinPts. Space Complexity: The space complexity is O(n), n is the number of points to be clustered.
  4. Is there any built-in fancy method to use using Python?
    Python’s scikit-learn library provides a DBSCAN method that can be imported from sklearn.cluster module and can be used easily.

Key Takeaways

We have discussed more briefly the DBSCAN algorithm in this article, how it works, what are the essential features that this algorithm supports, what is the exact process of this clustering technique, and how it can be used using the scikit-learn built-in method, etc.,

Hey Ninjas! You can check out more unique courses on machine learning concepts through our official website, Coding Ninjas, and checkout Coding Ninjas Studio to learn through articles and other important stuff to your growth.

Happy Learning!

Live masterclass