Table of contents
1. Introduction
2. Role of Distance Measures
3. Hamming Distance
   3.1. Formula
   3.2. Example
4. Euclidean Distance
   4.1. Formula
   4.2. Example
5. Manhattan Distance (Taxicab or City Block)
   5.1. Formula
   5.2. Example
6. Minkowski Distance
   6.1. Formula
   6.2. Example
7. Mahalanobis Distance
   7.1. Formula
   7.2. Example
8. Cosine Similarity
   8.1. Formula
   8.2. Example
9. Frequently Asked Questions
   9.1. Which distance measure is best for categorical data?
   9.2. How does Euclidean Distance differ from Manhattan Distance?
   9.3. When should I use Cosine Similarity instead of Euclidean Distance?
10. Conclusion
Last Updated: Mar 13, 2025

Distance-Based Algorithm in Data Mining


Introduction

Distance-based algorithms in data mining are used to classify, cluster, or retrieve data points based on their distance from one another. These algorithms measure similarity using distance metrics like Euclidean, Manhattan, or Minkowski distances. They are widely applied in clustering (K-Means), classification (K-Nearest Neighbors), and anomaly detection. 


In this article, we will explore different distance-based algorithms, their working principles, advantages, limitations, and real-world applications.

Role of Distance Measures

Distance measures help determine how similar or different two data points are. They are widely used in machine learning models such as k-Nearest Neighbors (k-NN), clustering algorithms (like K-Means), and recommendation systems. The choice of distance metric can significantly impact model accuracy and efficiency.
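The choice of metric can change which point counts as "nearest". As a small sketch with made-up points, the same query point can have a different nearest neighbour under Euclidean and Manhattan distance:

```python
from math import sqrt

def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

query = (0, 0)
candidates = [(3, 0), (2.1, 2.1)]

# Nearest neighbour to the query under each metric
nearest_euclidean = min(candidates, key=lambda c: euclidean(query, c))
nearest_manhattan = min(candidates, key=lambda c: manhattan(query, c))

print(nearest_euclidean)  # (2.1, 2.1): closer in straight-line distance
print(nearest_manhattan)  # (3, 0): closer when moving along axes
```

Because a k-NN or K-Means model built on these points would behave differently under each metric, this choice is worth validating on your own data.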

Hamming Distance

Hamming Distance measures the number of positions at which two strings of equal length differ. It is mainly used in error detection and correction algorithms.

Formula

Hamming Distance = Number of differing bits/characters

Example

# Function to calculate Hamming Distance between two equal-length strings
def hamming_distance(str1, str2):
    if len(str1) != len(str2):
        raise ValueError("Strings must be of equal length")
    # Count positions where the characters differ
    return sum(c1 != c2 for c1, c2 in zip(str1, str2))

# Example usage
str1 = "1101"
str2 = "1001"
print(hamming_distance(str1, str2))


Output:

1

Euclidean Distance

Euclidean Distance is the most commonly used metric for calculating the straight-line distance between two points in a multi-dimensional space.

Formula

d(P, Q) = √((q1 − p1)² + (q2 − p2)² + ... + (qn − pn)²)

Where:

  • P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn) are two points in n-dimensional space.
     
  • d(P,Q) is the Euclidean distance.
     
  • (qi−pi) represents the difference between corresponding coordinates.

Example

from math import sqrt

# Straight-line distance between two points of equal dimension
def euclidean_distance(point1, point2):
    return sqrt(sum((x - y) ** 2 for x, y in zip(point1, point2)))
# Example usage
p1 = (3, 4)
p2 = (7, 1)
print(euclidean_distance(p1, p2))


Output:

5.0

Manhattan Distance (Taxicab or City Block)

Manhattan Distance calculates the total absolute difference between the coordinates of two points.

Formula

For two points P and Q in an n-dimensional space, the Manhattan distance is given by:

d(P, Q) = ∣q1 − p1∣ + ∣q2 − p2∣ + ... + ∣qn − pn∣

Where:

  • P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn) are two points in n-dimensional space.
     
  • d(P,Q) is the Manhattan distance.
     
  • ∣qi−pi∣ represents the absolute difference between corresponding coordinates.

Example

# Sum of absolute coordinate differences (city-block distance)
def manhattan_distance(point1, point2):
    return sum(abs(x - y) for x, y in zip(point1, point2))
# Example usage
p1 = (1, 2)
p2 = (4, 6)
print(manhattan_distance(p1, p2))


Output:

7

Minkowski Distance

Minkowski Distance is a generalized distance metric that includes both Euclidean and Manhattan distances as special cases.

Formula

For two points P and Q in an n-dimensional space, the Minkowski distance is given by:

d(P, Q) = (∣q1 − p1∣^p + ∣q2 − p2∣^p + ... + ∣qn − pn∣^p)^(1/p)

Where:

  • P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn) are two points.
     
  • p is the order of the distance metric (p = 1 yields Manhattan distance, p = 2 yields Euclidean distance).

Example

# Generalized distance; p = 1 gives Manhattan, p = 2 gives Euclidean
def minkowski_distance(point1, point2, p):
    return sum(abs(x - y) ** p for x, y in zip(point1, point2)) ** (1 / p)
# Example usage
p1 = (1, 2)
p2 = (4, 6)
print(minkowski_distance(p1, p2, 3))


Output:

4.3267487109222245
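To check that Minkowski distance really generalizes the earlier metrics, a quick self-contained sketch (redefining the same function) evaluates it at p = 1 and p = 2 on the same pair of points:

```python
def minkowski_distance(point1, point2, p):
    return sum(abs(x - y) ** p for x, y in zip(point1, point2)) ** (1 / p)

p1, p2 = (1, 2), (4, 6)
print(minkowski_distance(p1, p2, 1))  # 7.0, same as the Manhattan example
print(minkowski_distance(p1, p2, 2))  # 5.0, the Euclidean distance between these points
```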

Mahalanobis Distance

Mahalanobis Distance accounts for correlations between variables and is used in multivariate anomaly detection.

Formula

d(P, Q) = √((P − Q)ᵀ S⁻¹ (P − Q))

Where:

  • P and Q are vectors (points).
     
  • S−1 is the inverse of the covariance matrix.
     
  • Unlike Euclidean distance, Mahalanobis distance considers the shape of the data distribution.

Example

import numpy as np

# Mahalanobis distance: Euclidean distance after scaling by the inverse covariance
def mahalanobis_distance(x, y, cov_matrix):
    x, y = np.array(x), np.array(y)
    diff = x - y
    return np.sqrt(np.dot(np.dot(diff.T, np.linalg.inv(cov_matrix)), diff))
# Example usage
x = [2, 3]
y = [6, 8]
cov_matrix = np.array([[1, 0], [0, 1]])
print(mahalanobis_distance(x, y, cov_matrix))


Output:

6.4031242374328485
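With the identity covariance used in the example, Mahalanobis distance reduces to the plain Euclidean distance √41. A small sketch (covariance values chosen arbitrarily for illustration) shows how a non-identity covariance changes the result:

```python
import numpy as np

def mahalanobis(x, y, cov):
    diff = np.array(x) - np.array(y)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

x, y = [2, 3], [6, 8]

# Identity covariance: identical to the Euclidean distance sqrt(41)
print(mahalanobis(x, y, np.eye(2)))
# A variance of 4 along the first axis down-weights that coordinate: sqrt(29)
print(mahalanobis(x, y, np.array([[4.0, 0.0], [0.0, 1.0]])))
```

Directions with high variance contribute less to the distance, which is exactly why the metric suits anomaly detection on correlated data.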

Cosine Similarity

Cosine Similarity measures the cosine of the angle between two vectors and is widely used in text mining and recommendation systems.

Formula

cos(θ) = (A · B) / (∣∣A∣∣ × ∣∣B∣∣)

Where:

  • A · B is the dot product of vectors A and B.
     
  • ∣∣A∣∣ and ∣∣B∣∣ are the magnitudes (norms) of the vectors.

Example

from numpy import dot
from numpy.linalg import norm

# Cosine of the angle between two vectors (1 means identical direction)
def cosine_similarity(vec1, vec2):
    return dot(vec1, vec2) / (norm(vec1) * norm(vec2))
# Example usage
vec1 = [1, 2, 3]
vec2 = [4, 5, 6]
print(cosine_similarity(vec1, vec2))


Output:

0.9746318461970762

Frequently Asked Questions

Which distance measure is best for categorical data?

Hamming Distance is best for categorical data as it counts differences in corresponding positions of strings or binary data.
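As a small sketch (the records below are made up), the same position-wise count works on tuples of categorical attributes, not just bit strings:

```python
def hamming_distance(record1, record2):
    # Count attribute positions where the two records disagree
    return sum(a != b for a, b in zip(record1, record2))

r1 = ("red", "small", "round")
r2 = ("red", "large", "round")
print(hamming_distance(r1, r2))  # 1, since only the size attribute differs
```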

How does Euclidean Distance differ from Manhattan Distance?

Euclidean Distance calculates the shortest straight-line distance, whereas Manhattan Distance sums up the absolute differences in coordinates.

When should I use Cosine Similarity instead of Euclidean Distance?

Cosine Similarity is preferred for text data and high-dimensional spaces where the magnitude of vectors is less important than their direction.
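A short sketch (arbitrary vectors) makes the contrast concrete: scaling a vector moves it far away in Euclidean terms but leaves its cosine similarity unchanged, which is why document vectors of very different lengths can still match:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a  # same direction, ten times the magnitude

print(cosine_similarity(a, b))       # approximately 1.0: direction is identical
print(float(np.linalg.norm(a - b)))  # approximately 33.67: Euclidean distance is large
```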

Conclusion

In this article, we discussed distance-based algorithms in data mining, which are used to measure the similarity or dissimilarity between data points. These algorithms, such as K-Nearest Neighbors (KNN) and K-Means clustering, rely on distance metrics like Euclidean, Manhattan, or Minkowski distance to group or classify data. These methods are essential for tasks like classification, clustering, and anomaly detection in data mining.
