Table of contents
1. Introduction
2. How the K-means algorithm works
3. Understanding K-means through an Example
3.1. Step 1: Import the required libraries
3.2. Step 2: Getting Random dataset
3.3. Step 3: Processing the Data and Computing K-means clustering
3.4. Step 4: Finding the centroid of clusters
3.5. Step 5: Testing the algorithm
4. Disadvantages in Using K-Means
5. Key Takeaways
Last Updated: Mar 27, 2024

Understanding K-means Clustering

Author Anant Dhakad

Introduction

K-means clustering is one of the simplest and most widely used unsupervised machine learning algorithms. Unsupervised algorithms make inferences from datasets using only the input vectors, without referring to known, or labeled, outcomes.

The goal of K-means is straightforward: group comparable data points together to uncover hidden patterns. K-means searches a dataset for a fixed number (k) of clusters to achieve this goal.

A cluster is a collection of data points that have been grouped together because of certain similarities. You define a target number, k, which is the number of centroids you need in the dataset.

A centroid is an imaginary or real location that represents the cluster's center. Each data point is assigned to one of the clusters in a way that reduces the within-cluster sum of squares, that is, the total squared distance between the points and the centroid of their cluster.

The K-means algorithm finds k centroids and then assigns each data point to the closest cluster while keeping the within-cluster sum of squares as small as possible. The “means” in K-means refers to averaging the data, which is how each centroid is determined.
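To make the sum-of-squares idea concrete, here is a small NumPy sketch. The four points and the two candidate centroids are made-up toy values chosen so the arithmetic is easy to check by hand; they are not part of the tutorial's dataset.

```python
import numpy as np

# Hypothetical toy data: four 2-D points and two candidate centroids.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

# Squared Euclidean distance from every point to every centroid, shape (4, 2).
sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# Each point is assigned to its nearest centroid.
labels = sq_dists.argmin(axis=1)

# Within-cluster sum of squares: total squared distance to the assigned centroid.
wcss = sq_dists[np.arange(len(points)), labels].sum()

print(labels)  # [0 0 1 1]
print(wcss)    # 4.0
```

Every point sits exactly one unit from its centroid, so the sum of squares is 4. Moving a centroid away from the mean of its points would increase this value, which is why the centroid is taken to be the mean.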

How the K-means algorithm works

The K-means algorithm starts with an initial group of randomly selected centroids, which serve as the starting points for the clusters. It then performs iterative (repetitive) calculations to optimize the positions of the centroids.

It stops forming and optimizing clusters when either the centroids have stabilized (their values no longer change from one iteration to the next, so the clustering has converged) or the specified maximum number of iterations has been reached.
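The loop described above can be sketched in plain NumPy. This is only a minimal illustration, not scikit-learn's implementation; the function name kmeans_sketch, its parameters, and the seeded demo data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Illustrative K-means loop; hypothetical helper, not a library function."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        # (a cluster with no points keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop early once the centroids have stabilized.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment so the labels match the returned centroids.
    sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, sq_dists.argmin(axis=1)

# Demo on two well-separated groups of 50 points each (assumed, seeded data).
rng = np.random.default_rng(42)
data = np.vstack([-2 * rng.random((50, 2)), 1 + 2 * rng.random((50, 2))])
centers, labels = kmeans_sketch(data, k=2)
print(centers)
```

The two stopping conditions from the text appear as the early break (centroids stabilized) and the max_iter bound on the loop.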

Understanding K-means through an Example

Let’s work through an example using a randomly generated dataset and the scikit-learn library.

Step 1: Import the required libraries

import numpy as np
import matplotlib.pyplot as plt
# Jupyter/IPython magic; remove this line when running as a plain Python script
%matplotlib inline


The code above imports the following libraries:

  1. NumPy: to perform efficient calculations on matrices.
  2. matplotlib: to help visualize the data.

Step 2: Getting Random dataset

points = -2 * np.random.rand(100, 2)   # 100 points with coordinates in (-2, 0]
tmp1 = 1 + 2 * np.random.rand(50, 2)   # 50 points with coordinates in [1, 3)
points[50:100, :] = tmp1               # overwrite the last 50 rows with the second group

# Display the generated dataset
plt.scatter(points[:, 0], points[:, 1], s=50, c='b')
plt.show()


The code above generates 100 random data points split into two groups of 50 each. It also plots the generated data so that we can see the clusters. Here is how the data points look in a 2D plane.
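Because the data is random, every run produces different points. If you want repeatable results, one option is to use a seeded NumPy generator. The sketch below is a variant of the step above, not the tutorial's own code; the seed value and the names group_a and group_b are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # the seed value is an arbitrary choice

group_a = -2 * rng.random((50, 2))     # coordinates in (-2, 0]
group_b = 1 + 2 * rng.random((50, 2))  # coordinates in [1, 3)
points = np.vstack([group_a, group_b])  # first 50 rows: group_a, last 50: group_b

print(points.shape)  # (100, 2)
```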

 

Step 3: Processing the Data and Computing K-means clustering

from sklearn.cluster import KMeans

Kmean = KMeans(n_clusters=2)  # k = 2, matching the two groups generated above

# Compute k-means clustering.
Kmean.fit(points)


Here we use scikit-learn's KMeans class: calling fit runs the algorithm on our data and stores the resulting centroids and labels on the Kmean object.
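After fitting, the estimator also records how the run went. The sketch below reads the n_iter_ and inertia_ attributes and recomputes the inertia by hand. The seeded dataset stands in for the tutorial's random data, and n_init and random_state are set explicitly only to make the run repeatable; both are assumptions of this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed stand-in for the tutorial's data: two seeded groups of 50 points.
rng = np.random.default_rng(0)
points = np.vstack([-2 * rng.random((50, 2)), 1 + 2 * rng.random((50, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(model.n_iter_)   # iterations used by the best run
print(model.inertia_)  # within-cluster sum of squares of the final clustering

# Recompute the same quantity by hand from labels_ and cluster_centers_.
manual = ((points - model.cluster_centers_[model.labels_]) ** 2).sum()
print(np.isclose(manual, model.inertia_))
```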

 

Step 4: Finding the centroid of clusters

The cluster_centers_ attribute of KMeans gives the centers of the clusters.

# Coordinates of cluster centers.
Kmean.cluster_centers_

 

Here is the output for our data. Because the dataset is random, the exact values will differ from run to run.

array([[-1.03169853, -0.88661647],
       [ 1.978359  ,  1.9484259 ]])

 

Let’s use matplotlib and visualize the centroids along with other data points on a 2D plane.

centroid0 = Kmean.cluster_centers_[0]
centroid1 = Kmean.cluster_centers_[1]

# Display the cluster centroids (yellow and green squares) over the data points
plt.scatter(points[:, 0], points[:, 1], s=50, c='b')
plt.scatter(centroid0[0], centroid0[1], s=200, c='y', marker='s')
plt.scatter(centroid1[0], centroid1[1], s=200, c='g', marker='s')
plt.show()


Output

 

Step 5: Testing the algorithm

Let’s display the label of each data point that was in our dataset. 

# Labels of each point
Kmean.labels_


Output

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

 

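From the output, we can see that the first 50 points are in cluster 0 and the next 50 are in cluster 1, which matches the way the dataset was built and the figure in which we plotted the data points on a 2D plane. Note that the cluster numbers themselves are arbitrary: on another run the two groups could receive the opposite labels.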
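Since cluster numbers carry no meaning of their own, a check against the known grouping should accept either numbering. The sketch below shows one way to do that for two clusters. The seeded dataset, the true_group array, and the explicit n_init and random_state values are assumptions of this example rather than part of the tutorial's code.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed stand-in data: 50 points per group, generated like the tutorial's.
rng = np.random.default_rng(1)
points = np.vstack([-2 * rng.random((50, 2)), 1 + 2 * rng.random((50, 2))])
true_group = np.repeat([0, 1], 50)  # how the data was generated

labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit(points).labels_

# Cluster ids are arbitrary, so accept either numbering when comparing.
agreement = max(np.mean(labels == true_group), np.mean(labels == 1 - true_group))
print(agreement)
```

For groups as well separated as these, the agreement should come out as 1.0.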

 

Now let us take an arbitrary point and predict which cluster it is closest to.

data_point = np.array([-2.0, -2.0])
# predict expects a 2-D array of shape (n_samples, n_features)
test_points = data_point.reshape(1, -1)

# Predict the closest cluster for each sample in test_points
Kmean.predict(test_points)


Output


array([0])

 

The output shows that (-2, -2) belongs to cluster 0, i.e. it is nearer to the yellow centroid.
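This prediction can be checked by hand: the predicted cluster is simply the one whose centroid is nearest to the point. The sketch below uses the centroid coordinates printed in Step 4; your own values will differ because the dataset is random.

```python
import numpy as np

# Centroids copied from the output shown in Step 4.
centers = np.array([[-1.03169853, -0.88661647],
                    [ 1.978359,    1.9484259 ]])
test_point = np.array([-2.0, -2.0])

# Euclidean distance from the test point to each centroid.
distances = np.linalg.norm(centers - test_point, axis=1)
nearest = int(distances.argmin())

print(distances.round(2))
print(nearest)  # 0
```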

You can find the complete code here.

Disadvantages in Using K-Means

For data cluster analysis, K-means clustering is a widely used technique.

However, its results can vary considerably with slight alterations in the data, so it often performs worse than more sophisticated clustering algorithms.

Furthermore, K-means treats clusters as roughly spherical and similar in size, which can lower the accuracy of its results when the real clusters do not have that shape.
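One concrete source of such variance is the choice of starting centroids, which K-means picks at random. The toy sketch below runs a bare-bones K-means loop from two different hand-picked starts on the same four points. The lloyd helper, the points, and the starting positions are all made up for illustration, and the helper assumes no cluster ever becomes empty.

```python
import numpy as np

def lloyd(points, centroids, max_iter=100):
    """Plain K-means from given starting centroids (hypothetical helper).

    Assumes every cluster keeps at least one point. Returns (centroids, wcss).
    """
    for _ in range(max_iter):
        sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(len(centroids))])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Sum of squared distances to the assigned centroid at the last assignment.
    wcss = sq_dists[np.arange(len(points)), labels].sum()
    return centroids, wcss

# Four corners of a wide rectangle: the natural split is left pair vs right pair.
points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

_, good_wcss = lloyd(points, np.array([[0.0, 0.5], [4.0, 0.5]]))
_, bad_wcss = lloyd(points, np.array([[2.0, 0.0], [2.0, 1.0]]))

print(good_wcss)  # 1.0
print(bad_wcss)   # 16.0
```

The second start is already stable, so the loop stops immediately with the bottom pair and the top pair as clusters, a much worse grouping than left versus right. Running the algorithm from several starts and keeping the best result is a common way to reduce this risk.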

Key Takeaways

Cheers if you reached here!! In this blog, we used a random dataset to understand how the K-Means clustering algorithm works.

Yet learning never stops, and there is a lot more to learn. Happy Learning!!
