Table of contents
1. Introduction
2. What is KNN?
3. KNN Algorithm
4. Advantages of KNN
5. Disadvantages of KNN
6. What is K-Means?
7. K-Means Algorithm
8. Advantages of K-Means
9. Disadvantages of K-Means
10. Difference Between KNN and K-Means
11. Implementation
12. Importing Necessary Libraries
13. Loading Data
14. Visualization
15. Preprocessing
15.1. Data imputation
15.2. Label Encoding
15.3. Insignificant Features
15.4. Train-Test Split
16. Training
16.1. Results
17. How to Find the Best K?
17.1. For KNN
17.2. For K-Means
18. Frequently Asked Questions
18.1. What is KNN Best For?
18.2. What is the Difference Between Nearest Neighbor and K-Nearest Neighbor?
18.3. Why is KNN Called Lazy Learner?
19. Conclusion
Last Updated: Jun 24, 2024
Easy

Difference between KNN and K-Means

Introduction

In the realm of machine learning, the terms KNN (K-Nearest Neighbors) and K-Means frequently emerge, often leading to confusion due to their similar-sounding names. However, despite sharing the "K" in their names and being fundamental algorithms in the field, KNN and K-Means serve entirely different purposes and operate on distinct principles. The KNN algorithm makes predictions or classifications for new data based on existing labeled data. The K-Means algorithm, on the other hand, looks for patterns or groups within the data: it organizes the data into clusters with similar characteristics without making predictions.

Understanding these differences is crucial for data scientists and machine learning practitioners to apply the right algorithm to the right problem.


What is KNN?

KNN is a supervised machine learning algorithm that is used for classification (and regression) problems. Since it is a supervised algorithm, it uses labeled data to make predictions.

KNN analyzes the 'k' data points nearest to a new point and classifies the new point based on them.

In detail, to label a new point, the KNN algorithm examines the 'k' nearest neighbors, i.e., the 'k' training points closest to the new point, and assigns the label to which the majority of those neighbors belong.


KNN Algorithm

The various steps involved in KNN are as follows:-

  • Choose the value of 'K', where 'K' refers to the number of nearest neighbors of the new data point to be classified.
  • Now, compute the Euclidean distance between the new input (new data point) and all the training data.

Example: Let us assume we have two points, A(X1, Y1) and B(X2, Y2). Then the Euclidean distance between the two points would be the following:-

d(A, B) = √((X2 − X1)² + (Y2 − Y1)²)

  • Sort these distances in ascending order and choose the first ‘K’ minimum distance values. This will give us the ‘K’ nearest neighbors of the new data point. 
  • Now, find out the label/class to which all these neighbors belong. 
  • Find the majority class these neighbors belong to and assign that particular label to the new input. 
  • Finally, return the predicted class of the new data point. 

Note: It is essential to choose an appropriate value of 'K': a very small 'K' can overfit to noise in the training data, while a very large 'K' can underfit.
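To make these steps concrete, here is a minimal from-scratch sketch of the algorithm above using NumPy. Note that knn_predict and the toy arrays are illustrative names made up for this article, not part of any library; the sklearn implementation used later in this blog is the practical choice.

import numpy as np

def knn_predict(train_x, train_y, new_point, k=3):
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((train_x - new_point) ** 2).sum(axis=1))
    # Step 3: indices of the 'k' nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Steps 4-6: majority vote among the neighbors' labels
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# toy example: two well-separated 2-D classes
train_x = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
train_y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(train_x, train_y, np.array([2, 2]), k=3))  # expected: 0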

Advantages of KNN

  • Simplicity and Ease of Implementation: KNN is straightforward to understand and implement, making it an excellent choice for beginners in machine learning.
  • No Training Phase: Unlike many other algorithms, KNN does not require a training phase, which simplifies the process and reduces computation time for model building.
  • Adaptability to Different Data Distributions: KNN can handle multi-class classification problems and works well with various data distributions without making strong assumptions about the underlying data.
  • Versatility in Applications: It is widely used in numerous applications, such as recommendation systems, image recognition, and anomaly detection.

Disadvantages of KNN

  • Computationally Intensive: KNN can be slow and inefficient with large datasets because it requires computing the distance between the query point and all other points in the dataset.
  • Sensitivity to Irrelevant Features: The performance of KNN can degrade if the dataset contains many irrelevant features, as these can distort distance calculations.
  • Memory Usage: Since KNN stores all training data, it can require significant memory, especially with large datasets.
  • Curse of Dimensionality: KNN struggles with high-dimensional data, where the notion of distance becomes less meaningful, leading to poor classification performance.

What is K-Means?

K-Means is an unsupervised machine learning algorithm that is used for clustering problems. Since it is an unsupervised algorithm, it works on unlabelled data, discovering structure rather than predicting labels.

K-Means is a clustering technique that places 'K' cluster centers (centroids) in the data and assigns each unlabelled data point to the centroid it is closest to.

In detail, K-Means divides unlabelled data points into specific clusters/groups of points. As a result, each data point belongs to exactly one cluster, whose members share similar properties.

K-Means Algorithm

The various steps involved in K-Means are as follows:-

  • Choose the 'K' value where 'K' refers to the number of clusters or groups. 
  • Randomly initialize 'K' centroids as each cluster will have one center. So, for example, if we have 7 clusters, then we would initialize seven centroids.
  • Now, compute the Euclidean distance of each data point to all the cluster centers. Based on this, assign each data point to its nearest cluster. This is known as the 'E-Step.'

(The Euclidean distance is computed exactly as in the KNN example above.)

  • Now, update the cluster center locations by taking the mean of the data points assigned. This is known as the 'M-Step.'  
  • Repeat the above two steps until convergence, i.e., until the cluster assignments no longer change. Note that K-Means converges to a local (not necessarily global) optimum.
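The E-step and M-step can be sketched compactly in NumPy. This is an illustrative toy implementation (the function name kmeans and the random initialization scheme are our own choices, not sklearn's), and it does not handle edge cases such as empty clusters:

import numpy as np

def kmeans(X, k=3, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick 'k' random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # E-step: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged (local optimum)
            break
        centroids = new_centroids
    return labels, centroids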

Advantages of K-Means

  • Simplicity and Scalability: K-Means is relatively easy to implement and can handle large datasets efficiently.
  • Speed and Efficiency: The algorithm usually converges quickly, and its cost per iteration is roughly linear in the number of samples, making it suitable for large-scale applications.
  • Adaptable to Various Applications: K-Means is versatile and can be used in numerous clustering applications, such as market segmentation, image compression, and anomaly detection.
  • Interpretability: The resulting clusters from K-Means are often easy to understand and interpret, providing clear insights into the structure of the data.

Disadvantages of K-Means

  • Predefined Number of Clusters: K-Means requires specifying the number of clusters (K) beforehand, which can be challenging without prior knowledge of the data.
  • Sensitivity to Initialization: The final clusters can vary significantly based on the initial placement of centroids, leading to potentially suboptimal solutions.
  • Assumption of Spherical Clusters: K-Means assumes clusters are spherical and equally sized, which may not always align with the true distribution of data.
  • Outliers and Noise Sensitivity: The algorithm is sensitive to outliers and noisy data, which can distort the clustering results.

Difference Between KNN and K-Means

| Parameter | KNN (K-Nearest Neighbors) | K-Means |
|---|---|---|
| Type | Supervised learning | Unsupervised learning |
| Purpose | Classification and regression | Clustering |
| Training phase | None | Requires iterative training |
| Working principle | Based on distances to nearest neighbors | Based on minimizing within-cluster variance |
| Memory usage | High; stores all training data | Lower; only stores cluster centroids |
| Computational cost | High for large datasets | Relatively lower, but depends on number of iterations |
| Sensitivity | Sensitive to irrelevant features and high-dimensional data | Sensitive to initialization, outliers, and the predefined K |
| Applications | Pattern recognition, recommendation systems, anomaly detection | Market segmentation, image compression, anomaly detection |

Implementation

For simplicity, we will use the existing scikit-learn (sklearn) library for the KNN and K-Means implementations.

Importing Necessary Libraries

Firstly, we will load some basic libraries:-

(i) NumPy - for linear algebra.

(ii) Pandas - for data analysis.

(iii) Seaborn - for data visualization.

(iv) Matplotlib - for data visualization.

(v) KNeighborsClassifier - for using KNN.

(vi) KMeans - for using K-Means.

(vii) LabelEncoder - for encoding categorical features.

(viii) classification_report and accuracy_score - for generating the results.

import numpy as np 
import pandas as pd 
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

 

Loading Data

#loading dataset
df = pd.read_csv('iris.csv')

 

Visualization

We visualize the dataset by printing the first ten rows of the data frame. We use the head() function for the same. 

#visualizing dataset
df.head(n=10)

 

Output 

#finding different class labels 
np.unique(df['Species'])

Output

We notice that there are three different classes. 

df.shape

 

Output 

The dataset has 150 examples and 6 columns (including one target variable).

df.info()

Output

  

#finding correlation of the features
#numeric_only skips the non-numeric 'Species' column (needed in recent pandas)
correl = df.corr(numeric_only=True)
sns.heatmap(correl, annot=True)

 

Output 

 

Preprocessing

Data imputation

#checking for Null values
df.isnull().sum()

 

Output

 

 

We observe that the dataset does not contain any Null values. 

Label Encoding

We perform label encoding for converting the categorical feature ‘Species’ into a numerical one. 

#Label Encoding - for encoding categorical features into numerical ones
encoder = LabelEncoder()
df['Species'] = encoder.fit_transform(df['Species'])

 

 

df

 

Output

 

 

#finding different class labels 
np.unique(df['Species'])

 

Output

As noticeable above, all target values are now numerical. 

Insignificant Features

We drop ‘ID’ as this feature is insignificant. 

#dropping the 'Id' column
df = df.drop(['Id'], axis=1)

 

df.shape

 

Output

Now, we have just 150 examples and 5 columns. 

Train-Test Split

Now, we will divide our data into training data and testing data, with a roughly 3:1 train-test split (112 training and 38 testing examples).

#converting dataframe to np array
data = df.values

#the first four columns are the features; the last column (Species) is the target
X = data[:, 0:4]
Y = data[:, -1]

print(X.shape)
print(Y.shape)

#train-test split = 3:1

train_x = X[:112, ]
train_y = Y[:112, ]

test_x = X[112:150, ]
test_y = Y[112:150, ]

print(train_x.shape)
print(train_y.shape)
print(test_x.shape)
print(test_y.shape)

 

Output
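A side note on this split: the Iris dataset is ordered by class, so slicing off the first 112 rows leaves the test set dominated by a single class. In practice, a shuffled (and stratified) split is safer; here is a sketch using sklearn's train_test_split, assuming the X and Y arrays created above:

from sklearn.model_selection import train_test_split

#shuffled, stratified 3:1 split; random_state makes it reproducible
train_x, test_x, train_y, test_y = train_test_split(
    X, Y, test_size=0.25, random_state=42, stratify=Y)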

Training 

We will build our KNN and KMeans models using the sklearn library and then train them. 

# KNN 

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(train_x, train_y)

# training predictions
train_preds= knn.predict(train_x)

# testing predictions
test_preds = knn.predict(test_x)

 

#KMeans

#K-Means is unsupervised: fit() uses only the features (labels are ignored)
kmeans = KMeans(n_clusters=3)
kmeans.fit(train_x)

# training cluster assignments
train_labels = kmeans.predict(train_x)

#testing cluster assignments
test_labels = kmeans.predict(test_x)

 

 

 

Results

Now, we analyze our models and generate the results.

# KNN model accuracy

#training accuracy
print(accuracy_score(train_y, train_preds)*100)
#testing accuracy
print(accuracy_score(test_y, test_preds)*100)

 

Output

 

We notice that we get good results on both training and testing sets for KNN. The training set gives us a score of 99.10, whereas the testing set gives us a score of 97.36.

#KMeans model accuracy

#note: K-Means cluster indices are arbitrary, so these scores are only
#meaningful if each cluster index happens to line up with its class

#training accuracy
print(accuracy_score(train_y, train_labels)*100)
#testing accuracy
print(accuracy_score(test_y, test_labels)*100)

 

Output

We notice that we get good results on both training and testing sets for KMeans too. The training set gives us a score of 99.10, whereas the testing set gives us a score of 94.73. Keep in mind that K-Means cluster indices are arbitrary: in general, each cluster must first be mapped to the class most common among its members before an accuracy like this is meaningful.
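For completeness, here is one way to perform that mapping before scoring; map_clusters_to_classes is an illustrative helper written for this article, not a library function:

import numpy as np

def map_clusters_to_classes(cluster_labels, true_labels):
    # relabel each cluster with the majority true class among its members
    mapped = np.zeros_like(cluster_labels, dtype=int)
    for c in np.unique(cluster_labels):
        mask = cluster_labels == c
        mapped[mask] = np.bincount(true_labels[mask].astype(int)).argmax()
    return mapped

# e.g. accuracy_score(train_y, map_clusters_to_classes(train_labels, train_y))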

How to Find the Best K?

For KNN

  • Cross-Validation: Use cross-validation to test different values of K and select the one that provides the best performance in terms of accuracy, precision, recall, or another relevant metric (see the sketch after this list).
  • Elbow Method: Plot the error rate for different values of K. Look for the "elbow point" where the error rate starts to diminish, indicating a good balance between underfitting and overfitting.
  • Domain Knowledge: Consider the context and specific requirements of your application to select an appropriate K. Domain expertise can provide valuable insights into the best K value.
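A simple cross-validation loop for choosing K might look like the following sketch, assuming the X and Y arrays from the implementation section above:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

#try odd values of K and keep the one with the best mean CV accuracy
scores = {}
for k in range(1, 21, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, Y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])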

For K-Means

  • Elbow Method: Plot the within-cluster sum of squares (WCSS) for different values of K. The point where the rate of decrease sharply slows down (the "elbow") indicates the optimal number of clusters (see the sketch after this list).
  • Silhouette Score: Calculate the silhouette score for different values of K. The value of K that maximizes the silhouette score is typically considered the best choice.
  • Gap Statistic: Compare the total within-cluster variation for different values of K with their expected values under a null reference distribution of the data to determine the optimal number of clusters.
  • Domain Knowledge: Leverage any domain-specific knowledge to inform the selection of K, as certain applications may have known or expected cluster structures.
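Here is a sketch of the elbow and silhouette computations for a feature matrix X (sklearn's inertia_ attribute holds the WCSS):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

#WCSS for the elbow plot, plus the silhouette score, for each candidate K
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_, silhouette_score(X, km.labels_))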

Frequently Asked Questions

What is KNN Best For?

KNN is best for classification and regression tasks, particularly in situations where the decision boundary is complex and non-linear.

What is the Difference Between Nearest Neighbor and K-Nearest Neighbor?

Nearest neighbor refers to using the single closest data point for classification, while K-nearest neighbor uses the majority vote of the K closest points.

Why is KNN Called Lazy Learner?

KNN is called a lazy learner because it does not learn a discriminative function from the training data but memorizes the training dataset instead.

Conclusion

Congratulations on making it this far! This blog provided a fundamental overview of both KNN and K-Means.

We covered data loading, data visualization, data preprocessing, and training. We learned how to visualize data and then, based on this EDA, made key preprocessing decisions, prepared our models for training, and finally generated and analyzed the results.
