Last Updated: Mar 27, 2024

# Applying K-Means on Iris Dataset

## Introduction

In this post, we’ll learn about a well-known machine learning algorithm, K-Means, and apply it to a popular dataset, the Iris Dataset.

In short, K-Means is an unsupervised machine learning algorithm used for clustering. The Iris Dataset is a well-known dataset used to predict the species of an Iris flower from a few measured properties.

## What is K-Means?

K-Means is an unsupervised machine learning algorithm used for clustering problems. Because it is unsupervised, it works on unlabelled data: it groups points by similarity instead of learning from known target values.

K-Means represents each cluster by its center, which is the mean of the points in that cluster, and assigns every data point to the nearest center.

As a result, K-Means divides the unlabelled data points into 'K' clusters, and each data point belongs to exactly one cluster, together with the points that have similar properties.

## K-Means Algorithm

The steps involved in K-Means are as follows:

→ Choose the 'K' value where 'K' refers to the number of clusters or groups.

→ Randomly initialize 'K' centroids, one per cluster. For example, with seven clusters, we would initialize seven centroids.

→ Now, compute the Euclidean distance from each data point to all the cluster centers. Based on this, assign each data point to its nearest cluster. This is known as the 'E-Step.'

Example: Let us assume we have two points, A1(X1, Y1) and B2(X2, Y2). Then the Euclidean distance between the two points would be the following:

$$d(A_1, B_2) = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2}$$

→ Now, update the cluster center locations by taking the mean of the data points assigned. This is known as the 'M-Step.'

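→ Repeat the above two steps until convergence, i.e., until the cluster assignments stop changing. Note that K-Means is only guaranteed to reach a local optimum, which depends on the initial centroids; it is not necessarily the global optimum.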
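To make the steps above concrete, here is a minimal NumPy sketch of the loop. It is an illustration under our own assumptions, not the implementation used later in this post: the function name `kmeans`, the choice of initializing centroids from randomly picked data points, and the toy data are all made up for the example.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Minimal K-Means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points at random as starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # E-Step: Euclidean distance from every point to every centroid,
        # then assign each point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # M-Step: move each centroid to the mean of its assigned points.
        # An empty cluster simply keeps its previous centroid.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy blobs (made-up data for illustration only).
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(20, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(20, 2)),
])
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

On data this cleanly separated, each blob should end up in its own cluster. On harder data, different seeds can give different results, which is the local-optimum behaviour in practice.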

## Iris Dataset

We will be using the Iris Dataset and applying K-Means on the same.

The Iris Dataset is used to predict the species of an Iris flower from a few measured properties. It consists of five features and one target variable:

(i) Id - ID of the flower for differentiating, numerical feature.

(ii) SepalLengthCm - sepal length of the flower, numerical feature.

(iii) SepalWidthCm - sepal width of the flower, numerical feature.

(iv) PetalLengthCm - petal length of the flower, numerical feature.

(v) PetalWidthCm - petal width of the flower, numerical feature.

(vi) Species - iris species, target variable/label.

## Implementation

For simplicity, we will use the existing K-Means implementation from the scikit-learn (sklearn) library.

## Importing Necessary Libraries

First, we will load some basic libraries:

(i) Numpy - for linear algebra.

(ii) Pandas - for data analysis.

(iii) Seaborn - for data visualization.

(iv) Matplotlib - for data visualization.

(v) KMeans - for using K-Means.

(vi) LabelEncoder - for label encoding.

(vii) classification_report - for per-class precision, recall, and F1 scores.

(viii) accuracy_score - for generating model accuracy.

We then load the data into a pandas data frame.
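The imports and the loading step might look like the sketch below. The original data file is not shown in this post, so as an assumption we rebuild an equivalent data frame from scikit-learn's bundled copy of Iris and give it the column names listed earlier; with a local CSV copy, `pd.read_csv` would be the usual route. The species strings here (`setosa`, `versicolor`, `virginica`) come from scikit-learn and may differ from those in a CSV version.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score
from sklearn.datasets import load_iris  # assumption: stand-in for the data file

# Build a data frame with the six columns described above.
iris = load_iris()
df = pd.DataFrame(
    iris.data,
    columns=["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"],
)
df.insert(0, "Id", list(range(1, len(df) + 1)))  # 1-based row identifier
df["Species"] = iris.target_names[iris.target]   # species names as strings
```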

## Visualization

We take a first look at the dataset by printing the first ten rows of the data frame with the head() function.

We notice that there are three different classes present in the dataset.

We observe that our dataset consists of 150 rows and six columns.
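The inspection steps described so far (first ten rows, the classes, and the shape) can be sketched as follows. As before, the data frame is rebuilt from scikit-learn's copy of Iris, which is our assumption rather than the file used originally.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in data frame with the column names used in this post (an assumption).
iris = load_iris()
df = pd.DataFrame(
    iris.data,
    columns=["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"],
)
df.insert(0, "Id", list(range(1, len(df) + 1)))
df["Species"] = iris.target_names[iris.target]

first_rows = df.head(10)                       # first ten rows
species_counts = df["Species"].value_counts()  # one entry per class
n_rows, n_cols = df.shape                      # (rows, columns)

print(first_rows)
print(species_counts)
print(n_rows, n_cols)
```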

From the above, we observe that bigger values are represented with lighter colors. This holds for the colormap used here, where dark cells always stand for smaller values than light cells. A heatmap drawn with a different colormap may reverse the mapping, so it is worth checking the color bar.
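A heatmap like the one discussed above could be produced along these lines. We are assuming here that the heatmap shows the pairwise correlations between the four measurements; the exact quantity plotted originally is not stated. The `Agg` backend is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

# Stand-in for the numerical feature columns (an assumption).
iris = load_iris()
features = pd.DataFrame(
    iris.data,
    columns=["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"],
)

corr = features.corr()              # pairwise Pearson correlations
ax = sns.heatmap(corr, annot=True)  # seaborn's default colormap: lighter = larger
ax.set_title("Feature correlations")
plt.tight_layout()
# plt.show()  # uncomment in an interactive session
```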

Now, we will use Matplotlib for a scatter plot.
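A scatter plot of this kind might be drawn as in the sketch below. Which pair of features was plotted originally is not stated, so petal length against petal width is our assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this in a notebook
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in data frame (an assumption, see the loading step).
iris = load_iris()
df = pd.DataFrame(
    iris.data,
    columns=["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"],
)
df["Species"] = iris.target_names[iris.target]

fig, ax = plt.subplots()
# One scatter series per species so that each class gets its own color.
for species, group in df.groupby("Species"):
    ax.scatter(group["PetalLengthCm"], group["PetalWidthCm"], label=species)
ax.set_xlabel("PetalLengthCm")
ax.set_ylabel("PetalWidthCm")
ax.legend()
# plt.show()  # uncomment in an interactive session
```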

## Preprocessing

### Data imputation

We observe that the dataset does not contain any null values, so no imputation is needed.
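A check of this kind can be sketched by counting missing values per column. The data frame is again our stand-in built from scikit-learn's copy of Iris.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in data frame (an assumption, see the loading step).
iris = load_iris()
df = pd.DataFrame(
    iris.data,
    columns=["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"],
)
df["Species"] = iris.target_names[iris.target]

null_counts = df.isnull().sum()  # number of missing values in each column
print(null_counts)
```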

### Label Encoding

We perform label encoding to convert the categorical target ‘Species’ into a numerical one.
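With scikit-learn's LabelEncoder, this step might look as follows. The species strings in our stand-in data frame are those shipped with scikit-learn and are an assumption about the original file.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

# Stand-in data frame (an assumption, see the loading step).
iris = load_iris()
df = pd.DataFrame(
    iris.data,
    columns=["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"],
)
df["Species"] = iris.target_names[iris.target]

encoder = LabelEncoder()
# Each distinct species string is replaced by an integer code (0, 1, 2).
df["Species"] = encoder.fit_transform(df["Species"])

print(encoder.classes_)        # original names, in the order of their codes
print(df["Species"].unique())
```

Keeping the fitted `encoder` around is useful because `encoder.inverse_transform` turns the integer codes back into species names.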

After encoding, all target values are numerical.

### Insignificant Features

We drop ‘Id’ because it is only a row identifier and carries no information about the species.
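Dropping the column is a one-liner in pandas. The sketch below uses our stand-in data frame.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in data frame (an assumption, see the loading step).
iris = load_iris()
df = pd.DataFrame(
    iris.data,
    columns=["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"],
)
df.insert(0, "Id", list(range(1, len(df) + 1)))
df["Species"] = iris.target

df = df.drop(columns=["Id"])  # remove the identifier column
print(df.columns.tolist())
```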

### Train-Test Split

Now, we will divide our data into training data and testing data. We will use a 3:1 train-test split, i.e., 75% of the rows for training and 25% for testing.
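A 3:1 split can be sketched with scikit-learn's `train_test_split`. The `random_state` value is an arbitrary choice of ours to make the split repeatable, and `load_iris(return_X_y=True)` stands in for the preprocessed features and encoded target.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# X: the four measurements, y: the encoded species (stand-in for the preprocessed data).
X, y = load_iris(return_X_y=True)

# test_size=0.25 gives the 3:1 ratio between training and testing rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```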

## Training

We will build our KMeans model using the sklearn library and then train it on the given iris dataset.
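Training might look like the sketch below. The hyperparameters are our assumptions: `n_clusters=3` matches the three species, while `n_init` and `random_state` are arbitrary values chosen for repeatability. Note that K-Means never sees the labels; only the features are passed to `fit`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in for the preprocessed data and the 3:1 split (assumptions).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X_train)  # unsupervised: labels are not used here

print(kmeans.cluster_centers_)  # one 4-dimensional center per cluster
```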

### Results

Now, we evaluate our model and generate the results.

We notice that we get good results on both training and testing sets. The training set gives us a score of 99.10, whereas the testing set gives us a score of 94.73.

Finally, we will generate a classification report for in-depth analysis.
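One way to compute such scores is sketched below. K-Means numbers its clusters arbitrarily, so cluster 0 need not correspond to encoded species 0. The sketch therefore maps each cluster to the most common true label among its training points before scoring. This majority-vote mapping is our own assumption about how to align the two; the exact evaluation code behind the numbers above is not shown, so the sketch may not reproduce them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Stand-in for the preprocessed data, split, and trained model (assumptions).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_train)

# Map every cluster index to the majority true label of its training points.
train_clusters = kmeans.predict(X_train)
cluster_to_label = {
    int(c): int(np.bincount(y_train[train_clusters == c]).argmax())
    for c in np.unique(train_clusters)
}

def to_labels(clusters):
    """Translate cluster indices into species codes via the mapping above."""
    return np.array([cluster_to_label[int(c)] for c in clusters])

train_pred = to_labels(train_clusters)
test_pred = to_labels(kmeans.predict(X_test))

train_acc = accuracy_score(y_train, train_pred)
test_acc = accuracy_score(y_test, test_pred)
report = classification_report(y_test, test_pred)

print(train_acc, test_acc)
print(report)
```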

An advantage of K-Means is that it is computationally fast. A disadvantage is that it does not work well when the clusters differ considerably in size.

2. What is the importance of clustering in ML?
Clustering helps identify and group similar data points in larger datasets without needing labelled outcomes.

3. What does the ‘K’ in K-Means stand for?
‘K’ refers to the number of clusters in K-means.

## Key Takeaways

Congratulations on making it this far. This blog gave a basic overview of K-Means and applied it to the Iris Dataset.

We covered data loading, data visualization, data preprocessing, and training. We visualized the data, used this EDA to make the preprocessing decisions, prepared the data for training, and finally generated the results.
