Table of contents
1. Introduction
2. What is K-Means?
3. K-Means Algorithm
4. Implementation
5. Importing Necessary Libraries
6. Loading Data
7. Visualization
8. Preprocessing
8.1. Data imputation
8.2. Label Encoding
8.3. Insignificant Features
8.4. Train-Test Split
9. Training
9.1. Results
10. Frequently Asked Questions
11. Key Takeaways
Last Updated: Mar 27, 2024

Applying K-Means on Iris Dataset


Introduction

In this blog, we’ll learn about a very famous machine learning algorithm, K-Means, and a very popular dataset, the Iris Dataset.

 

In short, K-Means is an unsupervised machine learning algorithm used for clustering. The Iris Dataset is a very well-known dataset used to predict the Iris flower species based on a few given properties. 

 

What is K-Means?

K-Means is an unsupervised machine learning algorithm that is used for clustering problems. Since it is an unsupervised machine learning algorithm, it uses unlabelled data to make predictions.

 

K-Means is a clustering technique that measures the distance of each unlabelled data point from the cluster means (centroids) and groups the points into specific clusters accordingly.

 

In detail, K-Means divides unlabelled data points into distinct clusters. As a result, each data point belongs to exactly one cluster, and points within the same cluster share similar properties.

 

 

K-Means Algorithm

The various steps involved in K-Means are as follows:

 

→ Choose the 'K' value, where 'K' refers to the number of clusters or groups.

 

→ Randomly initialize 'K' centroids as each cluster will have one center. So, for example, if we have 7 clusters, we would initialize seven centroids.

 

→ Now, compute the Euclidean distance from each data point to all the cluster centers. Based on this, assign each data point to its nearest cluster. This is known as the 'E-step.'

 

Example: Let us assume we have two points, A(X1, Y1) and B(X2, Y2). Then the Euclidean distance between the two points is:

d(A, B) = √((X2 − X1)² + (Y2 − Y1)²)

For instance, the distance between (1, 2) and (4, 6) is √(3² + 4²) = 5.

 

→ Now, update each cluster center location by taking the mean of the data points assigned to it. This is known as the 'M-step.'

 

→ Repeat the above two steps until convergence, i.e., until the cluster assignments no longer change. Note that K-Means is only guaranteed to reach a local optimum, which can depend on the initial centroids.
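To make the E-step/M-step loop concrete, here is a minimal NumPy sketch of the algorithm described above (an illustrative implementation with our own function and variable names, not the scikit-learn one used later):

import numpy as np

def kmeans_sketch(points, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # randomly pick k distinct points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # E-step: distance from every point to every centroid, then nearest-cluster assignment
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # M-step: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([
            points[assignments == c].mean(axis=0) if (assignments == c).any() else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, assignments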

 

Iris Dataset

 

Link to the dataset: https://www.kaggle.com/pralabhpoudel/iris-classification-report-97-accuracy/data?select=Iris.csv

 

We will be using the Iris Dataset and applying K-Means on the same. 

 

The Iris Dataset helps predict the Iris flower species based on a few given properties. It consists of 5 features and one target variable. 

 

(i) Id - unique ID of each flower, numerical feature.

 

(ii) SepalLengthCm - sepal length of the flower, numerical feature. 

(iii) SepalWidthCm - sepal width of the flower, numerical feature. 

(iv) PetalLengthCm - petal length of the flower, numerical feature.  

(v) PetalWidthCm - petal width of the flower, numerical feature.  

(vi) Species - iris species, target variable / label.

Implementation

For simplicity, we will use the existing K-Means implementation from the sklearn (scikit-learn) library.

 

Importing Necessary Libraries

Firstly, we will load some basic libraries:

 

(i) Numpy - for linear algebra. 

 

(ii) Pandas - for data analysis. 

 

(iii) Seaborn - for data visualization.

 

(iv) Matplotlib - for data visualisation. 

 

(v) KMeans - for using K-Means.

 

(vi) LabelEncoder - for label encoding. 

 

(vii) classification_report - for generating a detailed per-class report (precision, recall, F1-score).

 

(viii) accuracy_score - for generating model accuracy. 

 

import numpy as np 
import pandas as pd 
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

 

Loading Data 

#loading dataset 
df = pd.read_csv('iris.csv')

 

Above, we load the data using pandas. 

Visualization

We visualize the dataset by printing the first ten rows of the data frame. We use the head() function for the same. 

#visualizing dataset
df.head(n=10)

 

Output

 

#finding different class labels 
np.unique(df['Species'])

 

Output

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

We notice that there are three different classes present in the dataset. 

 

df.shape

 

Output

(150, 6)

We observe that our dataset consists of 150 rows and six columns. 

 

df.info()

 

Output

 

 

#finding correlation of features 
correl = df.corr(numeric_only=True)  # correlate numeric columns only; 'Species' is still a string here
sns.heatmap(correl, annot=True)

 

 

Output

 

 

From the above, we observe that, with the default colormap, larger correlation values are shown in lighter colors and smaller values in darker colors. Note that this light/dark relationship depends on the colormap used; since we passed annot=True, the exact correlation value is printed in each cell, so we need not rely on color alone.

 

Now, we will use Matplotlib for a scatter plot. 

ax = df[df.Species=='Iris-setosa'].plot.scatter(x='SepalLengthCm', y='SepalWidthCm',
                                                color='red', label='Iris - Setosa')
df[df.Species=='Iris-versicolor'].plot.scatter(x='SepalLengthCm', y='SepalWidthCm',
                                               color='green', label='Iris - Versicolor', ax=ax)
df[df.Species=='Iris-virginica'].plot.scatter(x='SepalLengthCm', y='SepalWidthCm',
                                              color='blue', label='Iris - Virginica', ax=ax)
ax.set_title("Scatter Plot")
plt.show()

 

Output

 

Preprocessing

Data imputation

#checking for Null values
df.isnull().sum()

 

Output

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

We observe that the dataset does not contain any Null values.

 

Label Encoding

We perform label encoding to convert the categorical feature ‘Species’ into a numerical one.

#Label Encoding - for encoding categorical features into numerical ones
encoder = LabelEncoder()
df['Species'] = encoder.fit_transform(df['Species'])
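As a quick sanity check (our own addition, not part of the original snippet), the fitted encoder exposes the learned mapping through its classes_ attribute: the species at index i is the one encoded as i.

#inspecting the learned mapping
print(encoder.classes_)
# ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica'] -> encoded as 0, 1, 2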

 

 

df

 

Output

 

 

#finding different class labels 
np.unique(df['Species'])

 

Output

array([0, 1, 2])

As noticeable above, all target values are now numerical. 

 

Insignificant Features

We drop ‘Id’ since this identifier carries no predictive information.

#DROPPING ID 
df = df.drop(['Id'], axis=1)

 

df.shape

 

Output

(150, 5)

Train-Test Split

Now, we will divide our data into training data and testing data, using a 3:1 train-test split (112 training rows, 38 testing rows).

#converting dataframe to np array 
data = df.values 

#first four columns are the features; the last column is the 'Species' label (kept out of X)
X = data[:, 0:4]
Y = data[:, -1]

print(X.shape)
print(Y.shape)

#train-test split = 3:1
#note: the rows are ordered by species, so this sequential split is unbalanced;
#a shuffled split (see the sketch below) is usually preferable

train_x = X[:112, ]
train_y = Y[:112, ]

test_x = X[112:150, ]
test_y = Y[112:150, ]

print(train_x.shape)
print(train_y.shape)
print(test_x.shape)
print(test_y.shape)

 

 

Output

(150, 4)
(150,)
(112, 4)
(112,)
(38, 4)
(38,)
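Because the Iris rows are stored sorted by species, the sequential slice above places almost all Iris-virginica rows in the test set. A shuffled, stratified split is a common alternative; here is a minimal sketch using scikit-learn's train_test_split (our own addition, not part of the original walkthrough):

from sklearn.model_selection import train_test_split

#3:1 split, shuffled and stratified so every species appears in both sets
train_x, test_x, train_y, test_y = train_test_split(
    X, Y, test_size=0.25, random_state=42, stratify=Y)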

Training 

We will build our K-Means model using the sklearn library and then fit it on the Iris dataset.

#KMeans

#n_clusters=3 since the dataset has three species;
#a fixed random_state makes the centroid initialization reproducible
kmeans = KMeans(n_clusters=3, random_state=0)
#K-Means is unsupervised: fit() uses only the features, never the labels
kmeans.fit(train_x)

#training predictions
train_labels = kmeans.predict(train_x)

#testing predictions
test_labels = kmeans.predict(test_x)
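One caveat before scoring: K-Means assigns arbitrary cluster indices (0, 1, 2) that need not coincide with the encoded species labels, so comparing them directly with accuracy_score is only meaningful when they happen to line up. A common fix is to relabel each cluster with the majority true label among its training points. A hypothetical helper (our own addition, not from the original post):

def majority_mapping(cluster_ids, true_labels):
    #for each cluster id, find the majority true label among its members
    mapping = {}
    for c in np.unique(cluster_ids):
        values, counts = np.unique(true_labels[cluster_ids == c], return_counts=True)
        mapping[c] = values[np.argmax(counts)]  # majority vote
    return mapping

#learn the mapping on the training set, then apply it to both sets
mapping = majority_mapping(train_labels, train_y)
train_mapped = np.array([mapping[c] for c in train_labels])
test_mapped = np.array([mapping[c] for c in test_labels])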

Results

Now, we evaluate our model and generate the results.

#KMeans model accuracy

#training accuracy
print(accuracy_score(train_y, train_labels)*100)
#testing accuracy
print(accuracy_score(test_y, test_labels)*100)
#(substitute train_mapped/test_mapped if the raw cluster ids do not align with the labels)

 

 

Output

 

 

We notice that we get good results on both the training and testing sets: the training set gives us a score of 99.10, whereas the testing set gives us a score of 94.73. Keep in mind that these figures assume the cluster indices line up with the encoded class labels, as discussed above.

 

Finally, we will generate a classification report for in-depth analysis. 

#classification report for training set 
print(classification_report(train_y, train_labels))

 

Output
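If the raw cluster ids do not align with the species labels, pass the remapped predictions from the helper above instead, e.g., print(classification_report(train_y, train_mapped)).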

 

Frequently Asked Questions

  1. What are the advantages and disadvantages of K-Means?
    An advantage of K-Means is that it is computationally very fast. A disadvantage is that it does not work well with clusters of different sizes. 
     
  2. What is the importance of clustering in ML?
    Clustering helps identify and group similar data points in larger datasets without concern for the specific outcome.
     
  3. What does the ‘K’ in K-Means stand for?
    ‘K’ refers to the number of clusters in K-Means.
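A natural follow-up question is how to choose ‘K’ when it is not known in advance. A common heuristic is the elbow method: run K-Means for several values of K and look for the ‘elbow’ where the inertia (within-cluster sum of squares) stops dropping sharply. A minimal sketch using the scikit-learn API (our own addition):

#elbow method: plot inertia against K
inertias = []
k_values = range(1, 10)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=0).fit(train_x)
    inertias.append(km.inertia_)  # lower inertia = tighter clusters

plt.plot(k_values, inertias, marker='o')
plt.xlabel('K (number of clusters)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()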

Key Takeaways

Congratulations on making it this far! This blog discussed a fundamental overview of K-Means along with the Iris Dataset.

We learned about data loading, data visualization, data preprocessing, and training. We saw how to visualize the data, used this EDA to make key preprocessing decisions, made our model ready for training, and finally generated its results.

If you are preparing for the upcoming campus placements, don’t worry; Coding Ninjas has your back. Visit this link for a carefully crafted and designed course on campus placements and interview preparation.
