Working
The KNN method is mainly used as a classifier, as previously stated. Let's look at how KNN classifies unknown data points. Unlike artificial neural network classification, k-nearest neighbours classification is straightforward to understand and implement. It suits problems where the classes are well separated or where the boundary between them is non-linear.
Algorithm
The following algorithm can be used to describe how K-NN works:
- Decide on the number of neighbours (K).
- Compute the Euclidean distance from the new data point to every point in the training data.
- Using these distances, find the K closest neighbours.
- Count the number of data points in each category among these K neighbours.
- Assign the new data point to the category with the greatest number of neighbours.
- The new data point is now classified.
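The steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function name `knn_predict` and the toy points and labels are made up for this example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Count the labels among those neighbours and return the most common one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two small clusters labelled "A" and "B".
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, query=(2.0, 2.0), k=3))
print(knn_predict(points, labels, query=(6.0, 6.5), k=3))
```

Note that nothing is learned ahead of time here: all the work happens when a prediction is requested.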
Example
Let's assume we have a new data point that needs to be placed in the appropriate category.
First, we choose the number of neighbours; here, k = 3. We then calculate the Euclidean distance from the new data point to the existing data points. The Euclidean distance is the straight-line distance between two points, as learned in geometry. Suppose $m = (x_1, y_1)$ and $n = (x_2, y_2)$ are two points in a plane; the distance between them can be calculated as:
$$d(m, n) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
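As a quick check of the distance calculation, here is a small sketch in Python. The helper name `euclidean_distance` and the two sample points are chosen only for illustration.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


m = (1.0, 2.0)
n = (4.0, 6.0)
print(euclidean_distance(m, n))  # 5.0, since sqrt(3^2 + 4^2) = 5
```

The same function works for points with more than two coordinates, such as the four measurements of an iris sample.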
Suppose that computing the Euclidean distances gave three nearest neighbours: two in category A and one in category B.
Because the majority of the three nearest neighbours (two of them) belong to category A, the new data point is assigned to category A.
Implementation
The code below uses the scikit-learn library to implement KNN on the Iris dataset. The dataset contains 50 samples for each of three Iris species (150 in total). For each sample, we have the sepal length, sepal width, petal length and petal width, plus the species name (class/label).
Step 1 (Load the iris dataset and check the data)
First, we create an iris Bunch object using the load_iris function from the scikit-learn datasets module. Then we print the data to check that it has been loaded.
#program for knn classifier iris dataset using sklearn
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
#importing iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
print(iris.data)
Step 2 (Check for the features and target)
Now we can inspect the attributes available in the iris dataset. We print the feature names, then the targets, which are integers representing the iris species:
- 0: Setosa
- 1: Versicolor
- 2: Virginica
Finally, we print the target names.
#names of the features
print(iris.feature_names)
#print the target
print(iris.target)
#names of the target_names
print(iris.target_names)
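For a more readable view of the features and targets together, one option is to put them in a pandas DataFrame. This is an optional sketch, not part of the original steps; the column name `species` is an arbitrary choice.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# One row per sample, one column per feature.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer targets (0, 1, 2) to their species names.
df["species"] = iris.target_names[iris.target]

print(df.head())
print(df["species"].value_counts())
```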
Step 3 (Split the dataset into training and testing data)
Training and testing on the same data gives an overly optimistic picture of the model, so we partition the data into two sets: training and testing. We use the train_test_split function to split the data. The optional parameter test_size determines the proportion of the data held out for testing (0.2 here, i.e. 20%). The random_state parameter ensures that the data is split in the same way each time the program is executed. Because we're training and testing on distinct sets of data, the testing accuracy will better indicate how well the model will perform on new data.
#split the data into training and testing data
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=3)
#shape of train and test objects
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
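As an optional variation (not used in the rest of this walkthrough), train_test_split also accepts a stratify argument, which keeps the class proportions the same in both sets. The sketch below assumes the same test_size and random_state as above and simply counts the species in the test set.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# stratify=iris.target keeps the three species balanced in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=3, stratify=iris.target
)

print(X_train.shape, X_test.shape)
print(Counter(y_test.tolist()))
```

With 150 balanced samples and a 20% test set, each species should contribute 10 samples to the test data.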
Step 4 (Training and Testing the Classifier)
In our example, we create an instance (knn) of the class KNeighborsClassifier; that is, we construct an object called knn that knows how to perform KNN classification once the data is provided. The n_neighbors parameter is the tuning parameter/hyperparameter (k).
The fit method trains the model on the training data (X_train, y_train), while the predict method generates predictions for the testing data (X_test). Because determining the best K value is crucial, we use a for loop to fit and test the model for various K values (from 1 to 25) and record the testing accuracy for each K in a list (scores).
k_range = range(1,26)
scores = []
for k in k_range:
    #create a knn classifier
    knn = KNeighborsClassifier(n_neighbors=k)
    #train the model
    knn.fit(X_train, y_train)
    #predict the response
    y_pred = knn.predict(X_test)
    #compute the accuracy of the model
    scores.append(knn.score(X_test, y_test))
#print the accuracy
print('Scores:',scores)
#print the highest accuracy
print('Max Score:')
print(max(scores))
#print the value of k with the highest accuracy
print('Value of k with Highest Accuracy:')
print(scores.index(max(scores))+1)
#plot the accuracy
import matplotlib.pyplot as plt
plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing Accuracy')
plt.show()
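Accuracy measured on a single train/test split can vary with how the data happened to be divided. A common alternative, shown here as a hedged sketch rather than as part of the original walkthrough, is to compare K values using k-fold cross-validation with scikit-learn's cross_val_score. The choice of 5 folds is an assumption made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

k_range = range(1, 26)
cv_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds for this value of k.
    fold_scores = cross_val_score(knn, iris.data, iris.target, cv=5, scoring="accuracy")
    cv_scores.append(fold_scores.mean())

best_k = k_range[cv_scores.index(max(cv_scores))]
print("Best k by 5-fold cross-validation:", best_k)
print("Mean accuracy:", max(cv_scores))
```

Every sample is used for testing exactly once across the folds, so the averaged score depends less on one particular split.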
Real-World Application of KNN Algorithm
Some real-world applications of the KNN algorithm are:
- Agriculture
KNN models have been used to classify groundwater-level data. This can help in analysing past groundwater levels and forecasting future levels.
- Healthcare
It is used in healthcare in various ways. For instance, it can predict whether a patient admitted to hospital for a heart attack will suffer another one, based on the patient's demographics, diet, and clinical data. It can likewise use clinical and demographic data to determine the risk factors for prostate cancer.
- Recommendation system
KNN isn't ideal for high-dimensional data, but it's a good starting point for recommendation systems of the kind that Netflix, Amazon, YouTube, and many other companies use to give their customers personalised recommendations.
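To make the recommendation idea concrete, here is a toy sketch using scikit-learn's NearestNeighbors to find the item most similar to a given one. The ratings matrix and film names are entirely hypothetical, and cosine distance is one possible choice of metric; real recommendation systems are considerably more involved.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical ratings: rows are items, columns are users (0 = not rated).
item_names = ["Film A", "Film B", "Film C", "Film D"]
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
])

# Ask for 2 neighbours because the closest item to "Film A" is "Film A" itself.
model = NearestNeighbors(n_neighbors=2, metric="cosine")
model.fit(ratings)

distances, indices = model.kneighbors(ratings[[0]])
similar = [item_names[i] for i in indices[0] if i != 0]
print(similar)
```

Here "Film B" is suggested for someone who liked "Film A", because the two were rated similarly by the same users.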
Frequently Asked Questions
What does the letter "K" mean in the KNN Algorithm?
“K” is the number of nearest neighbours used to predict the class of a new, unlabelled data point.
When should we refrain from using KNN?
KNN does not perform well with high-dimensional data. As the number of dimensions increases, computing distances becomes more expensive, and the distances themselves become less informative because points tend to end up almost equally far from one another (the "curse of dimensionality").
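The effect of dimensionality on distances can be seen with a small experiment. The sketch below draws uniformly random points (an assumption made purely for illustration) and compares the nearest and farthest distance from a random query point. A ratio close to 1 means the "nearest" neighbour is barely closer than the farthest one.

```python
import numpy as np

rng = np.random.default_rng(0)


def nearest_to_farthest_ratio(n_dims, n_points=500):
    """Ratio of the nearest to the farthest distance from one random query point."""
    data = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(data - query, axis=1)
    return distances.min() / distances.max()


for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(nearest_to_farthest_ratio(n_dims), 3))
```

The ratio should grow towards 1 as the number of dimensions increases, which is why nearest-neighbour votes become less meaningful in very high-dimensional spaces.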
Are outliers a factor in KNN?
Yes. Outliers in a dataset have been found to reduce the kNN algorithm's classification accuracy. One way to handle them is to assign each point an outlier score based on rank difference, which takes into account the distance and density of the points in its local neighbourhood.
What is the origin of the name "Lazy Learner" for the KNN Algorithm?
When the KNN algorithm receives the training data, it does not learn or build a model; it simply stores the data. Rather than using the training data to find a discriminative function, it uses instance-based learning and only consults the training data when it needs to make a prediction for unseen data.
Conclusion
In this article, we have extensively discussed the KNN Algorithm and its implementation in Python using the scikit-learn library.
We discussed the need for the K-NN algorithm, its working principle, algorithm, and implementation in 4 steps as below:
- Load the iris dataset and check the data
- Check for the features and target
- Split the data into training and testing data
- Training and Testing the Classifier
We hope that this blog has helped you enhance your knowledge of the K-NN algorithm. If you would like to learn more, check out our articles on ‘Data preprocessing’, ‘Outliers in Data Analysis’, ‘Types of Machine Learning’, ‘Google Colab’, ‘Applications of AutoML’, ‘What is AutoML in Machine Learning’, ‘NumPy Basics’, and ‘Random Generators in NumPy’.