Table of contents
1. Introduction
2. What is Supervised Learning?
3. Types of Supervised Learning
   3.1. Classification
   3.2. Regression
4. What is Classification?
   4.1. Training phase
   4.2. Predicting phase
5. What is a Classification Algorithm?
   5.1. Example 2
6. Learners in Classification Problems
   6.1. Lazy Learners
   6.2. Eager Learners
7. Types of Classification Tasks
   7.1. Binary Classification
   7.2. Multiclass Classification
   7.3. Multi Label Classification
   7.4. Imbalanced Classification
   7.5. Hierarchical Classification
8. Types of ML Classification Algorithms
   8.1. Linear Models
   8.2. Non-linear Models
   8.3. Logistic Regression
   8.4. Support Vector Machines
   8.5. Kernel Support Vector Machines
   8.6. Decision Trees
   8.7. Random Forests
   8.8. Naive Bayes
   8.9. K Nearest Neighbors
9. Evaluating a Classification Model
   9.1. Confusion Matrix
   9.2. ROC AUC Curve
   9.3. Accuracy
   9.4. Precision
   9.5. Log Loss or Cross Entropy Loss
   9.6. Cross Validation
   9.7. Holdout Method
10. How to Select the Best Algorithm
11. Use of Classification Algorithm in Real Life
12. Frequently Asked Questions
   12.1. What is a classification in machine learning?
   12.2. What is classification learning?
   12.3. What is the difference between regression and classification?
   12.4. What is classification in machine learning Python?
13. Conclusion
Last Updated: Mar 27, 2024

Classification Algorithm in Machine Learning

Author: Gaurav Gandhi

Introduction

Classification in machine learning refers to techniques that analyze data patterns and assign categorical labels to new, unseen data. There are multiple classification algorithms that make predictions and classify data into distinct classes based on their features.

[Figure: Classification in machine learning]

We will look into classification in machine learning in more depth.

What is Supervised Learning?

Supervised learning is a machine learning approach where models are trained on labeled data to make predictions by mapping input variables to output variables. Some examples of supervised learning include:

  • Online platforms that recommend movies and products use supervised learning to make personalized predictions based on user behaviour. 
     
  • There are many voice assistants in the market. These voice assistants use supervised learning to recognize and interpret spoken commands, enabling them to respond appropriately.
     
  • Email services use supervised learning to filter out unwanted spam.
     

Types of Supervised Learning

There are two types of supervised learning, namely:

Classification 

Classification in machine learning is a supervised learning task that involves assigning classes to input data.

Regression

Regression is a supervised learning task that predicts continuous numerical variables based on input variables. 

What is Classification?

Classification is a task that involves labelling data into distinct classes on the basis of their features and characteristics. The purpose of a classification algorithm is to accurately predict the class of new, unseen data based on the features learned from the training data. 

The classification task usually comprises two stages: 

Training phase 

In the training phase, the model learns the relationship between the features and the labels of the training data. 

Predicting phase 

The predicting phase predicts the class of new, unseen data. 

In this article, we will look at classification algorithms in detail.

What is a Classification Algorithm?

A classification algorithm is a supervised learning algorithm that predicts which class a data point belongs to, based on patterns learned from labeled data. 

[Figure: Classification algorithm]

We can understand the classification algorithm from the figure shown above. There are two classes, class A and class B, where each class contains objects with similar features that differ from the objects in the other class.

A classifier refers to the algorithm that is used to perform classification on a dataset. We will learn more about classifiers later.

The classification process involves the following:

  • Training the algorithm on a labeled dataset.
     
  • Learning to categorize the training examples into classes.
     
  • Classifying the unseen data into one of the classes.
     

Example 1

Let's say we have a stream of incoming emails, and we want to segregate them into spam and not spam.

Suppose we have training data as:

Email 1: “We have detected a virus in your device”.
Class: spam
Email 2: “Hope you are doing well”.
Class: not spam
Email 3: "Get rid of the virus in your device".
Class: spam
Email 4: “Let’s have a meeting”.
Class: not spam


Now let’s say we get a new mail that says “This virus can kill your device”.

Therefore, we can label the new mail as “spam” based on the training data, because the word “virus” appears in the new mail.
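The same workflow can be sketched in Python. This is a minimal illustration, assuming scikit-learn is available; the tiny dataset is just the four training emails above.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# The labeled training emails from the example above
emails = [
    "We have detected a virus in your device",
    "Hope you are doing well",
    "Get rid of the virus in your device",
    "Let's have a meeting",
]
labels = ["spam", "not spam", "spam", "not spam"]

vectorizer = CountVectorizer()              # turn raw text into word-count features
X_train = vectorizer.fit_transform(emails)

classifier = MultinomialNB()                # a simple probabilistic text classifier
classifier.fit(X_train, labels)

new_mail = ["This virus can kill your device"]
print(classifier.predict(vectorizer.transform(new_mail)))   # expected: ['spam']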

Example 2

Let’s take another example where we decide whether to play cricket or not based on the weather conditions. 

[Figure: Classification algorithm example]

Here, the weather condition is the root node, and it has three branches: “sunny”, “rainy” and “overcast”. Following a branch from the root, the decision tree eventually reaches a leaf node containing the final decision. 

Here, based on the root-to-leaf path, a person can decide whether to play cricket or not.
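Conceptually, each root-to-leaf path is just a rule. A hypothetical hand-written version of such a tree might look like the sketch below; the exact rule on each branch is an assumption made only for illustration.

def play_cricket(outlook: str, windy: bool) -> bool:
    # Root node: check the weather condition first
    if outlook == "overcast":
        return True              # overcast branch: play
    if outlook == "sunny":
        return not windy         # sunny branch: play only if it is not windy
    if outlook == "rainy":
        return False             # rainy branch: do not play
    raise ValueError("unknown outlook")

print(play_cricket("sunny", windy=False))   # True
print(play_cricket("rainy", windy=False))   # False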

Learners in Classification Problems

There are mainly two types of learners in machine learning:

Lazy Learners

Lazy learners simply store the training data and, when test data arrives, classify it based on the most similar stored cases. They are called lazy learners because they spend little time on training and more time on prediction. An example is K nearest neighbors.

Eager Learners

Eager learners, in contrast, build a classification model from the training data before receiving any test data. They spend more time on training and less time on prediction. An example is decision trees.

Types of Classification Tasks

Classification tasks in machine learning are supervised learning tasks that categorize data based on what was learned from the training data. Here are some common types of classification tasks:

Binary Classification

This task involves classification into one of two possible classes. For example, spam or not spam, true or false, etc.

Multiclass Classification

This task involves classification into more than two possible classes. For example, image recognition in the case of cats, dogs, and cows.

Multi Label Classification

This involves assigning multiple labels to each instance. For example, in text categorization, a single document can be assigned several tags.

Imbalanced Classification

In imbalanced classification, the number of instances differs significantly between classes. For example, credit card fraud detection, where fraudulent transactions are far rarer than legitimate ones.

Hierarchical Classification

In hierarchical classification, the classes are organized into a hierarchy. Each instance is classified into a specific class and its ancestors in the hierarchy. For example, in a food chain where a tiger is at the top of the hierarchy, deer is below the tiger, and the grass is at the bottom.

Types of ML Classification Algorithms

Classification Algorithms are divided into two categories:

Linear Models

  • Logistic Regression
  • Support Vector Machines

Non-linear Models

  • Kernel Support Vector Machines
  • Decision Tree
  • Random Forest
  • Naive Bayes
  • K-Nearest Neighbours


We will discuss some of them in detail:

Logistic Regression

Logistic regression is a classification algorithm that models the relationship between the input variables and the probability of belonging to a specific class. For example, logistic regression can predict whether a loan applicant will default based on their credit score and income.
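A minimal sketch of this idea with scikit-learn; the (credit score, income) values below are made up purely for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: [credit_score, income_in_thousands]
X = np.array([[580, 25], [620, 30], [700, 60], [750, 80], [640, 35], [780, 90]])
y = np.array([1, 1, 0, 0, 1, 0])   # 1 = defaulted, 0 = repaid

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

applicant = np.array([[690, 55]])
print(model.predict(applicant))         # predicted class (0 or 1)
print(model.predict_proba(applicant))   # probability of each class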

Support Vector Machines

Support vector machines (SVM) are algorithms that try to draw a line or boundary that separates different groups of data points. The goal is to find the boundary that maximizes the margin, i.e. the space between the boundary and the closest data points of each class. 

For example, suppose we need to classify dogs and cats based on weight, height, ear shape, and tail. SVM can find the best boundary separating the two animals, and new data can then be assigned to one of the two classes.
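A minimal linear SVM sketch for this example; the weight and height measurements are made up.

from sklearn.svm import SVC

# Hypothetical measurements: [weight_kg, height_cm]
X = [[4.0, 25], [5.5, 28], [3.8, 24],      # cats
     [20.0, 50], [25.0, 55], [18.0, 48]]   # dogs
y = ["cat", "cat", "cat", "dog", "dog", "dog"]

model = SVC(kernel="linear")   # a straight separating boundary
model.fit(X, y)

print(model.predict([[22.0, 52]]))   # expected: ['dog']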

Kernel Support Vector Machines

Kernel SVM (Support Vector Machines) is a nonlinear classification technique that transforms the data into a higher-dimensional feature space using kernel functions, where it becomes easier to separate the classes. This allows SVM to effectively handle nonlinear relationships and achieve accurate predictions. 

For example, we can classify different flowers based on petal length and width.
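A minimal kernel SVM sketch on the classic iris flower dataset, using only petal length and width:

from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X = iris.data[:, 2:4]   # petal length and petal width
y = iris.target

model = SVC(kernel="rbf")   # the RBF kernel handles non-linear class boundaries
model.fit(X, y)

print(model.predict([[4.5, 1.5]]))   # predicts one of the three iris species (class index)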

Decision Trees

Decision trees are hierarchical structures that make a decision by asking a sequence of questions about the input features. Suppose you want to determine whether you need to wear a jacket. Using temperature data, the tree could learn a rule such as: if the temperature is less than 10 degrees, wear a jacket; otherwise, don't.
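A minimal decision tree sketch for the jacket example; the temperatures below are made up.

from sklearn.tree import DecisionTreeClassifier

X = [[2], [5], [8], [12], [18], [25]]   # temperature in degrees Celsius
y = [1, 1, 1, 0, 0, 0]                  # 1 = wear a jacket, 0 = no jacket

tree = DecisionTreeClassifier(max_depth=1)   # a single split, roughly around 10 degrees
tree.fit(X, y)

print(tree.predict([[7]]))    # expected: [1]
print(tree.predict([[20]]))   # expected: [0]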

Random Forests 

Random forests combine multiple decision trees to make predictions. The final prediction of a random forest is made by aggregating the predictions of the individual trees.

For example, extending the jacket example above, we can consider factors such as temperature, wind speed, and precipitation; each tree makes its own decision, and the forest combines them to decide whether we need to wear a jacket.
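A minimal random forest sketch for this extended jacket example; the feature values are made up.

from sklearn.ensemble import RandomForestClassifier

# Hypothetical features: [temperature_C, wind_speed_kmh, precipitation_mm]
X = [[2, 20, 0], [5, 30, 2], [8, 10, 5], [15, 5, 0], [22, 15, 0], [28, 5, 1]]
y = [1, 1, 1, 0, 0, 0]   # 1 = wear a jacket, 0 = no jacket

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Each of the 100 trees votes; the aggregated vote is the final prediction.
print(forest.predict([[6, 25, 3]]))   # expected: [1]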

Naive Bayes 

Naive Bayes is a probabilistic classifier based on Bayes' theorem. It assumes that features are independent given the class label, known as the "naive" assumption, and it handles high-dimensional data well. For example, predicting whether an email is spam or fraudulent can be done with naive Bayes using text classification. 

Bayes' theorem is given as:

P(A|B) = P(A) * P(B|A) / P(B)

In this equation,

P(A|B) is the probability of event A occurring given event B happens.

P(A) is the probability of event A.

P(B) is the probability of event B.

P(B|A) is the probability of event B occurring given event A happens.
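A small worked example of the theorem with made-up numbers, where A = "the email is spam" and B = "the email contains the word virus":

p_spam = 0.4              # P(A): 40% of all emails are spam (made-up number)
p_virus_given_spam = 0.5  # P(B|A): half of spam emails mention "virus"
p_virus = 0.25            # P(B): a quarter of all emails mention "virus"

p_spam_given_virus = p_spam * p_virus_given_spam / p_virus   # P(A|B)
print(p_spam_given_virus)   # 0.8 -> an email containing "virus" is 80% likely to be spam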

K Nearest Neighbors

K nearest neighbors (KNN) is a supervised algorithm that classifies new data points based on their similarity to existing labeled data. Simply put, it finds the K closest points in the training data and makes a prediction based on their labels. 

For example, a recommendation system can use KNN to suggest movies similar to the ones you have already watched.
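A minimal KNN sketch with two made-up clusters of points:

from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1],     # class "A" (made-up points)
     [8, 8], [8, 9], [9, 8]]     # class "B"
y = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3)   # the K = 3 nearest neighbours vote
knn.fit(X, y)

print(knn.predict([[2, 2]]))   # expected: ['A'], since the point sits near the first cluster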

Evaluating a Classification Model

Evaluating a classification model means measuring how well it predicts on new, unseen data. We will look at some common metrics.

Confusion Matrix

A confusion matrix summarizes the performance of a classification model by showing the counts of true positive, false positive, true negative, and false negative predictions. The matrix is shown below.

                     Actual Positive     Actual Negative
Predicted Positive   True Positive       False Positive
Predicted Negative   False Negative      True Negative
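A minimal sketch of computing this matrix with scikit-learn, using made-up labels (1 = positive, 0 = negative):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# labels=[1, 0] puts the positive class first, matching the table above.
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
# [[3 1]    3 true positives, 1 false negative
#  [1 3]]   1 false positive, 3 true negatives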

ROC AUC Curve

AUC is the Area Under the Curve, and ROC stands for Receiver Operating Characteristic curve. ROC and AUC evaluate how well a classification model can distinguish between different classes. Higher AUC values indicate better classification ability. 

The ROC curve is plotted with True Positive Rate on the Y-axis and False Positive Rate on X-axis.
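A minimal AUC sketch; note that the score needs predicted probabilities (or scores), not hard class labels. The numbers below are made up.

from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.7, 0.4, 0.3, 0.6]   # predicted probability of the positive class

# 1.0 means perfect separation of the classes, 0.5 is no better than random guessing.
print(roc_auc_score(y_true, y_prob))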

Accuracy

Accuracy measures the proportion of correctly classified instances out of the total number of instances. It may be misleading in the presence of class imbalance. Using the terms from the confusion matrix shown above:

Accuracy = (True Positive + True Negative) / Total Population

Precision

As the name suggests, precision calculates the proportion of true positive predictions (correctly predicted positive instances) out of all positive predictions (both true positives and false positives).

 Precision = True Positive / (True Positive + False Positive)
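A minimal sketch computing both accuracy and precision with scikit-learn, reusing the made-up labels from the confusion-matrix example above:

from sklearn.metrics import accuracy_score, precision_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))    # (TP + TN) / total = (3 + 3) / 8 = 0.75
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3 / (3 + 1) = 0.75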

Log Loss or Cross Entropy Loss

It is a common loss function used for measuring the dissimilarity between predicted probabilities and true labels. In simpler words, it tells us how well a model's predicted probabilities match the true outcomes by penalizing confident but incorrect predictions. A lower log loss indicates a better model. It is calculated as:

Log Loss = -1/n * Σ(y * log(p) + (1 - y) * log(1 - p))

Here, ‘p’ is the predicted probability of the positive class, ‘y’ is the true label (0 or 1), and ‘n’ is the total number of instances.
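A minimal log-loss sketch with made-up probabilities; confident but wrong predictions are penalized heavily.

from sklearn.metrics import log_loss

y_true = [1, 0, 1, 0]
good_probs = [0.9, 0.1, 0.8, 0.3]   # probabilities of the positive class
bad_probs = [0.1, 0.9, 0.2, 0.7]    # confident but mostly wrong

print(log_loss(y_true, good_probs))   # low loss (roughly 0.2)
print(log_loss(y_true, bad_probs))    # much higher loss (roughly 1.9)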

Cross Validation

Cross-validation evaluates the performance of a model by splitting the available data into multiple subsets, called folds, and then iteratively training and testing the model on different combinations of these folds. It helps detect overfitting and gives a more reliable performance estimate than a single split.
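A minimal 5-fold cross-validation sketch on the iris dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # 5 folds: train on 4, test on 1, rotate
print(scores)          # accuracy on each fold
print(scores.mean())   # average accuracy across the folds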

Holdout Method

The holdout method, also known as simple split validation, is a technique that involves splitting the available data into two sets: a training set and a testing set. The model is trained on the training set and then evaluated on the testing set, which consists of unseen data. 

The unseen test set measures the model's ability to generalize after it has been trained on the training set.
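A minimal holdout sketch, keeping 20% of the iris data aside as an unseen test set:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = KNeighborsClassifier()
model.fit(X_train, y_train)           # train only on the training split
print(model.score(X_test, y_test))    # accuracy on the held-out test split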

How to Select the Best Algorithm

  1. Read and understand the problem statement and data characteristics.
     
  2. Create training and testing data.
     
  3. Evaluate the complexity and scalability of the algorithm.
     
  4. Review the algorithm’s performance.
     
  5. Compare multiple algorithms by evaluating each one on the same data (see the sketch after this list).
     
  6. Finally, select the best algorithm.
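A minimal sketch of the comparison step, evaluating a few candidate classifiers with the same cross-validation on the iris dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbours": KNeighborsClassifier(),
}

# Evaluate every candidate on the same folds and pick the best mean accuracy.
for name, model in candidates.items():
    mean_accuracy = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {mean_accuracy:.3f}")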

Use of Classification Algorithm in Real Life

  1. Credit card fraud detection
     
  2. Disease diagnosis in the medical industry
     
  3. Image recognition and object detection
     
  4. Email spam filtering
     
  5. Recommender systems
     
  6. Predicting stock market trends, and many more.

Frequently Asked Questions

What is a classification in machine learning?

Classification in machine learning is the process of categorizing data points into predefined classes or categories based on their features. It's used for tasks like spam detection, image recognition, and sentiment analysis, where algorithms learn to make data-driven predictions about class membership.

What is classification learning?

Classification learning is a type of supervised machine learning where algorithms are trained to assign data points to predefined categories or classes based on their features. It's used for tasks like image recognition, text classification, and fraud detection.

What is the difference between regression and classification?

Regression predicts continuous numeric values, while Classification assigns data to predefined categories. Regression outputs numbers, while Classification assigns labels. They use distinct models, metrics, and are suited for different tasks like predicting prices (Regression) or detecting spam emails (Classification).

What is classification in machine learning Python?

Classification in machine learning with Python is the process of categorizing data into predefined classes or labels based on features. Common libraries like Scikit-Learn provide algorithms for tasks like spam detection, image recognition, and sentiment analysis.

Conclusion

In this article, we discussed classification in machine learning. You can also read the article “Types of Machine Learning” to improve your knowledge of machine learning.

To learn more, check out the other articles on our platform Coding Ninjas Studio.

Refer to our Guided Path to upskill yourself in DSA, Competitive Programming, JavaScript, System Design, and many more! If you want to test your competency in coding, you may check out the mock test series and participate in the contests hosted on Coding Ninjas Studio!

But suppose you have just started your learning process and are looking for questions from tech giants like Amazon, Microsoft, Uber, etc. In that case, you must look at the problems, interview experiences, and interview bundles for placement preparations.

You can also consider our Machine Learning Course to give your career an edge over others.

Happy Learning!
