Table of contents
1. Introduction
2. What are Decision Trees?
3. ID3 Algorithm Steps
4. How Does the ID3 Algorithm Work?
5. Mathematical Concepts of ID3 Algorithm
   5.1. Entropy
   5.2. Information Gain
6. Practical Implementation of ID3 Algorithm
7. Advantages of the ID3 Algorithm
8. Limitations of the ID3 Algorithm
9. Frequently Asked Questions
   9.1. What is the primary purpose of the ID3 algorithm?
   9.2. Why is entropy important in the ID3 algorithm?
   9.3. Can the ID3 algorithm handle continuous data?
10. Conclusion
Last Updated: Sep 13, 2024

ID3 Algorithm in Machine Learning

Author: Pallavi Singh

Introduction

The ID3 (Iterative Dichotomiser 3) Algorithm in Machine Learning is a popular decision tree algorithm used to classify data. It works by selecting the attribute that provides the maximum information gain for splitting the data. 


In this article, we will explain how the ID3 Algorithm in Machine Learning works, using some practical examples. You will learn the key mathematical concepts behind it, which are essential for building decision trees. 

What are Decision Trees?

Decision Trees are popular because they help derive a clear strategy for reaching an end goal. They are structured as a tree, starting from a single node (the root) that branches into possible outcomes or decisions based on certain conditions. This structure supports decision-making by splitting data into smaller subsets, which makes complex decision-making processes more manageable and interpretable.
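As a rough illustration of this structure, the sketch below represents a toy decision tree as nested Python dictionaries; the attribute names and outcomes are made up for the example and are not taken from a real dataset.

Python

# A toy decision tree: the root node tests "Outlook", and each branch leads
# either to another test or to a final decision (a leaf).
toy_tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "Don't Play", "Normal": "Play"}},
        "Overcast": "Play",
        "Rain": {"Windy": {"True": "Don't Play", "False": "Play"}},
    }
}

print(toy_tree["Outlook"]["Overcast"])  # Play

Reading from the root down, every path of answered conditions ends in a decision, which is exactly the kind of structure the ID3 algorithm builds automatically.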

ID3 Algorithm Steps:

The ID3 (Iterative Dichotomiser 3) algorithm is a simple yet powerful method for constructing decision trees. It involves several key steps:

  1. Selecting the Best Attribute: Begin by selecting the attribute that best splits the data into subsets. This is done using a metric called information gain, which measures how well an attribute separates the data into groups with respect to the target attribute.
  2. Tree Construction: Use the best attribute as a decision node and create a branch for each possible value of that attribute. Each branch partitions the data into the subset of instances that have that value.
  3. Recursive Splitting: Repeat the process for each branch using the remaining attributes. Stop if all instances in a branch are the same or no more attributes are available.
  4. Pruning (Optional): Simplify the tree by removing branches that have little effect on the decision-making process to reduce overfitting and improve the model's generalizability.

How Does the ID3 Algorithm Work?

The ID3 algorithm builds a decision tree by selecting the attribute that separates the data into different classes in the best way possible. Here’s a step-by-step overview of how the algorithm works, followed by a small code sketch that ties the steps together:

  • Start with the Entire Dataset: The algorithm begins by considering the entire dataset as a whole.
     
  • Calculate Entropy: Entropy measures the level of uncertainty or impurity in a dataset. It helps to determine how well a dataset is mixed or split between different classes. In decision trees, entropy is used to calculate the best feature to split the data on. 
    The formula for entropy is:

    E(S) = − Σ pᵢ · log₂(pᵢ), where the sum runs over the classes i = 1, …, n,

    pᵢ is the proportion of examples in class i, and n is the total number of classes.
     
  • Determine Information Gain for Each Attribute: Information Gain is the reduction in entropy achieved by splitting the data based on an attribute. The attribute with the highest Information Gain is selected for the split. The formula for Information Gain is:

    Gain(S, A) = E(S) − Σ (|Sv| / |S|) · E(Sv), where the sum runs over every value v of attribute A

    Where:

      • E(S) is the entropy of the entire dataset S,
      • Sv is the subset of S where attribute A has the value v,
      • |Sv| is the size of the subset Sv,
      • |S| is the size of the dataset S,
      • E(Sv) is the entropy of the subset Sv.
     
  • Split the Dataset: The dataset is split based on the chosen attribute, and the process is repeated for each subset until all data points are perfectly classified, or no further splits can be made.
     
  • Create Leaf Nodes: Once the data is fully classified, the nodes at the ends of the branches become leaf nodes, representing the final decision or classification.
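
To tie these steps together, here is a minimal from-scratch sketch of the ID3 loop. It is only an illustrative assumption, not a reference implementation: the helper names (entropy_of, information_gain_of, id3) and the tiny weather-style rows are invented for this example.

Python

import math
from collections import Counter

def entropy_of(labels):
    # Entropy of a list of class labels.
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain_of(rows, labels, attribute):
    # Entropy reduction achieved by splitting the rows on one attribute.
    total = len(labels)
    weighted = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        weighted += (len(subset) / total) * entropy_of(subset)
    return entropy_of(labels) - weighted

def id3(rows, labels, attributes):
    # Recursively build a decision tree as nested dicts.
    if len(set(labels)) == 1:                # all instances agree -> leaf
        return labels[0]
    if not attributes:                       # no attributes left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain_of(rows, labels, a))
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in sorted(set(row[best] for row in rows)):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                remaining)
    return tree

# Toy usage: four rows, two attributes, a Yes/No target
rows = [
    {"Outlook": "Sunny",    "Windy": "False"},
    {"Outlook": "Sunny",    "Windy": "True"},
    {"Outlook": "Overcast", "Windy": "False"},
    {"Outlook": "Rain",     "Windy": "True"},
]
labels = ["No", "No", "Yes", "Yes"]
print(id3(rows, labels, ["Outlook", "Windy"]))

On this toy data, Outlook has the highest information gain, so it becomes the root; every branch is then pure and turns into a leaf, so the printed tree is a single Outlook node with three leaf branches.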

Mathematical Concepts of ID3 Algorithm

The ID3 algorithm relies on two main mathematical concepts: Entropy and Information Gain.

1. Entropy

Entropy measures the level of uncertainty in a dataset. In decision trees, it quantifies the randomness or impurity present. Low entropy means most data points belong to one class, while high entropy shows a mix of classes. For example, if all data points in a dataset are classified as "Yes," the entropy will be zero due to no uncertainty. On the other hand, a 50/50 split between "Yes" and "No" indicates maximum entropy due to higher uncertainty.
Let's calculate entropy for a simple dataset:

Python

import math

def entropy(probabilities):
    # Shannon entropy: sum of -p * log2(p) over the class probabilities,
    # skipping zero probabilities (their contribution is defined as 0).
    return -sum(p * math.log2(p) for p in probabilities if p != 0)

# Example: entropy of a dataset with 9 "Yes" and 5 "No" instances
probabilities = [9/14, 5/14]
print(f"{entropy(probabilities):.3f}")

Output

0.940

2. Information Gain

Information Gain measures how well an attribute separates the data. It is calculated as the difference between the entropy of the dataset before the split and the weighted average of the entropy after the split.
For example, in a dataset where splitting based on the "Outlook" attribute reduces the entropy the most, the ID3 algorithm will select "Outlook" as the root node of the decision tree.
Here's a code snippet to calculate Information Gain:

Python

def information_gain(entropy_before, subsets):
    # subsets: list of (size, class_probabilities) pairs, one per branch of the split.
    total_samples = sum(size for size, _ in subsets)
    weighted_entropy = sum((size / total_samples) * entropy(probs)
                           for size, probs in subsets)
    return entropy_before - weighted_entropy

# Example: Information Gain for splitting the 14-sample dataset by Outlook.
# The three branches have 5, 4 and 5 samples; entropy() is the function defined above.
entropy_before = entropy([9/14, 5/14])
subsets = [(5, [3/5, 2/5]), (4, [4/4]), (5, [2/5, 3/5])]
print(f"{information_gain(entropy_before, subsets):.3f}")


Output

0.247

Practical Implementation of ID3 Algorithm

Now, let's see how to implement the ID3 approach in Python using the sklearn library. sklearn's DecisionTreeClassifier is not a literal ID3 implementation, but setting criterion='entropy' makes it choose splits by information gain in the same spirit. We will use a sample dataset to build a decision tree and classify the data.

Python

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Initialize and fit the decision tree (DecisionTreeClassifier with the entropy criterion)
model = DecisionTreeClassifier(criterion='entropy')
model.fit(X, y)

# Export the decision tree in text format
tree_rules = export_text(model, feature_names=data.feature_names)
print(tree_rules)


Output

|--- petal width (cm) <= 0.80
|   |--- class: 0
|--- petal width (cm) >  0.80
|   |--- petal length (cm) <= 4.95
|   |   |--- class: 1
|   |--- petal length (cm) >  4.95
|   |   |--- class: 2


In this example, we used the Iris dataset to classify different species of iris flowers with an entropy-based decision tree. The rules show how the data is split on the petal width and petal length attributes; note that the output above is abbreviated, and the full tree printed by export_text will usually have a few more levels.
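
As a quick usage check, the fitted tree can classify a new measurement. The snippet below reuses the model and data objects from the code above; the sample values are an assumed example input typical of an Iris setosa flower.

Python

# Classify one hypothetical flower: [sepal length, sepal width, petal length, petal width] in cm
sample = [[5.1, 3.5, 1.4, 0.2]]
predicted = model.predict(sample)
print(predicted)                     # e.g. [0]
print(data.target_names[predicted])  # e.g. ['setosa'], since petal width <= 0.80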

Time Complexity

The time complexity of building the tree is commonly quoted as O(m · n log n), where n is the number of samples and m is the number of attributes: every remaining attribute must be evaluated at each node, and continuous attributes additionally require sorting their values to find candidate split points.

Space Complexity

The space complexity is O(n) due to the storage required for the dataset and the decision tree structure.

Advantages of the ID3 Algorithm

  1. Simplicity: The ID3 algorithm is easy to understand and implement. Its clear and logical structure makes it a good choice for beginners in machine learning.

  2. Interpretability: Decision trees created with the ID3 algorithm are easy to interpret and visualize. The tree structure lets users follow the decision-making process, making it easier to explain the model's predictions.

  3. Efficiency with Small Datasets: The ID3 algorithm performs efficiently on small to medium-sized datasets, providing quick results without requiring extensive computational resources.

  4. Minimal Data Preprocessing: Categorical attributes can be used directly, with no need for feature scaling or normalization; numerical attributes only require a simple discretization step first.

  5. Extensible to Missing Values: Basic ID3 assumes complete data, but its successors (notably C4.5) extend the same framework to handle missing attribute values, so the approach adapts well to imperfect datasets.

Limitations of the ID3 Algorithm

  1. Overfitting: One of the main drawbacks of the ID3 algorithm is its tendency to overfit the data, especially when the training dataset contains noise or is too small. Overfitting occurs when the model becomes too complex and captures noise instead of the underlying patterns.
     
  2. Bias Towards Attributes with More Levels: It tends to favour attributes with a larger number of distinct values, even if they are not the most informative. This bias can lead to suboptimal decision trees.
     
  3. Handling Continuous Data: Although ID3 can be applied to continuous data, it does so by discretizing it into intervals. This process may result in a loss of information and less accurate decision trees compared to algorithms designed to work directly with continuous data (a short discretization sketch follows this list).
     
  4. Scalability: The computational complexity of ID3 increases with the size of the dataset and the number of attributes. As a result, it may become inefficient for very large datasets with many features.
     
  5. Prone to Greedy Splits: The ID3 algorithm uses a greedy approach, selecting the best attribute for each split without considering the overall optimal tree structure. This can lead to suboptimal decision trees that do not generalize well to new data.
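
As mentioned in point 3 above, a common workaround is to discretize a continuous attribute into categorical bins before running ID3. Below is a minimal sketch using pandas; the temperature values and bin labels are made-up examples rather than data from this article.

Python

import pandas as pd

# Discretize a continuous attribute into three equal-width categorical bins
temperature = pd.Series([64, 68, 71, 75, 80, 83, 85])
temperature_bins = pd.cut(temperature, bins=3, labels=["cool", "mild", "hot"])
print(temperature_bins.tolist())  # ['cool', 'cool', 'cool', 'mild', 'hot', 'hot', 'hot']

Each numeric value is replaced by a categorical label, which ID3 can then treat like any other discrete attribute, at the cost of losing the exact ordering information within each bin.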

Frequently Asked Questions

What is the primary purpose of the ID3 algorithm? 

The ID3 algorithm is used to create decision trees that classify data based on the attribute that provides the highest information gain.

Why is entropy important in the ID3 algorithm? 

Entropy measures the impurity in a dataset, helping the ID3 algorithm determine how to split the data for the best classification.

Can the ID3 algorithm handle continuous data? 

The ID3 algorithm works best with categorical data, but continuous data can be handled by discretizing it into categorical intervals.

Conclusion

The ID3 algorithm is a fundamental machine learning algorithm used to create decision trees that classify data. By understanding how entropy and information gain work, you can see how the algorithm splits data to create a decision tree. With this knowledge, you can start using the ID3 algorithm in your machine learning projects to create effective decision trees.

You can also check out our other blogs on Code360.
