## Introduction

In Data Mining, the C4.5 algorithm is utilized as a Decision Tree Classifier, which can be used to decide based on a sample of data (univariate or multivariate predictors).

So, before we jump right into C4.5, let's talk about Decision Trees and how they may be used to classify data. Let's first brush up on our concepts of decision trees, and then we can go into the algorithm.

## Decision Trees

We have all used decision trees many times, and it's nothing but a tree of decisions where the current decision depends on the previous decisions, and there are always a plethora of choices at each level.

The figure below explains it in a nutshell. We have some decisions to make at each node, and our past choices influence the current decision.

### Information Gain

Suppose we have accumulated information through time that allows us to anticipate whether or not something will occur reliably. In that case, the information you have provided about the event you have predicted is not new. However, if the situation deteriorates and an unexpected event occurs, the information is helpful and required.

Let's pretend we're watching a coin toss. Any side of a fair coin has a 50% chance of being expected. If the coin is unfair, and the likelihood of getting a HEAD or TAIL is 1.00, we claim the Entropy is minimal since we can predict the coin toss outcome without any trials.

### Entropy

This is the formula of Entropy and is maximum when we don't know anything about the system, now, more Entropy means less knowledge about the system, so we like to have the Entropy of the system decrease.

Hence when we want to split a node, we would like that decrease of Entropy to be maximum.

Hence, we do this when we split a node in the tree.

We examine each node for the likelihood of splitting. The Information Gain Ratio (m/N = p) and (n/N = q) are the ratios of observations to total observations (m/N = p) and (n/N = q), respectively, where m+n=Nm+n=N and p+q=1p+q=1. If the following node's Entropy is less than the Entropy before splitting, and if this value is the smallest among all conceivable test cases for splitting, the node is divided into its purest elements.

Here is the pseudo-code of the algorithm to build a decision tree.

- Check for the above base cases.
- For each attribute a, find the normalized information gain ratio from splitting on a.
- Let a_best be the attribute with the highest normalized information gain.
- Create a decision node that splits on a_best.
- Recurse on the sublists obtained by splitting on a_best and adding those nodes as children of the node.

The actual algorithm also uses some more concepts like pruning handles continuous and discrete data as well, which are just minor adjustments that are made to the above process of creating a decision tree.

We can implement these algorithms using predefined libraries like Sklearn, which uses similar algorithms to create a decision tree and categorize the input.

Recommended topic: __Machine Learning__