Introduction
Do you ever feel like a person lost in a random forest when it comes to deciding whether to use a decision tree or a random forest? Pun intended! How are decision trees different from random forests? To answer this question, let us dive into the topic: decision trees v/s random forests! But first, we have to know what a decision tree is and what a random forest is!
A decision tree is a supervised machine learning technique. As the name implies, it uses a tree-like flowchart to display the predictions that result from a sequence of feature-based splits. A random forest is also a supervised machine learning technique, and it is built out of decision trees. It uses ensemble learning, a method that solves complicated problems by combining the outputs of several classifiers.
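To make that relationship concrete, here is a minimal sketch (using scikit-learn and a toy dataset from make_classification, chosen purely for illustration) showing that a fitted random forest is literally a collection of fitted decision trees:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# toy dataset, purely for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# a random forest is an ensemble of decision trees
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X, y)

# each fitted estimator inside the forest is an ordinary decision tree
print(len(forest.estimators_))                                     # 10, one per n_estimators
print(isinstance(forest.estimators_[0], DecisionTreeClassifier))   # True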
Decision trees v/s random forests: which one should you choose to solve your regression and classification problems?
Advantages and disadvantages of Decision Trees
Advantages
The advantages of decision trees are as follows:
- They are simple and easy to understand: split the data and create nodes by calculating the Gini index or information gain for each candidate feature, repeating until all the features have been taken into account and you reach a leaf node where the final decision is made (a short sketch of the impurity calculation follows this list).
- They are easy to visualize: a decision tree is just a root node branching out to internal nodes and finally to leaf nodes, after all!
- They are fast and can handle both categorical and numerical data!
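To make the splitting criterion from the first point concrete, here is a minimal sketch of the Gini impurity and the impurity reduction of a candidate split (the quantity a tree tries to maximize when it picks a split). The gini and split_gain helpers and the toy labels below are made up for this illustration; this is not scikit-learn's internal implementation:
import numpy as np

def gini(labels):
    # Gini impurity = 1 - sum of squared class proportions in this node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(parent, left, right):
    # impurity reduction of a candidate split (larger is better)
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# toy labels: this candidate split separates the two classes perfectly
parent = np.array([0, 0, 0, 1, 1, 1])
left, right = parent[:3], parent[3:]
print(gini(parent))                     # 0.5
print(split_gain(parent, left, right))  # 0.5 (child impurity drops to 0)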
Disadvantages
The disadvantages of decision trees are as follows:
- A decision tree is prone to overfitting. That is, it may work well on the training data but make poor predictions on the testing data!
The code below demonstrates an example of overfitting using the decision tree classifier from the Sklearn library on the iris dataset.
# imports
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score
import matplotlib.pyplot as plt
import seaborn as sns
# loading the iris dataset
data = load_iris()
# forming a dataframe with the feature columns
data_df = pd.DataFrame(data.data, columns=['sepal length (cm)',
                                           'sepal width (cm)',
                                           'petal length (cm)',
                                           'petal width (cm)'])
# features
data_df

#target
target_df = pd.DataFrame(data.target, columns=['name'])
target_df
x = data.data
y = data.target
# splitting the iris data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=50, test_size=0.25)
Next, make a function that builds a decision tree, prints the F1 scores for both the training and testing sets, and displays the corresponding confusion matrices. The F1 score is simply the harmonic mean of precision and recall. In this case, we will use the micro-averaged F1 score.
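As a quick reference, F1 = 2 x (precision x recall) / (precision + recall), and with average='micro' the precision and recall are computed globally over all classes, so for a single-label multi-class problem the micro F1 equals plain accuracy. A tiny sanity check with made-up labels, purely for illustration:
from sklearn.metrics import f1_score, accuracy_score

# toy labels, purely for illustration: one of the six predictions is wrong
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

print(f1_score(y_true, y_pred, average='micro'))   # 0.8333...
print(accuracy_score(y_true, y_pred))              # 0.8333..., identical under micro averaging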

def decision_tree_func(x_train, x_test, y_train, y_test):
    decision_tree_clf = DecisionTreeClassifier(random_state=42)
    # fitting the training data into the decision tree classifier
    decision_tree_clf.fit(x_train, y_train)
    # predicting y for both testing and training data
    y_pred = decision_tree_clf.predict(x_test)
    y_train_pred = decision_tree_clf.predict(x_train)
    print("DECISION TREE:")
    print('Training Set Evaluation F1-Score:', f1_score(y_train, y_train_pred, average='micro'))
    print('Testing Set Evaluation F1-Score:', f1_score(y_test, y_pred, average='micro'))
    # confusion matrix for the training data set
    plt.figure()
    ax = sns.heatmap(confusion_matrix(y_train, y_train_pred), annot=True, cmap='Blues')
    ax.set_title('Confusion Matrix for training dataset\n\n')
    ax.set_xlabel('\nPredicted Values')
    ax.set_ylabel('Actual Values')
    # display the visualization of the confusion matrix for the training data set
    plt.show()
    # confusion matrix for the testing data set (a fresh figure so the plots do not overlap)
    plt.figure()
    ax2 = sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='Blues')
    ax2.set_title('Confusion Matrix for testing dataset\n\n')
    ax2.set_xlabel('\nPredicted Values')
    ax2.set_ylabel('Actual Values')
    # display the visualization of the confusion matrix for the testing data set
    plt.show()
Now, the moment of truth:
decision_tree_func(x_train,x_test,y_train,y_test)
You can see that the training F1 score is 1, i.e., the tree fits the training data perfectly, which is the classic sign that the model has overfitted the training data.
In the confusion matrix above, for training data, there are 0 false positives and 0 false negatives for every category (0,1,2). Thus we can conclude that all target values in the training dataset were predicted correctly.
The confusion matrix for the testing data shows that:
- All the 11 values belonging to “category 0” were predicted correctly
- 14 values of “category 1” were predicted correctly, while 1 value was misclassified as “category 2.”
- 11 values of “category 2” were predicted correctly, while 1 value was misclassified as “category 1.”
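If you want to read these false-positive and false-negative counts programmatically instead of off the heatmaps, a minimal sketch like the one below works. The per_class_errors helper is made up for this illustration, and since y_pred is local to decision_tree_func above, you would need to recompute the predictions (or return them from the function) before calling it:
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_errors(y_true, y_pred):
    # column sum minus the diagonal = false positives for each class
    # row sum minus the diagonal    = false negatives for each class
    cm = confusion_matrix(y_true, y_pred)
    fp = cm.sum(axis=0) - np.diag(cm)
    fn = cm.sum(axis=1) - np.diag(cm)
    return fp, fn

# example usage with predictions recomputed outside the function:
# clf = DecisionTreeClassifier(random_state=42).fit(x_train, y_train)
# print(per_class_errors(y_test, clf.predict(x_test)))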
This model still performs well on the testing data because iris is a small dataset with almost no outliers. However, in a later section, you will see how poorly decision trees can perform on datasets with outliers and a vast number of features.
- Pruning is a cumbersome process. Pruning reduces the size of a decision tree by removing non-critical and redundant sections of the tree. It reduces the final classifier's complexity, which improves predictive accuracy by reducing overfitting.
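If you are using scikit-learn, one concrete (though by no means the only) way to prune is cost-complexity pruning via the ccp_alpha parameter of DecisionTreeClassifier. The sketch below reuses the iris split from above; the alpha value is arbitrary and chosen only for illustration:
# unpruned tree (ccp_alpha=0.0 is the default)
full_tree = DecisionTreeClassifier(random_state=42).fit(x_train, y_train)

# cost-complexity pruned tree; the alpha value here is arbitrary, for illustration only
pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=0.02).fit(x_train, y_train)

print(full_tree.tree_.node_count, pruned_tree.tree_.node_count)  # the pruned tree has fewer (or equal) nodes
print(pruned_tree.score(x_test, y_test))                         # test accuracy of the pruned tree
Rather than guessing an alpha, you can call full_tree.cost_complexity_pruning_path(x_train, y_train) to obtain the candidate alpha values and pick one by cross-validation.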