Introduction
Well, we all want things to be more transparent; we all want to know what is going on inside. The same goes for a classification model. We usually depend on performance metrics alone to weigh a model's performance. However, visualizing the classification results has its own charm and gives a clearer picture of how the model separates the classes.
A popular diagnostic for visualizing the decisions made by a classification model is the decision surface, also called the decision boundary. A decision surface is a plot that shows how a fitted machine learning algorithm divides the input feature space by class label.
It is a powerful tool for understanding how a given model arrives at its predictions and where it draws the line between classes.
Decision Surface
Classification in machine learning means training a model to assign class labels to the examples in an input dataset.
Each input feature defines an axis of the feature space. Two features are the minimum required to form a plane, with dots representing input coordinates in that space. If there were three input variables, the feature space would be a three-dimensional volume.
The goal of a classification model is to partition the feature space so that we can decide the class label for any point in it with minimum error.
This partition is drawn by the decision surface, or boundary, which also serves as a demonstrative tool for visualizing a model on a classification predictive modeling task.
Data points lying on one side of the decision surface belong to one class label, while those lying on the other side belong to the other. The decision boundary is created, and can be reshaped, by the model's learning process.
Although the word 'surface' suggests a 2-D feature space, we can still use these methods for more than two features by creating a decision surface for each pair of input features, as the sketch below illustrates.
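As a quick illustration of this pairwise idea (a sketch, not part of this article's pipeline; X and y stand for any feature matrix and label vector):

from itertools import combinations
from sklearn.linear_model import LogisticRegression

# Fit one classifier per pair of input features; each pair
# gets its own 2-D decision surface
for i, j in combinations(range(X.shape[1]), 2):
    pair_model = LogisticRegression(max_iter=1000).fit(X[:, [i, j]], y)
    # Build a grid over features i and j and plot it with contourf(),
    # exactly as shown in the implementation below.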
Now, let's look at the implementation to get a clearer picture. We will use a logistic regression classifier.
Implementation
We will be using the Breast Cancer Wisconsin (Diagnostic) dataset for our work.
Importing all the necessary libraries
import pandas as pd
import numpy as np
pd.set_option("display.max_rows", None, "display.max_columns", None)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import IncrementalPCA
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Incremental Principal Component Analysis (IPCA) projects the features onto two principal components that explain as much variance as possible. We fit the PCA on the training data and apply the same transformation to both the training and test sets.
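The loading and preprocessing code is not shown above, so here is a minimal sketch of one way to produce the x_train_pca, x_test_pca, y_train, and y_test used below. It assumes scikit-learn's built-in copy of the dataset and an illustrative 70/30 split; if you read the CSV with pandas instead, only the loading line changes.

from sklearn.datasets import load_breast_cancer

# Load features and labels (in sklearn's copy, 0 = malignant, 1 = benign)
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set (the 70/30 split and random_state are assumptions)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize, fitting the scaler on the training data only
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

# Project onto the two principal components
pca = IncrementalPCA(n_components=2)
x_train_pca = pca.fit_transform(x_train_scaled)
x_test_pca = pca.transform(x_test_scaled)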
Plotting the Scatter-Plot for both Training and Testing dataset
plt.figure(figsize = (20, 6))
plt.subplot(1, 2, 1)
plt.scatter(x_train_pca[:, 0], x_train_pca[:, 1], c = y_train)
plt.xlabel('Training 1st Principal Component')
plt.ylabel('Training 2nd Principal Component')
plt.title('Training Set Scatter Plot with labels indicated by colors, (0)-Violet, (1)-Yellow')
plt.subplot(1, 2, 2)
plt.scatter(x_test_pca[:, 0], x_test_pca[:, 1], c = y_test)
plt.xlabel('Test 1st Principal Component')
plt.ylabel('Test 2nd Principal Component')
plt.title('Test Set Scatter Plot with labels indicated by colors, (0)-Violet, (1)-Yellow')
plt.show()
We can see the distinction between the two classes and can already imagine the decision surface, perhaps a roughly diagonal line between the two clusters.
We perform a 5-fold grid-search cross-validation of the logistic regression classifier on the training set, as sketched below.
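The grid of C values below is an assumption for illustration, but the model_cv object matches the one queried in the next step.

# Illustrative grid of regularization strengths
params = {'C': [0.01, 0.1, 1, 10, 100]}

model_cv = GridSearchCV(estimator=LogisticRegression(),
                        param_grid=params,
                        scoring='accuracy',
                        cv=5)
model_cv.fit(x_train_pca, y_train)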
Best Hyperparameter from Grid-Search CV performed Above
print(model_cv.best_params_)
Output
{'C': 10}
Re-Training the model with best parameters
model = LogisticRegression(C = 10).fit(x_train_pca, y_train)
We re-train our model with C = 10, the value obtained from hyperparameter tuning.
Predictions
y_train_pred = model.predict(x_train_pca)
y_test_pred = model.predict(x_test_pca)
Performance Analysis of the model in terms of different performance metrics
print('Training Accuracy of the Model: ', metrics.accuracy_score(y_train, y_train_pred))
print('Test Accuracy of the Model: ', metrics.accuracy_score(y_test, y_test_pred))
print()
print('Training Precision of the Model: ', metrics.precision_score(y_train, y_train_pred))
print('Test Precision of the Model: ', metrics.precision_score(y_test, y_test_pred))
Visualization of Decision Surface
We can create a decision boundary by fitting the model on the training data, then using the same model to make predictions for a grid of values for the input domain.
Once we have the grid of predictions, we can plot the values and their class label.
The best approach to visualizing decision boundaries is a contour plot, which interpolates the colors between the points. We can use the contourf() function for plotting the decision surface.
We have to follow specific steps.
Firstly, we need to define the grid points in the whole feature space.
To do this, we first find the maximum and minimum values of each feature and extend the range one step beyond them to ensure that the whole feature space is covered.
The arange() function from NumPy creates a uniform sample at a particular resolution across each dimension. We then use the meshgrid() function to create a grid from the two input vectors.
The contourf() function takes a separate grid for each axis. We then plot the decision surface with a two-color colormap. A sketch that puts these steps together follows.
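Here is one way to build the grids (xx_train, yy_train, Z_train and their test counterparts) used in the plotting code below; the helper name, the 0.1 grid resolution, and the one-unit margin are our assumptions.

def make_decision_grid(model, x_pca):
    # Extend the bounds by one unit so the whole feature space is covered
    x_min, x_max = x_pca[:, 0].min() - 1, x_pca[:, 0].max() + 1
    y_min, y_max = x_pca[:, 1].min() - 1, x_pca[:, 1].max() + 1
    # Uniformly sample each dimension with arange(), then pair the vectors into a grid
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                         np.arange(y_min, y_max, 0.1))
    # Predict a class label for every grid point and reshape back to the grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    return xx, yy, Z

xx_train, yy_train, Z_train = make_decision_grid(model, x_train_pca)
xx_test, yy_test, Z_test = make_decision_grid(model, x_test_pca)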
plt.figure(figsize = (20, 6))
plt.subplot(1, 2, 1)
plt.contourf(xx_train, yy_train, Z_train)
plt.scatter(x_train_pca[:, 0], x_train_pca[:, 1], c = y_train, s = 30, edgecolor = 'k')
plt.xlabel('Training 1st Principal Component')
plt.ylabel('Training 2nd Principal Component')
plt.title('Scatter Plot with Decision Boundary for the Training Set')
plt.subplot(1, 2, 2)
plt.contourf(xx_test, yy_test, Z_test)
plt.scatter(x_test_pca[:, 0], x_test_pca[:, 1], c = y_test, s = 30, edgecolor = 'k')
plt.xlabel('Test 1st Principal Component')
plt.ylabel('Test 2nd Principal Component')
plt.title('Scatter Plot with Decision Boundary for the Test Set')
plt.show()
We can see how the contourf() function plotted a clean decision boundary. With the help of the plots above, we can visualize how the input features are assigned their class labels.
Frequently Asked Questions
How is the optimal decision boundary determined? A classification rule partitions the feature space and assigns all points in a partition to the same class. The 'boundary' of this partitioning is the rule's decision boundary, and the boundary produced by the best such rule, one that minimizes misclassification, is the optimal decision boundary.
How do you determine decision boundaries in logistic regression? The logistic regression decision boundary is the set of all points x that satisfy P(y=1|x) = P(y=0|x) = ½, i.e., the points where the model's linear score wᵀx + b equals zero, since the sigmoid outputs ½ exactly there.
How does a decision tree's decision boundary differ from that of logistic regression? Logistic regression divides the feature space into exactly two halves with a single line, whereas a decision tree recursively splits the space into smaller and smaller regions.
What kind of decision boundary is built by a logistic regression classifier? In the case of logistic regression, the decision boundary is a straight line, i.e., it comes up with a hyperplane, like an SVM, that divides the feature space into two different classes.
Key Takeaways
Let us briefly recap the article.
First, we saw what a decision boundary is and how it enhances our visualization by clearly showing how data inputs are classified. Then, we implemented and plotted a decision boundary using a logistic regression classifier.
I recommend applying the same steps with another classification model to understand the idea better. In this way, we can use both the decision surface and performance metrics to evaluate a model's performance.
That is the end of the article. Stay tuned for more exciting articles.
Keep Learning Ninjas!