Last Updated: Mar 27, 2024
Difficulty: Easy

## Introduction

You must have heard the terms bias and variance, even if you're new to the domain. Still, it's common for budding data scientists to confuse the two. It's essential to understand that no machine learning model can be 100% accurate; as a matter of fact, it isn't even supposed to be. There will always be some prediction error, and bias and variance are two of its sources. Understanding the bias-variance tradeoff is an integral part of a data scientist's learning path.

## Bias

Bias is the systematic error in a machine learning model that arises from incorrect or overly simple assumptions in the learning process. It can be defined as the difference between the model's average prediction and the actual results. Essentially, it describes how well the model captures the trends in the training data set.

• A model that doesn't capture the trends in the training data set well is said to show high bias.
• A model with low bias closely follows the trends in the data set.

Characteristics of a high bias model include:

• Failure to capture proper data trends
• Likely to underfit
• Gives an overly simplified view of the data
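
To make these characteristics concrete, here is a minimal sketch of a high-bias model. The data (a quadratic trend with Gaussian noise), the straight-line model, and the helper names `fit_line` and `mse` are all assumptions chosen for illustration. Because a straight line cannot bend, its error stays large even on the data it was trained on.

```python
import random

random.seed(0)

# Synthetic data with a curved trend: y = x^2 + Gaussian noise (assumed setup).
xs = [i / 10 for i in range(-30, 31)]
ys = [x ** 2 + random.gauss(0, 0.5) for x in xs]

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b, using the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def mse(preds, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

a, b = fit_line(xs, ys)
line_train_mse = mse([a * x + b for x in xs], ys)
# Error of the true curve on the same data: this part is noise only.
noise_mse = mse([x ** 2 for x in xs], ys)

print(f"straight line, training MSE: {line_train_mse:.2f}")
print(f"true curve, training MSE:    {noise_mse:.2f}")
```

The gap between the two numbers is the part of the error caused by the model being too simple for the trend, which is what high bias looks like in practice.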

## Variance

In practice, variance can be defined as the model's sensitivity to changes in the training data set, that is, how much its predictions shift when it is trained on different data.

It is the variability in the model's predictions: how much the fitted function adjusts to changes in the data set. More complex models tend to have higher variance. Models with high bias typically have low variance, and vice versa.

Characteristics of a high variance model include:

• Fits the noise in the dataset
• Likely to overfit
• Non-generalised, overly complex model
• Tries to account for outliers
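
The sketch below illustrates this behaviour under some assumed choices: the true trend is a sine curve, and the model is a simple nearest-neighbour regressor (`knn_predict`, a hypothetical helper written for this example). With k = 1 the model is very flexible and memorises its training set; with k equal to the training-set size it simply predicts the mean. We retrain both on many freshly sampled training sets and compare how much their predictions at one point move around.

```python
import math
import random
import statistics

def true_f(x):
    # Assumed ground-truth trend, chosen only for illustration.
    return math.sin(2 * math.pi * x)

def sample_training_set(rng, n=30, noise_sd=0.3):
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def knn_predict(xs, ys, x0, k):
    """Average the targets of the k training points closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

rng = random.Random(1)
x0 = 0.25  # point at which we compare predictions
preds_flexible, preds_rigid = [], []
for _ in range(200):
    xs, ys = sample_training_set(rng)
    preds_flexible.append(knn_predict(xs, ys, x0, k=1))
    preds_rigid.append(knn_predict(xs, ys, x0, k=len(xs)))

var_flexible = statistics.pvariance(preds_flexible)
var_rigid = statistics.pvariance(preds_rigid)

# On its own training set, the k=1 model reproduces every target exactly.
memorised = all(knn_predict(xs, ys, x, k=1) == y for x, y in zip(xs, ys))

print(f"variance of predictions, k=1:  {var_flexible:.3f}")
print(f"variance of predictions, k=30: {var_rigid:.3f}")
print("k=1 memorises its training set:", memorised)
```

The flexible model follows the noise in whichever training set it happens to see, so its prediction at the same point changes noticeably from one training set to the next.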

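The relationship can be expressed mathematically as follows.

Let the variable we are predicting be $Y$ and its covariates be $X$. We assume there is a relationship between the two variables given by

$$Y = f(X) + e$$

where $e$ is the error term, normally distributed with a mean of 0 and a variance of $\sigma_e^2$.

We can build a model $\hat{f}(X)$ of $f(X)$ using linear regression or any other modeling technique.

The expected squared error at a point $x$ is then

$$\mathrm{Err}(x) = E\left[\left(Y - \hat{f}(x)\right)^2\right]$$

This term can be further expanded as

$$\mathrm{Err}(x) = \left(E[\hat{f}(x)] - f(x)\right)^2 + E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right] + \sigma_e^2$$

$$\mathrm{Err}(x) = \text{Bias}^2 + \text{Variance} + \text{Irreducible error}$$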

Here the irreducible error cannot be reduced, regardless of how well the model is trained. It comes from flaws in the dataset, not from the training of the model itself. Real-world data is rarely, if ever, perfect; it will always contain some noise.
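
The split of the expected squared error into squared bias, variance, and irreducible error can be checked numerically. The following is a rough simulation sketch, not a proof: the sine-shaped true function, the noise level, and the 5-nearest-neighbour model are all assumptions made for this example. We train the model on many independent training sets, estimate each term at a single point, and compare their sum with the directly measured error.

```python
import math
import random
import statistics

def true_f(x):
    # Assumed ground-truth function f(x).
    return math.sin(2 * math.pi * x)

def knn_predict(xs, ys, x0, k):
    """Average the targets of the k training points closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

rng = random.Random(42)
noise_sd = 0.3
x0, k, n_train, n_runs = 0.25, 5, 30, 2000

preds = []      # model prediction at x0 for each independently drawn training set
sq_errors = []  # squared error against a fresh noisy observation at x0
for _ in range(n_runs):
    xs = [rng.random() for _ in range(n_train)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    pred = knn_predict(xs, ys, x0, k)
    y_new = true_f(x0) + rng.gauss(0, noise_sd)
    preds.append(pred)
    sq_errors.append((y_new - pred) ** 2)

bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2
variance = statistics.pvariance(preds)
irreducible = noise_sd ** 2
expected_error = statistics.fmean(sq_errors)

print(f"bias^2 + variance + irreducible: {bias_sq + variance + irreducible:.3f}")
print(f"directly measured error:         {expected_error:.3f}")
```

The two printed values should agree up to simulation noise. Note that the irreducible term is fixed by `noise_sd` alone; nothing about the model changes it.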

The bulls-eye diagram is a good visual representation of the possible combinations of bias and variance. Clearly, low bias with low variance is the most desirable outcome, while high bias with high variance is the least desirable.

## What makes the bias-variance tradeoff unavoidable?

A model with fewer parameters than required is likely to underfit, which leads to high bias, while a model with too many parameters is likely to overfit, which leads to high variance.

The right number of parameters is essential to balance bias and variance appropriately. In other words, we avoid both an overly simplified model and an overly complex one.

The aim should be to minimise the total error which was discussed earlier. A low total error signifies a good balance between bias and variance.

As we can see in the graph above, an increase in either variance or bias increases the total error of the model. The optimum is reached only when bias and variance are balanced.
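
The shape of that total-error curve can be reproduced with a small experiment. This is a sketch under assumed settings: a sine-shaped true function, Gaussian noise, and nearest-neighbour regression, where the number of neighbours k stands in for model complexity (small k is flexible, large k is rigid). We measure the error on fresh test data for several values of k.

```python
import math
import random

def true_f(x):
    # Assumed ground-truth function.
    return math.sin(2 * math.pi * x)

def sample(rng, n, noise_sd=0.3):
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

rng = random.Random(7)
ks = [1, 3, 5, 10, 20, 30]  # small k = flexible model, large k = rigid model
n_train, n_test, n_runs = 30, 100, 100
total_sq_error = {k: 0.0 for k in ks}

for _ in range(n_runs):
    train_x, train_y = sample(rng, n_train)
    test_x, test_y = sample(rng, n_test)
    for x0, y0 in zip(test_x, test_y):
        # Sort training targets by distance once, then reuse them for every k.
        by_distance = [y for _, y in sorted(zip(train_x, train_y),
                                            key=lambda p: abs(p[0] - x0))]
        for k in ks:
            pred = sum(by_distance[:k]) / k
            total_sq_error[k] += (y0 - pred) ** 2

test_mse = {k: total_sq_error[k] / (n_runs * n_test) for k in ks}
best_k = min(test_mse, key=test_mse.get)

for k in ks:
    print(f"k={k:2d}  test MSE={test_mse[k]:.3f}")
print("lowest error at k =", best_k)
```

In this setup the error is high at both extremes and lowest somewhere in between: the most flexible model suffers from variance, the most rigid one from bias.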

## Frequently Asked Questions

1. Define the conditions of underfitting and overfitting.
A model is said to be underfitting when it doesn't show satisfactory results even on the training split; in other words, the model is oversimplified. A model is said to be overfitting when it shows good results on the training split but does not achieve similar results on the test split. In that case, the model isn't generalized enough.
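
One way to see both conditions is to compare training and test error side by side. The sketch below assumes a sine-shaped trend, a simple hold-out split, and a nearest-neighbour regressor; the labels "overfit", "balanced", and "underfit" are our own names for three settings of k, not standard terms of any library.

```python
import math
import random

def true_f(x):
    # Assumed ground-truth function.
    return math.sin(2 * math.pi * x)

def knn_predict(xs, ys, x0, k):
    """Average the targets of the k training points closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def mse(fit_x, fit_y, eval_x, eval_y, k):
    """Error of a k-NN model fitted on (fit_x, fit_y), measured on (eval_x, eval_y)."""
    preds = [knn_predict(fit_x, fit_y, x, k) for x in eval_x]
    return sum((p - y) ** 2 for p, y in zip(preds, eval_y)) / len(eval_y)

rng = random.Random(3)
xs = [rng.random() for _ in range(400)]
ys = [true_f(x) + rng.gauss(0, 0.3) for x in xs]

# Simple hold-out split: first half for training, second half for testing.
train_x, train_y = xs[:200], ys[:200]
test_x, test_y = xs[200:], ys[200:]

results = {}
for label, k in [("overfit", 1), ("balanced", 10), ("underfit", 200)]:
    train_err = mse(train_x, train_y, train_x, train_y, k)
    test_err = mse(train_x, train_y, test_x, test_y, k)
    results[label] = (train_err, test_err)
    print(f"{label:9s} k={k:3d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

The overfitting model has zero training error but a clearly higher test error, while the underfitting model does poorly on both splits.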

2. How do we find the right balance between bias and variance?
Choosing the right parameters for the model is essential for maintaining the balance. We avoid both an overly simplified model and an overly complex one.
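
A common way to choose such a parameter in practice is cross-validation, which is a technique we are bringing in here rather than one described above. The sketch below is a minimal, assumed setup: `cross_val_mse` is a hypothetical helper, the model is a nearest-neighbour regressor, and the candidate values of k are arbitrary.

```python
import math
import random

def true_f(x):
    # Assumed ground-truth function.
    return math.sin(2 * math.pi * x)

def knn_predict(xs, ys, x0, k):
    """Average the targets of the k training points closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def cross_val_mse(xs, ys, k, n_folds=5):
    """Mean squared error of k-NN regression under n_folds-fold cross-validation."""
    n = len(xs)
    total = 0.0
    for fold in range(n_folds):
        held_out = range(fold, n, n_folds)  # every n_folds-th point is held out
        fit_x = [xs[i] for i in range(n) if i not in held_out]
        fit_y = [ys[i] for i in range(n) if i not in held_out]
        for i in held_out:
            total += (ys[i] - knn_predict(fit_x, fit_y, xs[i], k)) ** 2
    return total / n

rng = random.Random(11)
xs = [rng.random() for _ in range(200)]
ys = [true_f(x) + rng.gauss(0, 0.3) for x in xs]

candidates = [1, 3, 9, 27, 81, 160]  # 160 = all points in a training fold
cv_errors = {k: cross_val_mse(xs, ys, k) for k in candidates}
best_k = min(cv_errors, key=cv_errors.get)

for k in candidates:
    print(f"k={k:3d}  cross-validated MSE={cv_errors[k]:.3f}")
print("selected k =", best_k)
```

Because every point is predicted by a model that never saw it, the cross-validated error penalises both the overly flexible and the overly rigid settings, and the selected value sits between them.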

3. What is total error?
Total error = bias² + variance + irreducible error.
Bias and variance can be reduced through modelling choices. The irreducible error, however, results from noise in the dataset and not from the training of the model.
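
As a quick worked example with made-up numbers (a bias of 0.2, a variance of 0.05, and an irreducible error of 0.09, all hypothetical):

```latex
\text{Total error} = \text{bias}^2 + \text{variance} + \text{irreducible error}
                   = (0.2)^2 + 0.05 + 0.09
                   = 0.04 + 0.05 + 0.09
                   = 0.18
```

Even if bias and variance were both driven to zero in this example, the total error would still be 0.09.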

## Conclusion

As a beginner, it is essential to know what causes errors in a model's predictions and how to rectify them. The bias-variance tradeoff is something that needs to be understood very clearly. This blog explained what causes the imbalance between bias and variance and how it can be dealt with. If you want to take a step further in your journey to become an industry-ready data scientist, you may check out our course on Data Science and Machine Learning, curated by industry experts. Happy Coding!!
