Table of contents
1. Introduction
2. Why use Feature Scaling?
2.1. Gradient Descent Concept
2.2. Distance-Based Algorithms
2.3. Assumption of Normal Distribution
3. Normalization v/s Standardization v/s Robust Scaling
4. Types of Feature Scaling techniques and how to do it
4.1. Standard Scaler/ Standardization
4.2. Min-Max Scaler/ Normalization
4.3. Robust Scaler
4.4. Gaussian Transformation
4.4.1. Logarithmic Transformation
4.4.2. Reciprocal Transformation
4.4.3. Square Root Transformation
4.4.4. Exponential Transformation
4.4.5. Box-Cox Transformation
5. Frequently Asked Questions
6. Key Takeaways
Last Updated: Aug 13, 2025

Feature Scaling

Author Toohina Barua

Introduction

What should we do when our data has features in different units, such as kilograms, metres, and litres, with a massive disparity between their values, and we want our model to stay unbiased towards the features with larger magnitudes? We feature scale!

During data preprocessing, we transform the data so that the model can learn from it without any issues. One such step is feature scaling, which transforms the dataset's features onto a common, restricted range of values.

Why use Feature Scaling?

Consider the house price prediction dataset, which has several attributes with a broad range of values. Many features will be included, such as the number of bedrooms, the house's square footage, and so on.

As you might expect, the number of beds will range from one to five, but the square footage will be between 500 and 2000. In terms of the range of both features, this is a significant difference.

Many machine learning algorithms that use Euclidean distance to evaluate similarity will effectively ignore the smaller-ranged feature, in this case the number of bedrooms, even though it can be an essential signal in the real world.

 

Let's take a closer look at how we could use it in different machine learning algorithms:

Gradient Descent Concept 

The goal of linear regression is to identify the best-fit line. To do so, we use gradient descent to locate the global minimum of the cost function. If we scale the data, the cost contours become more symmetric, so gradient descent reaches the global minimum faster instead of zig-zagging across elongated contours.

Source

Where,

w1, w2 = Parameters/ Weights of the model

x1, x2 = Values of the features respectively 

J(w)= Cost function
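To see this effect in code, here is a minimal sketch (not part of the original walkthrough) that runs plain batch gradient descent on a single large-scale feature, once on the raw values and once after standardizing. The feature range, learning rate, and iteration count are illustrative assumptions.

import numpy as np
# toy data: one feature with a large scale (e.g., square footage); values are made up
rng = np.random.default_rng(0)
x = rng.uniform(500, 2000, size=100)          # raw feature
y = 150 * x + rng.normal(0, 1000, size=100)   # target with some noise
def gradient_descent(x, y, lr, n_iter=200):
    # plain batch gradient descent for y ~ w*x + b; returns the final mean squared error
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        error = w * x + b - y
        w -= lr * (2 / n) * np.dot(error, x)
        b -= lr * (2 / n) * error.sum()
    return np.mean((w * x + b - y) ** 2)
# unscaled: the learning rate has to be tiny to avoid divergence, so progress is slow
print("MSE on raw feature:   ", gradient_descent(x, y, lr=1e-7))
# standardized: the same number of steps gets much closer to the minimum
x_scaled = (x - x.mean()) / x.std()
print("MSE on scaled feature:", gradient_descent(x_scaled, y, lr=0.1))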

 

Distance-Based Algorithms 

In algorithms like KNN, K-means, and Hierarchical Clustering, we use Euclidean distance to locate the closest points; therefore, we should scale the data so that all attributes are equally weighted. If we don’t do this, features with a large magnitude will be weighted far more heavily in distance computations than features with a small extent.

For instance, suppose we have a dataset containing the height (metres) and weight (kg) of females (red) and males (blue) of the same age. We are given the height and weight of a new person (black) and have to predict their gender; in other words, we must classify the new person into one of the two categories, female or male, based on their height and weight.

Source

Let us look at the KNN algorithm. Assuming k is 3, the black dot appears much closer to two of the blue dots than to the nearest red dot, so we would naturally classify it as male. However, the weight in kg has a much larger range than the height in metres, which biases the distance computation, and hence the model, towards that result. What should we do about it? We scale the features! After scaling, the graph will look something like this:

 

 

Source
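We can reproduce this effect numerically in a few lines. The heights and weights below are made up purely for illustration; the point is how the distances change once both columns are standardized.

import numpy as np
from sklearn.preprocessing import StandardScaler
# hypothetical neighbours: [height in metres, weight in kg]
neighbours = np.array([[1.58, 55.0],   # red (female)
                       [1.80, 72.0],   # blue (male)
                       [1.78, 70.0]])  # blue (male)
new_person = np.array([[1.60, 68.0]])  # black (to be classified)
# raw Euclidean distances: the weight column (tens of kg) dominates the height column (metres)
print(np.linalg.norm(neighbours - new_person, axis=1))
# after standardizing both columns, height contributes on an equal footing,
# and the ordering of the nearest neighbours can change
scaler = StandardScaler().fit(neighbours)
print(np.linalg.norm(scaler.transform(neighbours) - scaler.transform(new_person), axis=1))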

 

Assumption of Normal Distribution

Some models, such as linear and logistic regression, assume that the features are normally distributed. To make skewed features approximately normal, we can use transformations such as the logarithmic, Box-Cox, exponential, and others.

Normalization v/s Standardization v/s Robust Scaling

The debate about normalization vs. standardization vs. robust scaling is a perennial one among machine learning newcomers. In this section, I'll break down when each one is appropriate.

Normalization

When you know that your data does not follow a Gaussian distribution, normalization is a suitable option. It is helpful in algorithms like K-Nearest Neighbours and Neural Networks, which do not presume any particular data distribution.

Normalization rescales values into the range 0 to 1. It does not remove outliers; they simply end up squeezed towards the edges of that bounded range along with the rest of the data.

Standardization

In circumstances where the data follows a Gaussian distribution, standardization can be beneficial, although this is not a strict requirement.

Standardization, unlike normalization, does not have a bounding range. It subtracts the mean and then divides by the standard deviation, so that the mean becomes 0 and the scales become comparable in terms of standard deviation. Because the mean and standard deviation are themselves sensitive to outliers, outliers can distort the result.

Robust Scaling

If your dataset has many outliers, the two procedures above, standardization and normalization, may not perform as well. This problem can be solved with robust scaling.

We can handle outliers by scaling the dataset with the RobustScaler in Scikit-learn, which limits the range of most values while still allowing the outliers to contribute to feature importance and model performance.

 

Source

However, whether you use normalization, standardization, or robust scaling ultimately depends on your problem and the machine learning algorithm you're using. There is no hard and fast rule for when to normalize or standardize your data. To get the best results, fit your model to the raw, normalized, and standardized data and compare the outcomes. And if your dataset contains a lot of outliers, try robust scaling.
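As a quick comparison, the sketch below scales the same made-up salary column, which contains one extreme outlier, with each of the three scalers and prints the resulting summary statistics. The numbers are purely illustrative.

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# a made-up salary column (in thousands) with one extreme outlier
salaries = np.array([[38], [42], [45], [47], [51], [55], [60], [500]], dtype=float)
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(salaries).ravel()
    name = scaler.__class__.__name__
    print(f"{name:>14}: min={scaled.min():7.2f} max={scaled.max():7.2f} median={np.median(scaled):6.2f}")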

Types of Feature Scaling techniques and how to do it

To demonstrate different feature scaling strategies, we'll use the SciKit-Learn library. You can download the Salary CSV file from here. Let's start coding! We are going to look at the data first before scaling it.

#imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#reading the csv file into a dataframe
data = pd.read_csv("../Downloads/Salary_Data.csv")
#helper function to plot a scatter graph before/after feature scaling
def make_scatter_plot(data):
    x = data["YearsExperience"].to_numpy()
    y = data["Salary"].to_numpy()
    plt.xlabel("Years of Experience")
    plt.ylabel("Salary")
    plt.scatter(x, y)
    plt.show()
print("Data before Scaling")
data.head(5)

 

 

make_scatter_plot(data)#calling function to make graph

 

Standard Scaler/ Standardization

In this method, we center the feature at 0 with a standard deviation of 1, bringing all of the features to a similar scale. This technique is affected by outliers, so we use it mainly with (approximately) normally distributed features.

 

z = (x – mean) / standard deviation

 

Rescaling features so that they have zero mean and unit variance is beneficial for optimization methods like gradient descent, which are used in machine learning techniques that weight inputs (e.g., regression and neural networks). Rescaling is also used for algorithms that rely on distance measurements, such as K-Nearest Neighbours (KNN).

 

 

Source

 

from sklearn.preprocessing import StandardScaler #import
standard_scaler=StandardScaler()
standard_scaled_data=pd.DataFrame(standard_scaler.fit_transform(data),columns=data.columns)
print("Data after Standard Scaling")
standard_scaled_data.head(5)
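A quick sanity check on the standard_scaled_data frame from above confirms what the scaler did: every column should now have a mean of roughly 0 and a standard deviation of roughly 1.

# means should be ~0 and (population) standard deviations ~1 after standardization
print(standard_scaled_data.mean().round(3))
print(standard_scaled_data.std(ddof=0).round(3))  # ddof=0 matches StandardScaler's formula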

 

make_scatter_plot(standard_scaled_data)#calling function to make graph

 

Min-Max Scaler/ Normalization 

This estimator scales each feature separately so that it falls inside a given range, usually between zero and one. This method is primarily used in deep learning, and it can also be employed when the distribution is not Gaussian. Note that because the minimum and maximum define the range, outliers strongly influence this scaler.

This scaling strategy is applied when the features have widely varying ranges and the algorithm used to train on the data, such as an Artificial Neural Network, does not make assumptions about the data distribution.

 

x scaled = (x – x min) / (x max – x min)

 

Where,

x= any value in the feature column

x min= Minimum value in the feature column

x max= Maximum value in the feature column

x scaled= x after being scaled

 

from sklearn.preprocessing import MinMaxScaler #imports
minmax_scaler=MinMaxScaler()
minmax_scaled_data=pd.DataFrame(minmax_scaler.fit_transform(data),columns=data.columns)
print("Data after MinMax Scaling")
minmax_scaled_data.head(5)
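Similarly, we can verify on the minmax_scaled_data frame from above that every column now spans exactly 0 to 1.

# after Min-Max scaling, the minimum of each column is 0 and the maximum is 1
print(minmax_scaled_data.min())
print(minmax_scaled_data.max())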

 

make_scatter_plot(minmax_scaled_data)#calling function to make graph

 

Robust Scaler

When our data contains outliers, this is a very reliable technique. It removes the median and scales the data according to the interquartile range. Scaling with the mean and standard deviation does not work well when the data contains many outliers, so this scaler uses the median and the interquartile range instead.

Calculate the median (50th percentile), as well as the 25th and 75th percentiles. Each value then has the median subtracted from it and is divided by the interquartile range (IQR), the gap between the 75th and 25th percentiles.

 

value = (value – median) / (p75 – p25)

Where,

p25, p75 = 25th and 75th percentiles
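As a rough check, this formula can be applied by hand to the Salary column with NumPy; with RobustScaler's default settings (centering on the median and scaling by the 25th-75th percentile range), the result should agree with the scaler's output shown in the next code block.

import numpy as np
salary = data["Salary"].to_numpy()
p25, median, p75 = np.percentile(salary, [25, 50, 75])
# manual robust scaling: subtract the median, divide by the interquartile range
manual = (salary - median) / (p75 - p25)
print(manual[:5])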

 

from sklearn.preprocessing import RobustScaler
robust_scaler=RobustScaler()
robust_scaled_data=pd.DataFrame(robust_scaler.fit_transform(data),columns=data.columns)
print("Data after Robust Scaling")
robust_scaled_data.head(5)

 

make_scatter_plot(robust_scaled_data)#calling function to make graph

 

Gaussian Transformation

If our features are not normally distributed and we have skewed data, we can use transformations to make them approximately normally distributed. I am showing the techniques mainly through code, because the code is largely self-explanatory.

import scipy.stats as stat #import for the Box-Cox transformation
import pylab
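Before choosing a transformation, it can help to measure how skewed each column actually is. The small sketch below (an illustrative addition, using the same dataframe as before) prints the skewness of each column; a large positive value suggests that a log, square-root, or Box-Cox transform may help.

# skewness near 0 means roughly symmetric; large positive values indicate a right-skewed column
for column in data.columns:
    print(column, round(stat.skew(data[column]), 3))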

 

We can use the techniques given below:

Logarithmic Transformation

Source

Source

Where,

y= Dependent feature

x= Independent feature

 

In the images above, we see that the logarithmic transformation helps us find the best fit line by making the feature's distribution closer to normal.

data_logarithmic=data.copy()
data_logarithmic["Salary_log"]=np.log(data["Salary"])
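To see the effect on the Salary column itself, we can reuse the same histogram pattern that appears later in this article (a small illustrative addition):

#Salary distribution before the log transformation
plt.title("Salary before log transformation")
sns.histplot(data_logarithmic["Salary"], kde=True, bins=30)
plt.show()
#Salary distribution after the log transformation
plt.title("Salary after log transformation")
sns.histplot(data_logarithmic["Salary_log"], kde=True, bins=30)
plt.show()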

 

Reciprocal Transformation

x=np.arange(5,100,3) #feature
y=1/(x+5) #target
rec_y=x+5 #reciprocal transformation of the target, i.e. 1/y
#target vs feature
plt.xlabel("Feature")
plt.ylabel("Target")
plt.title("Target v/s feature")
sns.regplot(x=x,y=y,data=None)
plt.show()
#1/Target vs feature
plt.xlabel("Feature")
plt.ylabel("1/Target")
plt.title("1/Target v/s feature")
plt.scatter(x,rec_y)
plt.plot(x,rec_y)
plt.show()

 

#data distribution before transformation
plt.title("Before transformation")
plt.xlabel("Target")
sns.histplot(y,kde=True,bins=30)
plt.show()
#data distribution after transformation
plt.title("After transformation")
plt.xlabel("Target")
sns.histplot(rec_y,kde=True,bins=30)
plt.show()

 

data_reciprocal=data.copy()
data_reciprocal["Salary_rec"]=1/data["Salary"]

 

Square Root Transformation

x2=np.arange(5,200,7) #feature
y2=(x2+5)**2 #target
sqroot_y=x2+5 #square root transformation of the target
#target vs feature
plt.xlabel("Feature")
plt.ylabel("Target")
plt.title("Target v/s feature")
sns.regplot(x=x2,y=y2,data=None)
plt.show()
#target^(1/2) vs feature
plt.xlabel("Feature")
plt.ylabel("Target^(1/2)")
plt.title("Target^(1/2) v/s feature")
plt.scatter(x2,sqroot_y)
plt.plot(x2,sqroot_y)
plt.show()

 

#data distribution before transformation
plt.title("Before transformation")
plt.xlabel("Target")
sns.histplot(y2,kde=True,bins=30)
plt.show()
#data distribution after transformation
plt.title("After transformation")
plt.xlabel("Target")
sns.histplot(sqroot_y,kde=True,bins=30)
plt.show()

 

 

data_sqroot=data.copy()
data_sqroot["Salary_sqroot"]=data["Salary"]**(1/2)

 

Exponential Transformation

x3=np.arange(5,100,3) #feature
y3=(x3*5)**(1.2) #target
expo_y=x3*5 #exponential transformation of the target, i.e. target^(1/1.2)
#Target vs feature
plt.xlabel("Feature")
plt.ylabel("Target")
plt.title("Target v/s feature")
sns.regplot(x=x3,y=y3,data=None)
plt.show()
#Target^(1/1.2) vs feature
plt.xlabel("Feature")
plt.ylabel("Target^(1/1.2)")
plt.title("Target^(1/1.2) v/s feature")
plt.scatter(x3,expo_y)
plt.plot(x3,expo_y)
plt.show()

 

#data distribution before transformation
plt.title("Before transformation")
plt.xlabel("Target")
sns.histplot(y3,kde=True,bins=30)
plt.show()
#data distribution after transformation
plt.title("After transformation")
plt.xlabel("Target")
sns.histplot(expo_y,kde=True,bins=30)
plt.show()

 

data_exponential=data.copy()
data_exponential["Salary_exp"]=data["Salary"]**(1/1.2)

 

Box-Cox Transformation

The Box-Cox transformation generalizes the power transformations above. It raises the variable to a power λ (or takes its logarithm when λ = 0) and chooses the λ that makes the result as close to a normal distribution as possible:

T(y) = (y^λ – 1) / λ, if λ ≠ 0
T(y) = log(y), if λ = 0

data_boxcox=data.copy()
data_boxcox["Salary_box"],parameters= stat.boxcox(data["Salary"])
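The parameters variable returned above holds the λ that Box-Cox selected. As an optional visual normality check (an illustrative addition), the pylab import from earlier can be used with stat.probplot to compare the column against a normal distribution before and after the transform:

print("Fitted Box-Cox lambda:", parameters)
#Q-Q plot of the raw Salary column against a normal distribution
stat.probplot(data_boxcox["Salary"], dist="norm", plot=pylab)
pylab.show()
#Q-Q plot of the Box-Cox transformed column
stat.probplot(data_boxcox["Salary_box"], dist="norm", plot=pylab)
pylab.show()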

Frequently Asked Questions

  1. Do all machine learning algorithms require feature scaling?
    Ans. No. Some algorithms, such as Decision Trees and tree-based ensemble techniques (including AdaBoost and XGBoost), do not require feature scaling, because their splits are based on threshold comparisons of individual feature values and are unaffected by rescaling.
     
  2. At which step in the code should we do scaling?
    Ans. It is essential to perform scaling after splitting the data into training and test sets: fit the scaler on the training data only, then use it to transform the test data. If we don't do this, information from the test data leaks into the training process (see the sketch after this list).
     
  3. What is the best scaling technique and why?
    Ans. There is no single most effective scaling or transformation procedure. It depends on the type of data you have and how well you know your problem domain. A technique that improves accuracy and performance for one model may do little for another, or even hurt it.
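Here is a minimal sketch of the split-then-scale order mentioned in question 2, using the Salary dataframe from earlier (the column choice is just for illustration):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = data[["YearsExperience"]]
y = data["Salary"]
# split first, then fit the scaler on the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform on the training set
X_test_scaled = scaler.transform(X_test)        # transform only; no refitting on the test set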

Key Takeaways

In this article, we looked at what Feature Scaling is and how to do it in Python with Scikit-Learn, using StandardScaler for standardization, MinMaxScaler for normalization, RobustScaler for data with outliers, and Gaussian transformations for skewed features.

The technique of scaling the values of features to a more manageable size is usually done during the preprocessing step. If you want to know about more applications of Feature Scaling, then check out our industry-oriented machine learning course curated by our faculty from Stanford University and Industry experts. 
