Code360 powered by Coding Ninjas X
Last Updated: Mar 27, 2024

Batch Normalization - Introduction



Batch normalization, often called Batch Norm, is a widely used technique in Deep Learning. It speeds up the training of Neural Networks and provides regularization, helping to avoid overfitting.

Batch Norm is a critical component of the modern deep learning practitioner's toolkit. Soon after it was introduced in the original Batch Normalization paper, it was recognized as transformative, enabling deeper neural networks to be trained faster.

Batch Norm is a neural network layer that is now found in many architectures. It is frequently included as part of a Linear or Convolutional block, and it helps stabilize the network during training.

Let's start with normalization to understand better how Batch Norm works and why it's necessary.


The term "normalization" refers to a pre-processing technique for standardizing data: put another way, bringing data from multiple sources into the same range. Failing to normalize the data before training can cause problems in our network, making it drastically harder to train and slowing its learning.

Let's pretend we run a bike rental service. We want to estimate a reasonable price for each bike based on data from competitors. Each bike has two features: its age in years and the number of kilometers driven. Ages span a small range, from 0 to 20 years, while distances driven span a much larger one, from 0 to hundreds of thousands of kilometers. We don't want such disparities in ranges, since the model could overvalue the feature with the larger range simply because its values are bigger.

To normalize our data, we have two options. The easiest method is to scale it into a small fixed range using the feature's mean and its minimum and maximum values:

x_norm = (x - m) / (x_max - x_min)

Here x is the data point to normalize, m is the data set's mean, x_max is the largest value, and x_min is the smallest. This technique is commonly applied to input data. In Neural Networks, non-normalized data points with broad ranges can induce instability: large inputs can cascade down through the layers, resulting in issues like exploding gradients.
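The scaling described above can be sketched in a few lines of NumPy. The feature values below are made up for illustration, echoing the bike-rental example:

```python
import numpy as np

# Toy "bike rental" features (illustrative values, not real data):
ages = np.array([1.0, 5.0, 10.0, 20.0])        # age in years
kms = np.array([500.0, 30000.0, 120000.0, 200000.0])  # kilometers driven

def mean_normalize(x):
    """Scale a feature using its mean and range: (x - m) / (x_max - x_min)."""
    return (x - x.mean()) / (x.max() - x.min())

print(mean_normalize(ages))
print(mean_normalize(kms))
```

After scaling, both features have zero mean and a total range of at most 1, so neither dominates the other purely because of its units.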

Another method for normalizing data is to force the data points to have a mean of 0 and a standard deviation of 1:

z = (x - m) / s

Here x is the data point to normalize, m is the data set's mean, and s is the data set's standard deviation. Each feature is now distributed like a standard normal distribution. With all of the features on this scale, none of them dominates the others, allowing our models to learn more effectively.
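Standardization is equally short in NumPy; a minimal sketch with made-up values:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# Force zero mean and unit standard deviation: z = (x - m) / s
z = (x - x.mean()) / x.std()

print(z.mean(), z.std())  # approximately 0.0 and 1.0
```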

This last technique is used in Batch Norm to normalize data batches throughout the network.



The effect of normalizing data can be seen in the image above. The original values (in blue) have been moved to the center of the graph (in red). As a result, all feature values are now on the same scale.


Batch Normalization

Batch Norm is a normalization technique applied between the layers of a Neural Network rather than to the raw data. It is performed over mini-batches instead of the entire data set. Its purpose is to ease learning by speeding up training and allowing higher learning rates.

Normalization is usually applied to the input data, but it also makes sense to keep the data flowing inside the network normalized.

Batch Norm enforces this normalization internally, on the values passed between the layers of a neural network. This internal normalization limits the covariate shift of the activations inside the layers.

As previously stated, the BN approach works by applying a sequence of operations to the data that enters the BN layer.

Using the standardization technique described in the preceding section, we may define the normalizing formula of Batch Norm as follows:

z_norm = (z - m_z) / s_z

m_z: mean of the neuron's output over the mini-batch

s_z: standard deviation of the neuron's output over the mini-batch
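The per-neuron normalization above can be sketched over a mini-batch in NumPy. The batch values are made up; the small eps constant is an assumption added for numerical stability, as is standard in BN implementations:

```python
import numpy as np

# Mini-batch of neuron outputs: 4 samples x 3 neurons (illustrative values).
z = np.array([[1.0, 2.0, 3.0],
              [2.0, 0.0, 1.0],
              [0.0, 4.0, 5.0],
              [3.0, 2.0, 7.0]])

m_z = z.mean(axis=0)   # per-neuron mean over the mini-batch
s_z = z.std(axis=0)    # per-neuron standard deviation over the mini-batch
eps = 1e-5             # small constant to avoid division by zero

z_norm = (z - m_z) / np.sqrt(s_z**2 + eps)

print(z_norm.mean(axis=0))  # each neuron now has mean ~0 over the batch
print(z_norm.std(axis=0))   # and standard deviation ~1
```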

How Batch Normalization is Applied

A standard feed-forward Neural Network is shown in the image below: the inputs are xi, the neuron's output is z, the activation function's output is a, and the network's output is y:



Before applying the activation function, Batch Norm – indicated by a red line in the image – is applied to the neurons' output. 

A neuron without Batch Norm is computed as follows, with the model learning the parameters w and b:

z = g(x, w, b)
a = f(z)

g( ): the linear transformation of the neuron (z = w · x + b)

w: weights of the neuron

b: bias of the neuron

f( ): activation function


Adding Batch Norm, it looks as follows:

z = g(x, w)
z_norm = (z - m_z) / s_z
ẑ = γ · z_norm + β
a = f(ẑ)

Note that the bias b can be dropped here, since the shift β takes over its role.

m_z: mean of the neuron's output over the mini-batch

s_z: standard deviation of the neuron's output over the mini-batch

γ, β: learnable parameters of Batch Norm

γ and β are parameter vectors used to scale (γ) and shift (β) the vector of normalized values. Both are learnable: during training, the network learns the values of γ and β that best normalize each mini-batch.
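Putting the pieces together, a minimal NumPy sketch of the BN forward pass with learnable scale and shift might look like this (function name, eps, and initial values of γ and β are assumptions; in a real network γ and β would be updated by the optimizer):

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Normalize each neuron's output over the mini-batch, then scale and shift."""
    m_z = z.mean(axis=0)              # per-neuron mean
    var_z = z.var(axis=0)             # per-neuron variance
    z_norm = (z - m_z) / np.sqrt(var_z + eps)
    return gamma * z_norm + beta      # learnable scale (gamma) and shift (beta)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))           # batch of 8 samples, 4 neurons
gamma = np.ones(4)                    # common initialization: scale = 1
beta = np.zeros(4)                    # common initialization: shift = 0
out = batch_norm_forward(z, gamma, beta)
```

With γ = 1 and β = 0 the output is simply the normalized batch; as training adjusts γ and β, the network can recover any scale and shift that helps learning.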

Benefits of Batch Normalization

  • The network is less sensitive to hyperparameter tuning. Higher learning rates that previously produced non-viable models become acceptable.
  • Gradients become less dependent on the scale of the parameters or their initial values.
  • Covariate shift within the neural network is reduced.
  • Incorporating Batch Normalization into deep neural networks reduces training time.

Frequently Asked Questions

Q1) What is bias?

Bias is the gap between the model's average prediction and the actual values it needs to predict.

Q2) What is Variance?

Variance describes the spread (or clustering) of the model's outputs for a data point. The higher the variance, the more likely the model has focused on the training data and fails to generalize to data it has never observed.

Q3) Is it possible to train a neural network model by setting all biases to 0?

Yes. Even if all of the biases are set to zero, the neural network can still learn, since the randomly initialized weights break the symmetry between neurons.

Q4) Why is it critical to include nonlinearities in a neural network?

Without non-linearities, no matter how many layers are present, a neural network behaves like a single linear model, with the output being linearly dependent on the input.
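This collapse of stacked linear layers is easy to demonstrate: two weight matrices applied in sequence are equivalent to one combined matrix (the matrices below are random, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 4))   # first "layer" weights
W2 = rng.normal(size=(2, 3))   # second "layer" weights
x = rng.normal(size=4)

# Two stacked linear layers without activations...
deep = W2 @ (W1 @ x)
# ...equal one linear layer whose weights are the product W2 @ W1.
shallow = (W2 @ W1) @ x

print(np.allclose(deep, shallow))  # True
```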

Key Takeaways

The BN algorithm isn't nearly as hard as it appears. Modern machine learning libraries like TensorFlow and PyTorch abstract away the complexity of implementing BN inside a neural network, which makes adding it a breeze.
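For instance, in PyTorch a Batch Norm layer is a single line; a minimal sketch (the layer sizes here are arbitrary) inserts `nn.BatchNorm1d` between a linear layer and its activation, matching the placement described above:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 8),
    nn.BatchNorm1d(8),   # normalizes each of the 8 features over the batch
    nn.ReLU(),
    nn.Linear(8, 1),
)

x = torch.randn(16, 4)   # a mini-batch of 16 samples
y = model(x)
print(y.shape)           # torch.Size([16, 1])
```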
