Table of contents
Need for Optimizers in Deep Learning
Types of Optimizers
Gradient Descent
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent with Momentum
Mini-Batch Gradient Descent
How Do Optimizers Work in Deep Learning?
Forward Pass
Calculate Loss
Backward Pass (Backpropagation)
Update Parameters
Frequently Asked Questions
What makes one optimizer better than another?
Can I use the same optimizer for any deep learning project?
How do I know if an optimizer is working well?
Last Updated: Mar 27, 2024

Optimizers in Deep Learning

Author Rinki Deka


Optimizers in deep learning are like the unseen heroes behind training neural networks, ensuring they learn accurately and efficiently from the data they're fed. Imagine trying to find the best path through a complicated maze; optimizers help the neural network navigate this maze to find the most effective route towards the correct answers. 

Through this article, you'll grasp the essentials of optimizers and their pivotal role in deep learning, covering a range of types from the basic Gradient Descent to more advanced ones like Adam.

Need for Optimizers in Deep Learning

Training a deep learning model is akin to tuning a radio to the perfect frequency. Just as you'd adjust the dials to get clear reception, optimizers tweak the neural network's parameters to minimize errors in predictions. Without optimizers, our models might as well be lost at sea, unable to improve or provide any useful insights. They're crucial for reducing the loss function, a mathematical way of measuring how far off our model's predictions are from the actual outcomes. By iteratively adjusting the weights and biases within the network, optimizers ensure that, over time, the model's guesses get closer and closer to the truth.

Let's dive into a simple code snippet to illustrate the role of an optimizer in a neural network using Python's TensorFlow library:

import tensorflow as tf

# Define a simple sequential model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile the model with an optimizer
model.compile(optimizer='sgd',  # Stochastic Gradient Descent
              loss='categorical_crossentropy')

In this example, we compile a model with the Stochastic Gradient Descent (SGD) optimizer, aiming to minimize the categorical crossentropy loss. The optimizer will adjust the model's weights based on the training data, improving its accuracy over time.

Types of Optimizers

When it comes to teaching a deep learning model, not all methods are created equal. Think of it like choosing the right pair of shoes for a specific sport; you need the right fit to perform your best. In the world of deep learning, we have a variety of optimizers, each designed for specific kinds of tasks and challenges. Let's walk through some of the most common ones:

Gradient Descent

This is the most straightforward approach. Imagine you're at the top of a hill and want to get down to the lowest point. You look around, decide which way is steepest, and take a step in that direction. Repeat until you're at the bottom. In deep learning, Gradient Descent does something similar: it computes the gradient of the loss over the entire training set and then takes a small, steady step in the direction that reduces the error in predictions.

Here's a simple illustration using Python:

def gradient_descent_update(x, grad, learning_rate):
    return x - learning_rate * grad

In this function, x represents the current position (or parameter value), grad is the gradient (or the steepness of the hill), and learning_rate decides how big a step to take.
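
To see the update rule in action, here is a minimal sketch that applies it repeatedly to a toy function, f(x) = (x - 3)^2, whose lowest point is at x = 3. The function, its derivative grad_f, the starting point, and the learning rate of 0.1 are all assumptions chosen for illustration.

```python
def gradient_descent_update(x, grad, learning_rate):
    # One gradient descent step: move against the gradient
    return x - learning_rate * grad


def grad_f(x):
    # Derivative of the toy function f(x) = (x - 3)^2 (hypothetical example)
    return 2.0 * (x - 3.0)


x = 0.0  # arbitrary starting point
for step in range(50):
    x = gradient_descent_update(x, grad_f(x), learning_rate=0.1)

print(x)  # ends up very close to 3, the bottom of the "hill"
```

Each step shrinks the distance to the minimum by a constant factor here, which is why the walk settles near x = 3 instead of overshooting.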

Stochastic Gradient Descent (SGD)

This is like Gradient Descent, but instead of considering the entire dataset to decide the next step, it uses just one or a few data points. This can make the steps seem a bit random (hence "stochastic"), but it's much faster, especially with large datasets.

Example code snippet:

def sgd_update(parameters, gradients, learning_rate):
    # Write each result back by index so the update also works for plain floats
    for i, grad in enumerate(gradients):
        parameters[i] = parameters[i] - learning_rate * grad
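
As a rough illustration of "one data point per step", the sketch below fits a single weight w so that w * x matches y on a tiny, noise-free dataset. The helper name sgd_fit, the data, and the hyperparameters are assumptions made for this example, not part of any library.

```python
import random


def sgd_fit(xs, ys, learning_rate=0.01, epochs=200, seed=0):
    # Fit y = w * x using one sample per update (toy setup)
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for i in indices:
            error = w * xs[i] - ys[i]
            grad = 2.0 * error * xs[i]  # gradient of (w*x - y)^2 for this sample only
            w -= learning_rate * grad
    return w


w = sgd_fit([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(w)  # approaches 2.0, the slope that generated the data
```

Every update looks at just one (x, y) pair, so each step is cheap, and the random visiting order is what makes the path "stochastic".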

Stochastic Gradient Descent with Momentum

Think of this as SGD with a memory. Not only does it consider the current steepness of the hill, but it also remembers the previous steps' directions. This helps in smoothing out the steps and can lead to faster convergence.

A code example:

def sgd_momentum_update(parameters, gradients, velocities, learning_rate, momentum):
    # Update by index: rebinding a loop variable would leave velocities unchanged
    for i, grad in enumerate(gradients):
        velocities[i] = momentum * velocities[i] + learning_rate * grad
        parameters[i] = parameters[i] - velocities[i]
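
To get a feel for the "memory" effect, the sketch below feeds the same gradient into the momentum rule over and over, as if the hill had one long, steady slope. The scalar helper momentum_step and the chosen values (gradient 1.0, learning rate 0.1, momentum 0.9) are assumptions for illustration.

```python
def momentum_step(param, grad, velocity, learning_rate, momentum):
    # One momentum update for a single scalar parameter
    velocity = momentum * velocity + learning_rate * grad
    return param - velocity, velocity


param, velocity = 0.0, 0.0
for _ in range(100):
    # Constant gradient of 1.0: a hypothetical long, steady slope
    param, velocity = momentum_step(param, 1.0, velocity,
                                    learning_rate=0.1, momentum=0.9)

print(velocity)  # levels off near learning_rate / (1 - momentum) = 1.0
```

With a consistent gradient, the velocity builds up to roughly ten times a plain SGD step (0.1) in this setup, which is where the faster convergence comes from. When gradients keep flipping sign, the accumulated terms partly cancel, and that is the smoothing effect.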

Mini-Batch Gradient Descent

This method involves dividing your dataset into small batches and using each batch to update your model's parameters. Here's a basic structure of how it might look in code:

def mini_batch_gradient_descent(model, X, y, learning_rate, batch_size, epochs):
    for epoch in range(epochs):
        for i in range(0, X.shape[0], batch_size):
            X_batch = X[i:i+batch_size]
            y_batch = y[i:i+batch_size]
            gradients = compute_gradients(model, X_batch, y_batch)
            model.parameters -= learning_rate * gradients

In this example, compute_gradients is a function you would define based on your model's architecture to compute gradients for the batch, and model.parameters represents the parameters you're optimizing.
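
For a concrete picture, here is a self-contained sketch in plain Python where the gradient function is written out for the simplest possible model, y = w * x with a squared-error loss. The names compute_gradient and mini_batch_fit, the data, and the hyperparameters are assumptions for this example.

```python
def compute_gradient(w, x_batch, y_batch):
    # Mean gradient of the squared error (w*x - y)^2 over one batch
    total = sum(2.0 * (w * x - y) * x for x, y in zip(x_batch, y_batch))
    return total / len(x_batch)


def mini_batch_fit(xs, ys, learning_rate=0.01, batch_size=2, epochs=200):
    w = 0.0
    for _ in range(epochs):
        for i in range(0, len(xs), batch_size):
            x_batch = xs[i:i + batch_size]  # the last batch may be smaller
            y_batch = ys[i:i + batch_size]
            w -= learning_rate * compute_gradient(w, x_batch, y_batch)
    return w


w = mini_batch_fit([1.0, 2.0, 3.0, 4.0, 5.0], [3.0, 6.0, 9.0, 12.0, 15.0])
print(w)  # approaches 3.0, the slope that generated the data
```

With five samples and a batch size of 2, each epoch performs three updates, and the final batch holds a single sample. In practice the data is usually shuffled before each epoch, which this sketch leaves out to stay short.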


Adagrad

Adagrad adapts the learning rate for each parameter based on the historical gradients. Here's a simplified version of how you might implement it:

import numpy as np

def adagrad_update(parameters, gradients, cache, learning_rate, epsilon=1e-8):
    # parameters, gradients, and cache are parallel lists, indexed together
    for i, grad in enumerate(gradients):
        cache[i] = cache[i] + grad ** 2  # running sum of squared gradients
        adjusted_lr = learning_rate / (np.sqrt(cache[i]) + epsilon)
        parameters[i] = parameters[i] - adjusted_lr * grad

In this function, cache is a list aligned with parameters that stores the sum of squares of the gradients for each parameter, and epsilon is a small number to prevent division by zero.
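
One consequence of summing squared gradients is that the effective learning rate can only shrink. The sketch below tracks it for a single scalar parameter that keeps receiving the same gradient. The helper adagrad_step and the chosen values are assumptions for illustration.

```python
import math


def adagrad_step(param, grad, cache, learning_rate, epsilon=1e-8):
    # One Adagrad step for a single scalar parameter
    cache += grad ** 2
    effective_lr = learning_rate / (math.sqrt(cache) + epsilon)
    return param - effective_lr * grad, cache, effective_lr


param, cache = 0.0, 0.0
effective_lrs = []
for _ in range(4):
    param, cache, lr_t = adagrad_step(param, 1.0, cache, learning_rate=0.1)
    effective_lrs.append(lr_t)

print(effective_lrs)  # each entry is smaller than the one before
```

After four identical gradients the accumulated sum is 4, so the effective rate has already halved from 0.1 to about 0.05. On long training runs this steady decay can slow learning to a crawl.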


RMSProp

RMSProp modifies the learning rate for each parameter based on the recent magnitudes of the gradients for that parameter. Here's an example:

import numpy as np

def rmsprop_update(parameters, gradients, cache, learning_rate, decay_rate, epsilon=1e-8):
    # parameters, gradients, and cache are parallel lists, indexed together
    for i, grad in enumerate(gradients):
        cache[i] = decay_rate * cache[i] + (1 - decay_rate) * (grad ** 2)
        parameters[i] = parameters[i] - (learning_rate / (np.sqrt(cache[i]) + epsilon)) * grad

Here, decay_rate controls the rate of decay for the running average of the squared gradients, influencing how much of the past gradients are considered.
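
Because the cache is a decaying average and not an ever-growing sum, it levels off when the gradients stay the same size. The sketch below checks this for one scalar parameter. The helper rmsprop_step and the values used (gradient 1.0, learning rate 0.01, decay rate 0.9) are assumptions for illustration.

```python
import math


def rmsprop_step(param, grad, cache, learning_rate, decay_rate=0.9, epsilon=1e-8):
    # One RMSProp step for a single scalar parameter
    cache = decay_rate * cache + (1 - decay_rate) * grad ** 2
    step_size = learning_rate / (math.sqrt(cache) + epsilon)
    return param - step_size * grad, cache, step_size


param, cache = 0.0, 0.0
for _ in range(100):
    param, cache, step_size = rmsprop_step(param, 1.0, cache, learning_rate=0.01)

print(cache, step_size)  # cache settles near 1.0, step size near 0.01
```

The running average converges to the squared gradient magnitude, so the step size stabilizes and the optimizer keeps making progress late in training.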


Adadelta

Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, it restricts the window of accumulated past gradients by using a decaying average. Here's a basic implementation:

import numpy as np

def adadelta_update(parameters, gradients, cache, update_accumulated, decay_rate, epsilon=1e-8):
    # All four lists are parallel and indexed together
    for i, grad in enumerate(gradients):
        cache[i] = decay_rate * cache[i] + (1 - decay_rate) * (grad ** 2)
        update = -(np.sqrt(update_accumulated[i] + epsilon) / np.sqrt(cache[i] + epsilon)) * grad
        parameters[i] = parameters[i] + update
        update_accumulated[i] = decay_rate * update_accumulated[i] + (1 - decay_rate) * (update ** 2)

In this snippet, update_accumulated stores a decaying average of the squared updates, which takes the place of a hand-picked learning rate when scaling each parameter update.
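
Notice that the function takes no learning rate: the step size comes from the ratio of the two running averages. A small sketch for a single scalar parameter shows one effect of that ratio, namely that the very first step barely depends on how large the gradient is. The helper adadelta_step and the values decay_rate=0.95 and epsilon=1e-6 are assumptions chosen for this example.

```python
import math


def adadelta_step(param, grad, cache, acc_update, decay_rate=0.95, epsilon=1e-6):
    # One Adadelta step for a single scalar parameter
    cache = decay_rate * cache + (1 - decay_rate) * grad ** 2
    update = -(math.sqrt(acc_update + epsilon) / math.sqrt(cache + epsilon)) * grad
    acc_update = decay_rate * acc_update + (1 - decay_rate) * update ** 2
    return param + update, cache, acc_update


# Same starting point, gradients that differ by a factor of 100
p_small, _, _ = adadelta_step(0.0, 1.0, cache=0.0, acc_update=0.0)
p_large, _, _ = adadelta_step(0.0, 100.0, cache=0.0, acc_update=0.0)

print(p_small, p_large)  # nearly identical first steps
```

Dividing by the root of the averaged squared gradient cancels the gradient's scale, so both runs move by almost the same small amount, set mostly by epsilon on this first step.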


Adam

Adam combines the advantages of two other extensions of stochastic gradient descent, namely RMSProp and Momentum. Here's a simple version of Adam:

import numpy as np

def adam_update(parameters, gradients, m, v, t, learning_rate, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # parameters, gradients, m, and v are parallel lists; t counts update steps from 1
    for i, grad in enumerate(gradients):
        m[i] = beta1 * m[i] + (1 - beta1) * grad
        v[i] = beta2 * v[i] + (1 - beta2) * (grad ** 2)
        m_corrected = m[i] / (1 - beta1 ** t)
        v_corrected = v[i] / (1 - beta2 ** t)
        parameters[i] = parameters[i] - learning_rate * m_corrected / (np.sqrt(v_corrected) + epsilon)

In this code, m and v are lists (aligned with parameters and gradients) that store the running averages of the gradients and the squared gradients, respectively. t is the timestep, counted in update steps starting from 1 (not the epoch number). It is used to correct the bias in m and v, which both start at zero and would otherwise be too small during the first steps.
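
The bias correction is easiest to appreciate on the very first step (t = 1). In the sketch below, written for a single scalar parameter, the corrected averages recover the raw gradient and its square, so the first update has a size of about learning_rate whether the gradient is tiny or huge. The helper adam_step and the learning rate of 0.001 are assumptions for illustration.

```python
import math


def adam_step(param, grad, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    # One Adam step for a single scalar parameter
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # undo the pull toward the zero start
    v_hat = v / (1 - beta2 ** t)
    param = param - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return param, m, v


small, _, _ = adam_step(0.0, 0.01, m=0.0, v=0.0, t=1)
large, _, _ = adam_step(0.0, 100.0, m=0.0, v=0.0, t=1)

print(small, large)  # both close to -0.001
```

Without the correction, m would be only a tenth of the gradient and v a thousandth of its square on this step, so the early updates would be badly scaled.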

How Do Optimizers Work in Deep Learning?

To understand how optimizers work in deep learning, let's compare it to teaching a child to ride a bike. The child tries to balance and pedal, and based on how they wobble or fall, they adjust their movements to stay upright. Similarly, an optimizer adjusts a neural network's parameters to improve its predictions.

Here's a basic rundown:


Initialization

Just like setting the bike at the starting line, the neural network's parameters (weights and biases) are first initialized, typically at random. This happens before the optimizer takes its first step.

Forward Pass

The network makes a prediction, akin to the child's first attempt to ride towards a destination.

Calculate Loss

We measure how far off the prediction is from the actual answer, similar to observing how far the child veers off the path.

Backward Pass (Backpropagation)

This step involves understanding how each parameter contributed to the error, like figuring out what movements led the child to wobble.

Update Parameters

Based on this understanding, the optimizer adjusts the parameters slightly to reduce the error, just as the child might lean more to one side or pedal differently to maintain balance.

Optimizers differ in how they adjust the parameters. Some might make bold changes at first and then refine them, while others make cautious, incremental adjustments from the get-go.
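
The steps above can be sketched without any framework. The example below trains the smallest possible "network", a single weight w and bias b, with plain gradient descent. The data, the learning rate, the step count, and the hand-derived gradients are all assumptions chosen to keep the example short.

```python
import random

rng = random.Random(0)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Initialization: start from small random values
w = rng.uniform(-0.5, 0.5)
b = rng.uniform(-0.5, 0.5)
learning_rate = 0.05
losses = []

for step in range(2000):
    # Forward pass: make predictions with the current parameters
    preds = [w * x + b for x in xs]
    # Calculate loss: mean squared error
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    losses.append(loss)
    # Backward pass: gradients of the loss with respect to w and b
    grad_w = sum(2.0 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    grad_b = sum(2.0 * (p - y) for p, y in zip(preds, ys)) / len(xs)
    # Update parameters: plain gradient descent step
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b, losses[-1])  # w near 2, b near 0, loss near 0
```

Swapping in a different optimizer would only change the last two lines of the loop. The forward pass, loss, and backward pass stay the same.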

To illustrate, let's look at a simple code example using Python and TensorFlow. This code snippet shows how an optimizer updates a model's parameters during training:


import tensorflow as tf

# Sample model and optimizer
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)

# Loss function
loss_fn = tf.keras.losses.MeanSquaredError()

# Training data
x = tf.constant([[1.0], [2.0], [3.0], [4.0]])
y = tf.constant([[2.0], [4.0], [6.0], [8.0]])

# Training loop
for i in range(100):
    with tf.GradientTape() as tape:
        predictions = model(x)
        loss = loss_fn(y, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))



In this example, we:

  • Define a simple model and select Adam as our optimizer.
  • Use a mean squared error loss function to measure how far off our predictions are.
  • Run 100 training steps on our training data, allowing the optimizer to adjust the model's weights based on the calculated gradients, aiming to reduce the loss.
  • Each iteration is like the child's attempt at riding the bike, getting steadier with each try.

Frequently Asked Questions

What makes one optimizer better than another?

Different optimizers work well under different conditions. Some are faster but less precise, while others take longer but get closer to the best answer. It's about matching the optimizer to the task at hand.

Can I use the same optimizer for any deep learning project?

You can, but it might not be the best choice. Each project is unique, so picking an optimizer that fits your specific needs usually gives better results.

How do I know if an optimizer is working well?

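Watch your model's performance as it learns, in particular the loss on the training data and on held-out validation data. If the loss keeps falling and the model gets better at making predictions without taking too long, your optimizer is likely doing a good job. A loss that stalls early, jumps around wildly, or blows up usually points to a poorly chosen optimizer or learning rate.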
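
As a rough way to automate that check, the sketch below compares the average loss over the most recent steps with the average over the steps just before them. The helper name loss_trend, the window size, and the labels it returns are assumptions for illustration, not a standard API.

```python
def loss_trend(losses, window=5):
    # Compare the mean of the latest window with the window before it
    if len(losses) < 2 * window:
        return "not enough data"
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    if recent < previous:
        return "improving"
    return "stalled or diverging"


history = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
print(loss_trend(history))
```

Averaging over a window keeps a single noisy step from triggering a false alarm. A real training setup would typically apply the same idea to the validation loss as well.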


Conclusion

Optimizers in deep learning are the guiding forces that help our models learn from data and make accurate predictions. By understanding and choosing the right optimizer for your project, you can significantly improve your model's performance. Whether it's the steady pace of Gradient Descent, the quick adjustments of SGD, or the balanced approach of Adam, each optimizer has its place in the toolbox of a deep learning practitioner. Remember, the journey of training a neural network is a blend of science, art, and a bit of intuition, with optimizers playing a crucial role in navigating the complex landscape of machine learning.
