Code360 powered by Coding Ninjas
Last Updated: Mar 27, 2024

Adaptive Moment Estimation (ADAM)

Prerita Agarwal
Data Specialist @
23 Jul, 2024 @ 01:30 PM


While training a machine learning model, it is necessary to choose a good optimizer to update the weights. At the most basic level, the gradient descent algorithm is used to update the weights, but plain gradient descent is computationally expensive and not an efficient optimizer.

Several modifications have been made to the gradient descent algorithm, leading to better optimizers. Among these, ADAM (adaptive moment estimation) is widely regarded as one of the best general-purpose optimizers. To understand the ADAM optimizer, one must first know the gradient descent with momentum and RMS prop optimizers.

Let us briefly discuss gradient descent with momentum and RMS prop optimizers.

Gradient Descent with Momentum

The standard equation of gradient descent algorithm is:

w_t = w_{t-1} - α·(dL/dw_{t-1})

Here, w_t is the updated weight and w_{t-1} is the current weight.

α is the learning rate.

L is the loss function.

A lot of noise is produced while the weights converge to the local minima. To reduce this noise (i.e., to smooth the convergence), we introduce a term called momentum. With momentum, instead of using the derivative of the loss function with respect to w_{t-1} directly, we use its exponentially weighted average.

Weights and biases are updated in the following manner:

w_t = w_{t-1} - α·Vdw_t for weights.

b_t = b_{t-1} - α·Vdb_t for biases.

Here Vdw_t and Vdb_t are the exponentially weighted averages of the gradients.

Calculation of Vdw_t and Vdb_t:

Vdw_t = β·Vdw_{t-1} + (1 - β)·(dL/dw_{t-1}) for weights.

Vdb_t = β·Vdb_{t-1} + (1 - β)·(dL/db_{t-1}) for biases.

Here β is a hyperparameter; it specifies the weight given to the previously accumulated average. Generally, the value of β is 0.95.

(Initially, Vdw_t and Vdb_t are set to 0.)

The above equations smooth the convergence of the weights and biases.
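To see the smoothing concretely, here is a tiny self-contained sketch with a made-up sequence of noisy, sign-flipping gradients (the numbers are purely illustrative, not from the article):

```python
# Exponentially weighted average of a noisy gradient sequence.
# The raw gradients swing between roughly -1 and +1, but with
# beta = 0.95 the running average stays small and steady.
beta = 0.95
grads = [1.0, -0.8, 1.2, -0.9, 1.1, -1.0]   # made-up noisy gradients
v = 0.0
averaged = []
for g in grads:
    v = beta * v + (1 - beta) * g           # the Vdw_t update from above
    averaged.append(round(v, 4))
print(averaged)   # every value stays below 0.1 in magnitude
```

Because each new gradient contributes only a (1 - β) = 0.05 share, the sign flips largely cancel out instead of jerking the update back and forth.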

So, the final equations will be:

w_t = w_{t-1} - α·Vdw_t for weights.

b_t = b_{t-1} - α·Vdb_t for biases.
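Putting the two momentum equations together, here is a minimal sketch on a toy 1-D loss L(w) = (w - 3)^2, whose gradient is dL/dw = 2(w - 3); the loss and all constants are illustrative choices, not from the article:

```python
# Gradient descent with momentum on a toy 1-D loss L(w) = (w - 3)^2.
def momentum_descent(w0, alpha=0.1, beta=0.95, steps=500):
    grad = lambda w: 2.0 * (w - 3.0)          # dL/dw
    w, v = w0, 0.0                            # Vdw initialised to 0
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(w)   # Vdw_t, the weighted average
        w = w - alpha * v                     # w_t = w_{t-1} - alpha * Vdw_t
    return w

print(momentum_descent(0.0))   # ends very close to the minimum at w = 3
```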


RMS prop

In the standard gradient descent algorithm, the learning rate α is constant. This means that steps of the same size are taken to reach the minima (the convergence point).

What if we dynamically change the learning rate while training the model?

It will converge the weights to the global minima more quickly than the normal gradient descent method. Let us see how to dynamically update the learning rate α.

The method is the same as the gradient descent method, except for one hyperparameter.

Let us denote the new learning rate by α′. The new update rules can be written as:

w_t = w_{t-1} - α′·(dL/dw_{t-1}) for weights.

b_t = b_{t-1} - α′·(dL/db_{t-1}) for biases.

The value of α′ will be:

α′ = α / sqrt(Sdw_t + ε) for weights.

α′ = α / sqrt(Sdb_t + ε) for biases.

The values of Sdw_t and Sdb_t will be:

Sdw_t = β·Sdw_{t-1} + (1 - β)·(dL/dw_t)² for weights.

Sdb_t = β·Sdb_{t-1} + (1 - β)·(dL/db_t)² for biases.

Here β is a hyperparameter; it specifies the weight given to the previously accumulated average. Generally, the value of β is 0.95.

Using the above equations, we can dynamically change the model's learning rate (α).

So the final equations will be:

w_t = w_{t-1} - α′·(dL/dw_{t-1}) for weights.

b_t = b_{t-1} - α′·(dL/db_{t-1}) for biases.
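The RMS prop update can be sketched on the same kind of toy 1-D loss, L(w) = (w - 3)^2 (again a made-up example; the ε-inside-the-square-root convention matches the equations above):

```python
import math

# RMS prop on a toy 1-D loss L(w) = (w - 3)^2. Sdw tracks a weighted
# average of squared gradients, and the step is scaled by
# alpha' = alpha / sqrt(Sdw + eps).
def rmsprop_descent(w0, alpha=0.01, beta=0.95, eps=1e-8, steps=1000):
    grad = lambda w: 2.0 * (w - 3.0)           # dL/dw
    w, s = w0, 0.0                             # Sdw initialised to 0
    for _ in range(steps):
        g = grad(w)
        s = beta * s + (1 - beta) * g ** 2     # Sdw_t
        w = w - (alpha / math.sqrt(s + eps)) * g
    return w

print(rmsprop_descent(0.0))   # ends close to the minimum at w = 3
```

Note that with a constant α, RMS prop keeps taking steps of roughly size α near the minimum, which is exactly the residual noise at convergence that the article mentions below.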

ADAM optimizer

In the gradient descent with momentum optimizer, the learning rate is constant, whereas in the RMS prop optimizer there is a lot of noise at the time of convergence.

The ADAM (adaptive moment estimation) optimizer combines gradient descent with momentum and RMS prop. So, both smoothing and a dynamic learning rate are obtained.

Derivation of ADAM optimizer

Let us define four terms: Vdw, Vdb, Sdw, Sdb.

Initially, set all four variables to 0.

Calculate dL/dw and dL/db using the current mini-batch.

Vdw and Vdb are used for smoothing (to add momentum):

Vdw_t = β1·Vdw_{t-1} + (1 - β1)·(dL/dw_{t-1}) for weights.

Vdb_t = β1·Vdb_{t-1} + (1 - β1)·(dL/db_{t-1}) for biases.

Sdw and Sdb are used for the dynamic learning rate:

Sdw_t = β2·Sdw_{t-1} + (1 - β2)·(dL/dw_t)² for weights.

Sdb_t = β2·Sdb_{t-1} + (1 - β2)·(dL/db_t)² for biases.

Combining all the above equations, we can update w_t and b_t:

w_t = w_{t-1} - (α / sqrt(Sdw_t + ε))·Vdw_t for weights.

b_t = b_{t-1} - (α / sqrt(Sdb_t + ε))·Vdb_t for biases.

Later on, a term called bias correction was introduced. Here the values of Vdw, Vdb, Sdw, Sdb are corrected according to the formulas below:

Vdw_corrected = Vdw / (1 - β1^t)

Vdb_corrected = Vdb / (1 - β1^t)

Sdw_corrected = Sdw / (1 - β2^t)

Sdb_corrected = Sdb / (1 - β2^t)

Therefore, the new formulas will be:

w_t = w_{t-1} - (α / sqrt(Sdw_corrected + ε))·Vdw_corrected for weights.

b_t = b_{t-1} - (α / sqrt(Sdb_corrected + ε))·Vdb_corrected for biases.
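A tiny numeric check (with made-up numbers) shows why the correction matters: since Vdw starts at 0, the raw average underestimates the gradient in the early steps, and dividing by (1 - β1^t) restores its scale:

```python
# Bias correction of the first moment. With every gradient equal to 1.0,
# the raw average v starts far too small, but v / (1 - beta1**t)
# recovers the true gradient scale of 1.0 at every step.
beta1 = 0.9
v = 0.0
g = 1.0                                   # suppose every gradient is 1.0
for t in range(1, 4):
    v = beta1 * v + (1 - beta1) * g       # raw first moment
    v_corrected = v / (1 - beta1 ** t)    # bias-corrected estimate
    print(t, round(v, 4), round(v_corrected, 4))
# prints:
# 1 0.1 1.0
# 2 0.19 1.0
# 3 0.271 1.0
```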

Code for ADAM optimizer

import numpy as np

def adam(inits, X, Y, learning_rate=0.01, num_of_iter=10, b1=0.9, b2=0.999, epsilon=1e-6):
    n = len(X)
    a, b = inits
    # gradients of the squared error (y - (a*x + b))**2 w.r.t. a and b;
    # the lambdas read the current values of a and b on each call
    grad_a = lambda x, y: -2 * x * (y - (a * x + b))
    grad_b = lambda x, y: -2 * (y - (a * x + b))
    v_a, v_b = 0, 0      # first moments
    s_a, s_b = 0, 0      # second moments
    a_list, b_list = [a], [b]
    t = 1
    for _ in range(num_of_iter):
        for i in range(n):
            x_i, y_i = X[i], Y[i]
            g_a, g_b = grad_a(x_i, y_i), grad_b(x_i, y_i)
            # computing the first moment
            v_a = b1 * v_a + (1 - b1) * g_a
            v_b = b1 * v_b + (1 - b1) * g_b
            # computing the second moment
            s_a = b2 * s_a + (1 - b2) * (g_a ** 2)
            s_b = b2 * s_b + (1 - b2) * (g_b ** 2)
            # bias correction
            v_a_norm, v_b_norm = v_a / (1 - np.power(b1, t)), v_b / (1 - np.power(b1, t))
            s_a_norm, s_b_norm = s_a / (1 - np.power(b2, t)), s_b / (1 - np.power(b2, t))
            t += 1
            # scaled update steps
            g_a_norm = learning_rate * v_a_norm / (np.sqrt(s_a_norm) + epsilon)
            g_b_norm = learning_rate * v_b_norm / (np.sqrt(s_b_norm) + epsilon)
            # updating params and recording the history
            a -= g_a_norm
            b -= g_b_norm
            a_list.append(a)
            b_list.append(b)
    return a_list, b_list


Frequently Asked Questions

  1. What is ADAM in neural networks?
    Adam is an optimization algorithm for training neural networks that is computationally efficient, requires little memory, and is well suited to problems that are large in terms of data, parameters, or both.
  2. Which is the best optimizer to date?
    ADAM is widely regarded as one of the best general-purpose optimizers to date. It trains neural networks in less time and more efficiently.
  3. What is beta in ADAM optimizer?
    The hyperparameters β1 and β2 of Adam are the exponential decay rates used when estimating the first and second moments of the gradient; in the bias-correction terms they are raised to the power t, i.e., multiplied by themselves once per training step (batch).

Key Takeaways

In this article, we discussed the following topics:

  • Gradient descent with momentum optimizer
  • RMS prop optimizer
  • ADAM optimizer
  • Derivation and implementation of ADAM optimizer


Happy Coding!
