Code360 powered by Coding Ninjas
Last Updated: Mar 27, 2024

Adaptive Moment Estimation (ADAM)

Prerita Agarwal
Data Specialist @
23 Jul, 2024 @ 01:30 PM


While training a machine learning model, it is necessary to choose a good optimizer to update the weights. At the most basic level, the gradient descent algorithm is used to update the weights, but plain gradient descent is computationally expensive and not an efficient optimizer.

Several modifications have been made to the gradient descent algorithm, leading to better optimizers. Among these, ADAM (adaptive moment estimation) is widely regarded as one of the best general-purpose optimizers. To understand the ADAM optimizer, one must first know the gradient descent with momentum and RMS prop optimizers.

Let us briefly discuss gradient descent with momentum and RMS prop optimizers.

Gradient Descent with Momentum

The standard equation of gradient descent algorithm is:

w_t = w_{t-1} - α·(dL/dw_{t-1})

Here, w_t is the updated weight and w_{t-1} is the current weight.

α is the learning rate.

L is the loss function.

A lot of noise is produced while the weights converge to the local minima. To reduce this noise (i.e., to smooth the convergence), we introduce a term called momentum. With momentum, instead of using the derivative of the loss function with respect to w_{t-1} directly, we use its exponentially weighted average.

Weights and biases are updated in the following manner:

w_t = w_{t-1} - α·Vdw_t for weights.

b_t = b_{t-1} - α·Vdb_t for biases.

Here Vdw_t and Vdb_t are the exponentially weighted averages of the gradients.

Calculation of Vdw_t and Vdb_t:

Vdw_t = β·Vdw_{t-1} + (1 - β)·(dL/dw_{t-1}) for weights.

Vdb_t = β·Vdb_{t-1} + (1 - β)·(dL/db_{t-1}) for biases.

Here β is a hyperparameter; it specifies the weight given to the previously accumulated average. Generally, the value of β is 0.95.

(Initially, Vdw_t and Vdb_t are set to 0.)

The above equations smooth the convergence of the weights and biases.
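To see the smoothing concretely, here is a tiny self-contained sketch with a made-up sequence of noisy, sign-flipping gradients (the numbers are purely illustrative, not from the article):

```python
# Exponentially weighted average of a noisy gradient sequence.
# The raw gradients swing between roughly -1 and +1, but with
# beta = 0.95 the running average stays small and steady.
beta = 0.95
grads = [1.0, -0.8, 1.2, -0.9, 1.1, -1.0]   # made-up noisy gradients
v = 0.0
averaged = []
for g in grads:
    v = beta * v + (1 - beta) * g           # the Vdw_t update from above
    averaged.append(round(v, 4))
print(averaged)   # every value stays below 0.1 in magnitude
```

Because each new gradient contributes only a (1 - β) = 0.05 share, the sign flips largely cancel out instead of jerking the update back and forth.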

So, the final equations will be:

w_t = w_{t-1} - α·Vdw_t for weights.

b_t = b_{t-1} - α·Vdb_t for biases.
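Putting the two momentum equations together, here is a minimal sketch on a toy 1-D loss L(w) = (w - 3)^2, whose gradient is dL/dw = 2(w - 3); the loss and all constants are illustrative choices, not from the article:

```python
# Gradient descent with momentum on a toy 1-D loss L(w) = (w - 3)^2.
def momentum_descent(w0, alpha=0.1, beta=0.95, steps=500):
    grad = lambda w: 2.0 * (w - 3.0)          # dL/dw
    w, v = w0, 0.0                            # Vdw initialised to 0
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(w)   # Vdw_t, the weighted average
        w = w - alpha * v                     # w_t = w_{t-1} - alpha * Vdw_t
    return w

print(momentum_descent(0.0))   # ends very close to the minimum at w = 3
```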


RMS prop

In the standard gradient descent algorithm, the learning rate α is constant. This means that steps of the same size are taken to reach the minima (the convergence point).

What if we dynamically change the learning rate while training the model?

It will converge the weights to the global minima more quickly than the normal gradient descent method. Let us see how to dynamically update the learning rate α.

The method is the same as the gradient descent method, except for one hyperparameter.

Let us denote the new learning rate by α′. The new update rules can be written as:

w_t = w_{t-1} - α′·(dL/dw_{t-1}) for weights.

b_t = b_{t-1} - α′·(dL/db_{t-1}) for biases.

The value of α′ will be:

α′ = α / sqrt(Sdw_t + ε) for weights.

α′ = α / sqrt(Sdb_t + ε) for biases.

The values of Sdw_t and Sdb_t will be:

Sdw_t = β·Sdw_{t-1} + (1 - β)·(dL/dw_t)² for weights.

Sdb_t = β·Sdb_{t-1} + (1 - β)·(dL/db_t)² for biases.

Here β is a hyperparameter; it specifies the weight given to the previously accumulated average. Generally, the value of β is 0.95.

Using the above equations, we can dynamically change the model's learning rate (α).

So the final equations will be:

w_t = w_{t-1} - α′·(dL/dw_{t-1}) for weights.

b_t = b_{t-1} - α′·(dL/db_{t-1}) for biases.
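The RMS prop update can be sketched on the same kind of toy 1-D loss, L(w) = (w - 3)^2 (again a made-up example; the ε-inside-the-square-root convention matches the equations above):

```python
import math

# RMS prop on a toy 1-D loss L(w) = (w - 3)^2. Sdw tracks a weighted
# average of squared gradients, and the step is scaled by
# alpha' = alpha / sqrt(Sdw + eps).
def rmsprop_descent(w0, alpha=0.01, beta=0.95, eps=1e-8, steps=1000):
    grad = lambda w: 2.0 * (w - 3.0)           # dL/dw
    w, s = w0, 0.0                             # Sdw initialised to 0
    for _ in range(steps):
        g = grad(w)
        s = beta * s + (1 - beta) * g ** 2     # Sdw_t
        w = w - (alpha / math.sqrt(s + eps)) * g
    return w

print(rmsprop_descent(0.0))   # ends close to the minimum at w = 3
```

Note that with a constant α, RMS prop keeps taking steps of roughly size α near the minimum, which is exactly the residual noise at convergence that the article mentions below.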

ADAM optimizer

In the gradient descent with momentum optimizer, the learning rate is constant, whereas in the RMS prop optimizer there is a lot of noise at the time of convergence.

The ADAM (adaptive moment estimation) optimizer combines gradient descent with momentum and RMS prop. So, both smoothing and a dynamic learning rate are obtained.

Derivation of ADAM optimizer

Let us define four terms: Vdw, Vdb, Sdw, Sdb.

Initially, set all four variables to 0.

Calculate dL/dw and dL/db using the current mini-batch.

Vdw and Vdb are used for smoothing (to add momentum):

Vdw_t = β1·Vdw_{t-1} + (1 - β1)·(dL/dw_{t-1}) for weights.

Vdb_t = β1·Vdb_{t-1} + (1 - β1)·(dL/db_{t-1}) for biases.

Sdw and Sdb are used for the dynamic learning rate:

Sdw_t = β2·Sdw_{t-1} + (1 - β2)·(dL/dw_t)² for weights.

Sdb_t = β2·Sdb_{t-1} + (1 - β2)·(dL/db_t)² for biases.

Combining all the above equations, we can update w_t and b_t:

w_t = w_{t-1} - (α / sqrt(Sdw_t + ε))·Vdw_t for weights.

b_t = b_{t-1} - (α / sqrt(Sdb_t + ε))·Vdb_t for biases.

Later on, a term called bias correction was introduced. Here the values of Vdw, Vdb, Sdw, Sdb are corrected according to the formulas below:

Vdw_corrected = Vdw / (1 - β1^t)

Vdb_corrected = Vdb / (1 - β1^t)

Sdw_corrected = Sdw / (1 - β2^t)

Sdb_corrected = Sdb / (1 - β2^t)

Therefore, the new formulas will be:

w_t = w_{t-1} - (α / sqrt(Sdw_corrected + ε))·Vdw_corrected for weights.

b_t = b_{t-1} - (α / sqrt(Sdb_corrected + ε))·Vdb_corrected for biases.
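A tiny numeric check (with made-up numbers) shows why the correction matters: since Vdw starts at 0, the raw average underestimates the gradient in the early steps, and dividing by (1 - β1^t) restores its scale:

```python
# Bias correction of the first moment. With every gradient equal to 1.0,
# the raw average v starts far too small, but v / (1 - beta1**t)
# recovers the true gradient scale of 1.0 at every step.
beta1 = 0.9
v = 0.0
g = 1.0                                   # suppose every gradient is 1.0
for t in range(1, 4):
    v = beta1 * v + (1 - beta1) * g       # raw first moment
    v_corrected = v / (1 - beta1 ** t)    # bias-corrected estimate
    print(t, round(v, 4), round(v_corrected, 4))
# prints:
# 1 0.1 1.0
# 2 0.19 1.0
# 3 0.271 1.0
```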

Code for ADAM optimizer

import numpy as np

def adam(inits, X, Y, learning_rate=0.01, num_of_iter=10, b1=0.9, b2=0.999, epsilon=1e-6):
    n = len(X)
    a, b = inits
    # gradients of the squared error (y - (a*x + b))**2 w.r.t. a and b;
    # the lambdas read the current values of a and b on each call
    grad_a = lambda x, y: -2 * x * (y - (a * x + b))
    grad_b = lambda x, y: -2 * (y - (a * x + b))
    v_a, v_b = 0, 0      # first moments
    s_a, s_b = 0, 0      # second moments
    a_list, b_list = [a], [b]
    t = 1
    for _ in range(num_of_iter):
        for i in range(n):
            x_i, y_i = X[i], Y[i]
            g_a, g_b = grad_a(x_i, y_i), grad_b(x_i, y_i)
            # computing the first moment
            v_a = b1 * v_a + (1 - b1) * g_a
            v_b = b1 * v_b + (1 - b1) * g_b
            # computing the second moment
            s_a = b2 * s_a + (1 - b2) * (g_a ** 2)
            s_b = b2 * s_b + (1 - b2) * (g_b ** 2)
            # bias correction
            v_a_norm, v_b_norm = v_a / (1 - np.power(b1, t)), v_b / (1 - np.power(b1, t))
            s_a_norm, s_b_norm = s_a / (1 - np.power(b2, t)), s_b / (1 - np.power(b2, t))
            t += 1
            # scaled update steps
            g_a_norm = learning_rate * v_a_norm / (np.sqrt(s_a_norm) + epsilon)
            g_b_norm = learning_rate * v_b_norm / (np.sqrt(s_b_norm) + epsilon)
            # updating params and recording the history
            a -= g_a_norm
            b -= g_b_norm
            a_list.append(a)
            b_list.append(b)
    return a_list, b_list


Frequently Asked Questions

  1. What is ADAM in neural networks?
    Adam is an optimization algorithm for training neural networks that is computationally efficient, requires little memory, and is well suited to problems that are large in terms of data, parameters, or both.
  2. Which is the best optimizer to date?
    ADAM is widely regarded as one of the best general-purpose optimizers to date. It trains neural networks in less time and more efficiently.
  3. What is beta in ADAM optimizer?
    The hyperparameters β1 and β2 of Adam are the exponential decay rates used when estimating the first and second moments of the gradient; in the bias-correction terms they are raised to the power t, i.e., multiplied by themselves once per training step (batch).

Key Takeaways

In this article, we discussed the following topics:

  • Gradient descent with momentum optimizer
  • RMS prop optimizer
  • ADAM optimizer
  • Derivation and implementation of ADAM optimizer


Happy Coding!
