RMS prop
In the standard gradient descent algorithm, the learning rate 𝛂 is constant. This means steps of the same size are taken to reach the minimum (convergence point).
What if we dynamically change the learning rate while training the model?
The weights will then converge to the global minimum more quickly than with plain gradient descent. Let us see the method for dynamically updating the learning rate 𝛂.
The method is the same as gradient descent, except for one hyperparameter.
Let us denote the new learning rate by 𝛂‘. The new update rules can be written as:
w_{t} = w_{t-1} - 𝛂‘.dL/dw_{t-1} for weights.
b_{t} = b_{t-1} - 𝛂‘.dL/db_{t-1} for biases.
The value of 𝛂‘ will be:
𝛂‘ = 𝛂 / sqrt( Sdw + ɛ ) for weights.
𝛂‘ = 𝛂 / sqrt( Sdb + ɛ ) for biases.
The value of Sdw_{t} will be:
Sdw_{t} = 𝝱Sdw_{t-1} + (1 - 𝝱).(dL/dw_{t})^{2} for weights.
Sdb_{t} = 𝝱Sdb_{t-1} + (1 - 𝝱).(dL/db_{t})^{2} for biases.
Here 𝝱 is a hyperparameter; it controls how much importance is given to the history of squared gradients relative to the current one. Generally, the value of 𝝱 is 0.95.
Using the above equations, we can dynamically change the model's learning rate (𝛂).
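For intuition, here is a quick numeric check of the effective learning rate 𝛂‘ = 𝛂 / sqrt(Sdw + ɛ); the 𝛂 and Sdw values below are made-up illustrations:

```python
import math

alpha, eps = 0.01, 1e-8
# made-up accumulated squared gradients, from small to large
for s_dw in [0.01, 1.0, 100.0]:
    print(s_dw, alpha / math.sqrt(s_dw + eps))
```

A parameter that has recently seen large gradients automatically gets a smaller step, and one with small gradients gets a larger step.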
So the final equation will be:
w_{t} = w_{t-1} - 𝛂‘.dL/dw_{t-1} for weights.
b_{t} = b_{t-1} - 𝛂‘.dL/db_{t-1} for biases.
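The update rules above can be sketched as a small Python function; the hyperparameter defaults and the starting values in the example call are illustrative assumptions:

```python
import numpy as np

def rmsprop_step(w, b, dL_dw, dL_db, s_w, s_b,
                 alpha=0.01, beta=0.95, eps=1e-8):
    """One RMS prop update for a scalar weight and bias."""
    # exponentially weighted average of the squared gradients
    s_w = beta * s_w + (1 - beta) * dL_dw ** 2
    s_b = beta * s_b + (1 - beta) * dL_db ** 2
    # dynamic learning rate: alpha / sqrt(S + eps)
    w = w - alpha / np.sqrt(s_w + eps) * dL_dw
    b = b - alpha / np.sqrt(s_b + eps) * dL_db
    return w, b, s_w, s_b

# one step from w=1.0, b=0.0 with gradients dL/dw=2.0, dL/db=1.0 (made up)
w, b, s_w, s_b = rmsprop_step(1.0, 0.0, 2.0, 1.0, 0.0, 0.0)
```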
ADAM optimizer
In gradient descent with momentum, the learning rate stays constant, whereas the RMS prop optimizer still shows a lot of noise near convergence.
The ADAM (adaptive moment estimation) optimizer combines gradient descent with momentum and RMS prop, so we obtain both smoothing and a dynamic learning rate.
Derivation of ADAM optimizer
Let us define 4 terms Vdw, Vdb, Sdw, Sdb.
Initially, set the values of the four variables to 0.
Calculate dL/dw and dL/db using the current minibatch.
Vdw and Vdb are used for smoothing (to add momentum).
Vdw_{t} = 𝝱_{1}Vdw_{t-1} + (1 - 𝝱_{1}).dL/dw_{t} for weights.
Vdb_{t} = 𝝱_{1}Vdb_{t-1} + (1 - 𝝱_{1}).dL/db_{t} for biases.
Sdw and Sdb are used for the dynamic learning rate.
Sdw_{t} = 𝝱_{2}Sdw_{t-1} + (1 - 𝝱_{2}).(dL/dw_{t})^{2} for weights.
Sdb_{t} = 𝝱_{2}Sdb_{t-1} + (1 - 𝝱_{2}).(dL/db_{t})^{2} for biases.
Combining all the above equations, we can compute w_{t} and b_{t}.
w_{t} = w_{t-1} - (𝛂/sqrt( Sdw + ɛ )).Vdw for weights.
b_{t} = b_{t-1} - (𝛂/sqrt( Sdb + ɛ )).Vdb for biases.
Later, a refinement called bias correction was introduced. The values of Vdw, Vdb, Sdw, and Sdb are adjusted according to the formulas below:
Vdw^{correction} = Vdw / (1 - 𝝱_{1}^{t})
Vdb^{correction} = Vdb / (1 - 𝝱_{1}^{t})
Sdw^{correction} = Sdw / (1 - 𝝱_{2}^{t})
Sdb^{correction} = Sdb / (1 - 𝝱_{2}^{t})
Therefore, the new formula will be:
w_{t} = w_{t-1} - (𝛂/sqrt( Sdw^{correction} + ɛ )).Vdw^{correction} for weights.
b_{t} = b_{t-1} - (𝛂/sqrt( Sdb^{correction} + ɛ )).Vdb^{correction} for biases.
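To see why the correction matters, look at the very first step (t = 1): with 𝝱_{1} = 0.9, Vdw is only (1 - 𝝱_{1}) = 0.1 of the first gradient, badly biased toward its zero initialisation; dividing by (1 - 𝝱_{1}^{t}) undoes this. A small numeric check with a made-up gradient:

```python
beta1, beta2, t = 0.9, 0.999, 1
g = 2.0                          # assumed first gradient
v = (1 - beta1) * g              # 0.2, biased toward 0
s = (1 - beta2) * g ** 2         # 0.004, biased toward 0
v_corr = v / (1 - beta1 ** t)    # recovers 2.0
s_corr = s / (1 - beta2 ** t)    # recovers 4.0
```

As t grows, 𝝱^{t} shrinks toward 0, so the correction fades away and the corrected and uncorrected values coincide.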
Code for ADAM optimizer
import numpy as np

def adam(inits, X, Y, learning_rate=0.01, num_of_iter=10, b1=0.9, b2=0.999, Epsilon=1e-6):
    """Fit y = a*x + b by minimising squared error with the ADAM optimizer."""
    n = len(X)
    a, b = inits
    # gradients of L = (y - (a*x + b))^2 with respect to a and b
    grad_a = lambda x, y: -2 * x * (y - (a * x + b))
    grad_b = lambda x, y: -2 * (y - (a * x + b))
    v_a, v_b = 0, 0   # first moments (momentum)
    s_a, s_b = 0, 0   # second moments (squared gradients)
    a_list, b_list = [a], [b]
    t = 1
    for _ in range(num_of_iter):
        for i in range(n):
            x_i, y_i = X[i], Y[i]
            g_a, g_b = grad_a(x_i, y_i), grad_b(x_i, y_i)
            # computing the first moment
            v_a = b1 * v_a + (1 - b1) * g_a
            v_b = b1 * v_b + (1 - b1) * g_b
            # computing the second moment
            s_a = b2 * s_a + (1 - b2) * (g_a ** 2)
            s_b = b2 * s_b + (1 - b2) * (g_b ** 2)
            # bias correction
            v_a_norm, v_b_norm = v_a / (1 - np.power(b1, t)), v_b / (1 - np.power(b1, t))
            s_a_norm, s_b_norm = s_a / (1 - np.power(b2, t)), s_b / (1 - np.power(b2, t))
            t += 1
            # computing the update step
            g_a_norm = learning_rate * v_a_norm / (np.sqrt(s_a_norm) + Epsilon)
            g_b_norm = learning_rate * v_b_norm / (np.sqrt(s_b_norm) + Epsilon)
            # updating the parameters
            a -= g_a_norm
            b -= g_b_norm
            a_list.append(a)
            b_list.append(b)
    return a_list, b_list
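As a sanity check on the update rule itself, here is a self-contained run of ADAM on the 1-D quadratic L(w) = (w - 5)^2, whose minimum is at w = 5; the quadratic and the hyperparameter values are illustrative choices, not part of the original code:

```python
import numpy as np

w = 0.0
v, s = 0.0, 0.0
alpha, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 2001):
    g = 2 * (w - 5)                    # dL/dw for L = (w - 5)^2
    v = b1 * v + (1 - b1) * g          # first moment
    s = b2 * s + (1 - b2) * g ** 2     # second moment
    v_hat = v / (1 - b1 ** t)          # bias correction
    s_hat = s / (1 - b2 ** t)
    w -= alpha * v_hat / (np.sqrt(s_hat) + eps)
print(w)  # close to 5
```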
FAQs

What is ADAM in neural networks?
Adam is an optimization algorithm for training neural networks that is computationally efficient, requires little memory, and is well suited to problems that are large in terms of data, parameters, or both.

Which is the best optimizer to date?
ADAM is widely regarded as one of the best general-purpose optimizers to date. It usually trains a neural network in less time and more reliably than plain gradient descent.

What is beta in ADAM optimizer?
The hyperparameters β1 and β2 of Adam are the exponential decay rates used when estimating the first and second moments of the gradient; the weight given to past gradients decays exponentially (by another factor of β) after each training step (batch).
Key Takeaways
In this article, we discussed the following topics:
- Gradient descent with momentum optimizer
- RMS prop optimizer
- ADAM optimizer
- Derivation and implementation of ADAM optimizer
Happy Coding!