Table of contents
1. Introduction
2. Reason Behind Using RMSProp
3. RMSProp
4. Parameter Updates
5. Ideal Values of Alpha and Beta
6. Working Of RMSProp
7. Implementation
8. Frequently Asked Questions
9. Key Takeaways
Last Updated: Mar 27, 2024

RMSProp

Author Mayank Goyal

Introduction

Optimization is the mathematical process of finding the best solution to a problem. In deep learning (especially in neural networks), optimization algorithms minimize an objective function such as a loss function, which measures the difference between the model's predictions and the expected values.

Optimization algorithms produce better results by iteratively updating the model parameters, such as the weights (W) and biases (b). The most important of them is Gradient Descent, whose objective is to reach the global minimum, where the cost function attains its least possible value. Gradient Descent uses the gradient of the loss function to navigate the search space.

Another variant is Stochastic Gradient Descent (SGD), where the learning rate is fixed and shared by all parameters throughout training. This fixed rate results in slow convergence compared to adaptive methods like RMSProp, where each parameter effectively gets its own learning rate that varies over time. For a well-chosen learning rate, the algorithm converges to a minimum of the error function even if the input samples are not linearly separable.
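For illustration, a single vanilla gradient descent step with a fixed learning rate can be sketched in a few lines of Python (the parameter values and gradients below are made up):

import numpy as np

# Toy parameters and gradients (illustrative values only)
W = np.array([0.5, -1.2])   # weights
b = 0.1                     # bias
dW = np.array([0.3, -0.4])  # gradient of the loss with respect to W
db = 0.05                   # gradient of the loss with respect to b

learning_rate = 0.01        # fixed for every parameter and every step of training

# One gradient descent step: move against the gradient
W = W - learning_rate * dW
b = b - learning_rate * db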

Reason Behind Using RMSProp

Firstly, let us look at the contour plotting of a cost function:

[Figure: contour plot of the cost function]

At the start of training, the cost is quite high (point A). From there, we compute the negative gradient and take a step in the direction it specifies, repeating until we reach the global minimum (the red point).

Starting gradient descent from point A, one iteration may land us at point B, and another step may take us to point C. It therefore takes many steps to slowly move towards the minimum.

But what is the reason for such a haphazard motion? It is the large number of local optima that arise in high dimensions (in reality, the cost function depends on many weights, which increases the dimensionality).

When optimizing parameters in many dimensions, there are a lot of local optima (one in each dimension). So it is easier for Gradient Descent to get stuck in a local optimum than to move to the global optimum. The algorithm therefore keeps moving from one local optimum to another, which delays reaching the global optimum and slows down Gradient Descent.

To overcome this, we want slower learning in the vertical direction and faster learning in the horizontal direction. So here is what we can do: use RMSProp. RMSProp uses the concept of the Exponentially Weighted Average (EWA) of the gradients.

RMSProp

RMSProp stands for Root Mean Square Propagation. RMSprop is an optimization technique used in training neural networks. It was first proposed by Geoffrey Hinton, often called the father of back-propagation.

The gradients of complex functions like neural networks tend to explode or vanish as the data propagates through the network (known as the exploding and vanishing gradients problems). RMSProp was developed as a stochastic technique for mini-batch learning.

RMSprop deals with this issue by using a moving average of squared gradients to normalize the gradient. This normalization balances the step size, decreasing the step for large gradients to avoid exploding and increasing the step for small gradients to avoid vanishing.

In simple terms, RMSprop uses an adaptive learning rate instead of treating the learning rate as a fixed hyperparameter, so the effective learning rate varies over time. RMSprop relies on the same concept of an exponentially weighted average as gradient descent with momentum, but applies it to the squared gradients, and the parameter update is different.

 

Parameter Updates

The recursive formula of the EWA is given by:

V_t = β * V_(t-1) + (1 - β) * θ_t

Where,

V_t: the moving average value at step t,
V_(t-1): the moving average value at the previous step,
θ_t: the current value (here, the gradient) at step t,
β: the smoothing factor that decides how much weight the past values receive.
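As a quick sketch, this recursion takes only a few lines of Python (the sequence of values below is hypothetical):

beta = 0.9                        # smoothing factor
values = [4.0, 5.0, 3.0, 6.0]     # hypothetical sequence of gradient values

V = 0.0                           # V_0
for theta in values:
    V = beta * V + (1 - beta) * theta   # V_t = beta * V_(t-1) + (1 - beta) * theta_t
    print(V)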

In RMSProp, we find the EWA of the squared gradients and update our parameters using those averages. On each iteration t, we calculate dW and db on the current mini-batch. Then we calculate v_dW and v_db using the following formulae:

v_dW = β * v_dW + (1 - β) * dW²

v_db = β * v_db + (1 - β) * db²

We will update our parameters after calculating the exponentially weighted averages. Plain gradient descent would update them as:

W = W - learning_rate * dW

b = b - learning_rate * db

In RMSProp, we instead divide each gradient by the square root of its exponentially weighted average, with a small ε added for numerical stability. Putting v_dW and v_db into the respective equations, we get:

W = W - learning_rate * dW / (sqrt(v_dW) + ε)

b = b - learning_rate * db / (sqrt(v_db) + ε)

The averaging diminishes the vertical movement because the oscillating vertical components approximately sum to zero when averaged, while the horizontal components keep pointing the same way.
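Putting the averaging and update equations together, a single RMSProp step could be sketched as follows (the function name and the ε default are illustrative choices, not a fixed convention):

import numpy as np

def rmsprop_step(W, b, dW, db, v_dW, v_db,
                 learning_rate=0.001, beta=0.9, epsilon=1e-8):
    # Exponentially weighted averages of the squared gradients
    v_dW = beta * v_dW + (1 - beta) * dW ** 2
    v_db = beta * v_db + (1 - beta) * db ** 2

    # Divide each gradient by the root of its average before stepping
    W = W - learning_rate * dW / (np.sqrt(v_dW) + epsilon)
    b = b - learning_rate * db / (np.sqrt(v_db) + epsilon)
    return W, b, v_dW, v_db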

Ideal Values of Alpha and Beta

The suggested value of β is 0.9, which has been observed experimentally. If we keep β lower than 0.9, the values of v_dW and v_db fluctuate heavily. On the other hand, keeping β higher than 0.9 does not give a proper average.

The suggested value of alpha is 0.001. Keeping alpha lower than 0.001 makes learning slower, whereas keeping it higher than 0.001 increases the chance of overshooting the minimum.
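For example, if you happen to use a framework such as Keras, these suggested values correspond to the learning_rate and rho arguments of its RMSprop optimizer:

from tensorflow import keras

# Suggested defaults: alpha (learning_rate) = 0.001 and beta (rho) = 0.9
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)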

Working Of RMSProp

We start gradient descent from point A. After one iteration, we may end up at point B, and another step may take us to point C. After some iterations, we step towards the optimum while oscillating up and down. A higher learning rate would make the vertical oscillations larger and slow our gradient descent, so we are prevented from using a higher learning rate.

In this picture, the bias determines the vertical oscillations, whereas the weight defines the movement in the horizontal direction. If we slow down the updates for the bias, the vertical oscillations are dampened, and if we update the weights with larger values, we can still move quickly towards the optimum.

We need to slow down the learning in the vertical direction and boost, or at least not slow down, the learning in the horizontal direction; i.e., we want to move fast horizontally while damping out the vertical oscillations.

The derivative db in the vertical direction is much larger than dW in the horizontal direction. As a result, v_db will be large and v_dW will be small. Since each gradient is divided by the square root of its average, the updates to b are damped, slowing movement in the vertical direction, while the updates to W stay relatively large.

The net effect of RMSProp is that movement in the vertical direction is reduced while movement in the horizontal direction stays fast, so we reach the minimum quickly and the parameters are learned much faster than in the earlier cases.
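A tiny numeric sketch (with made-up gradient values) shows this damping effect: the large vertical gradient db is divided by a large root-mean-square term, while the small horizontal gradient dW is divided by a small one, so the effective steps end up comparable:

import numpy as np

beta, epsilon, lr = 0.9, 1e-8, 0.01
dW, db = 0.1, 2.0             # hypothetical: small horizontal, large vertical gradient

v_dW = (1 - beta) * dW ** 2   # accumulators after one step, starting from zero
v_db = (1 - beta) * db ** 2

step_W = lr * dW / (np.sqrt(v_dW) + epsilon)
step_b = lr * db / (np.sqrt(v_db) + epsilon)
print(step_W, step_b)         # both steps come out roughly equal (about lr / sqrt(1 - beta))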

Implementation

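As a minimal sketch, RMSProp can be implemented from scratch with NumPy and applied to a toy linear-regression problem; the synthetic data, hyperparameters, and variable names below are purely illustrative:

import numpy as np

# Synthetic data: y = 3x + 2 with a little noise (illustrative only)
np.random.seed(0)
X = np.random.rand(200, 1)
y = 3 * X + 2 + 0.1 * np.random.randn(200, 1)

# Parameters
W = np.zeros((1, 1))
b = 0.0

# RMSProp hyperparameters and accumulators
learning_rate, beta, epsilon = 0.001, 0.9, 1e-8
v_dW, v_db = np.zeros_like(W), 0.0

for epoch in range(5000):
    # Forward pass and mean squared error loss
    y_pred = X @ W + b
    error = y_pred - y
    loss = np.mean(error ** 2)

    # Gradients of the loss with respect to W and b
    dW = 2 * X.T @ error / len(X)
    db = 2 * np.mean(error)

    # RMSProp: exponentially weighted averages of the squared gradients
    v_dW = beta * v_dW + (1 - beta) * dW ** 2
    v_db = beta * v_db + (1 - beta) * db ** 2

    # Parameter updates scaled by the root of the averages
    W -= learning_rate * dW / (np.sqrt(v_dW) + epsilon)
    b -= learning_rate * db / (np.sqrt(v_db) + epsilon)

    if epoch % 1000 == 0:
        print(f"epoch {epoch}: loss = {loss:.4f}")

print("learned W:", W.ravel(), "learned b:", b)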

Frequently Asked Questions

  1. Is RMSProp stochastic gradient descent?
    RMSProp belongs to the family of adaptive learning rate methods, which have grown in popularity in recent years. It is an extension of the Stochastic Gradient Descent (SGD) algorithm and the momentum method, and it forms part of the foundation of the Adam algorithm.
     
  2. Is Adam better than RMSProp?
    Currently, Adam is a little bit better than RMSProp. However, we can also try using SGD with Nesterov Momentum as an alternative.
     
  3. What is the reason for using an RMSProp optimizer?
    The RMSprop optimizer minimizes the oscillations in the vertical direction. As a result, we can increase the learning rate, and the algorithm can take larger steps in the horizontal direction, converging faster.
     
  4. Who invented RMSprop?
    RMSprop is an unpublished optimization algorithm designed for neural networks, first proposed by Geoff Hinton.
     
  5. What is the difference between RMSprop and Adam?
    While momentum accelerates our search in the direction of the minima, RMSProp impedes the search in the direction of oscillations. Adam combines the heuristics of both Momentum and RMSProp.

Key Takeaways

Let us briefly summarize the article.

Firstly, we saw the reason behind using RMSProp and the need for it, and then studied RMSProp in detail. We saw how the parameters get updated and how RMSProp works. Lastly, we looked at the ideal values of alpha and beta and ended the article with a Python implementation.

RMSprop is a good, fast, and very popular optimizer. It learns properly and tunes the parameters to minimize the cost function. Root Mean Square Propagation (RMSProp) is one of the most popular optimization algorithms used in deep learning; only Adam surpasses it in popularity. RMSProp is therefore an excellent alternative to plain Gradient Descent.

I hope you all like this article.

Happy Learning Ninjas!
