**Introduction**

Optimization is the mathematical technique of finding the best solution to a problem. In deep learning (especially in neural networks), optimization algorithms minimize an objective function, typically a loss function that measures the difference between the model's predictions and the expected values.

Optimization algorithms improve the model by iteratively updating its parameters, such as the weights (W) and biases (b). The most fundamental of these is Gradient Descent. Its objective is to reach the global minimum, the point where the cost function attains its least possible value. Gradient descent computes the gradient of the loss function and uses it to navigate the search space.
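The update rule described above can be sketched on a toy one-dimensional cost function. The quadratic cost and the learning rate below are illustrative assumptions, not part of the original text:

```python
# Minimal gradient-descent sketch on a toy quadratic cost
# J(w) = (w - 3)^2, whose gradient is dJ/dw = 2 * (w - 3).
def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0              # initial parameter value (assumed)
learning_rate = 0.1  # step size (assumed)
for _ in range(100):
    w -= learning_rate * gradient(w)  # step opposite the gradient

print(w)  # approaches the minimum at w = 3
```

Each iteration moves the parameter a small step against the gradient, so the cost decreases until the minimum is reached.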

Another is Stochastic Gradient Descent (SGD), which uses a single fixed learning rate for every parameter. This can result in slow convergence, which motivates adaptive methods like RMSprop, where each parameter effectively gets its own learning rate. With a well-chosen learning rate, SGD is guaranteed to converge to a minimum of the error function even when the input samples are not linearly separable.
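To make the "single fixed learning rate" point concrete, here is a minimal SGD sketch on a toy linear-regression problem. The data, model, and learning rate are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2*x + small noise (assumed for illustration).
X = rng.normal(size=200)
y = 2.0 * X + 0.1 * rng.normal(size=200)

w = 0.0
learning_rate = 0.05  # one fixed rate, shared by every update
for epoch in range(5):
    for i in rng.permutation(len(X)):          # one sample at a time (stochastic)
        grad = 2.0 * (w * X[i] - y[i]) * X[i]  # gradient of (w*x - y)^2
        w -= learning_rate * grad              # same rate for every step

print(w)  # approaches the true slope of 2
```

Note that the same `learning_rate` scales every update; adaptive methods instead rescale each parameter's step individually.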

**Reason Behind Using RMSProp**

Firstly, let us look at the contour plot of a cost function:

At the start of training, our cost will be quite high (point A). From there, we compute the negative gradient and take a step in the direction it specifies, repeating until we reach the global minimum (the red point).

When we start gradient descent from point A, one iteration may take us to point B, and another step may take us to point C. It therefore takes many steps to move slowly towards the minimum.

But what is the reason for such a haphazard motion? The cost function has many local optima because of its high dimensionality (in practice, the cost depends on a large number of weights, each of which adds a dimension).

When optimizing parameters in many dimensions, there are many local optima. It is therefore easier for Gradient Descent to get stuck in a local optimum than to move towards the global optimum. The algorithm keeps moving from one local optimum to another, delaying its arrival at the global optimum, and this slows down Gradient Descent.

To overcome this, we want slower learning in the vertical direction (where the gradients oscillate) and faster learning in the horizontal direction (towards the minimum). This is what RMSProp provides: it uses the concept of an exponentially weighted average (EWA) of the squared gradients to rescale each parameter's step.