Table of contents
1. Introduction
2. AdaGrad
3. AdaDelta
4. Frequently Asked Questions
5. Key Takeaways
Last Updated: Mar 27, 2024

AdaDelta

Author Rajkeshav

Introduction

Let’s understand the situation with the help of an example. In this example data set, we have four input columns X1, X2, X3, and X4, and Y is the output column.


In this data set, we can see that the X1 column is sparse: most of its data points are zero, and very few are non-zero. All the remaining columns are dense; that is, most of their values are non-zero.

If we apply multiple regression to this data set, we must identify a hyperplane that gives the minimum loss.

The hyperplane is

ŷ = w1·x1 + w2·x2 + w3·x3 + w4·x4 + b

This is the plane we will identify that gives the minimum loss. We randomly select some initial weights and update them using this equation:

w1 = w1 − η · (∂L/∂w1)

For understanding purposes, we are updating only the weight w1; all the other weights w2, w3, w4, and so on are updated in the same way. Here η is called the learning rate, and L is the loss function. For a greater understanding of the equation, I am dropping here a link that will redirect you to implementing the Gradient Descent algorithm. In multiple regression, the loss function is the mean squared loss, and its partial derivative with respect to w1 is

∂L/∂w1 = −(2/n) · Σ (yi − ŷi) · x1i

where the sum runs over the n data points.

Now take a look at the data set. Most of the values in X1 are zero. If x1 is zero, the derivative with respect to w1 is also zero, so the weight w1 does not get updated. That is the main problem with sparse data. When we discussed the Gradient Descent algorithm, we randomly selected some weights, and the algorithm kept updating them, moving step by step toward the minimum point.
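To make this concrete, here is a small NumPy sketch of plain gradient descent on a multiple regression problem with one sparse column. The data, learning rate, and number of steps are made up for illustration and are not taken from the data set above; the point is only that the weight of the sparse column barely moves while the dense columns’ weights converge.

```python
import numpy as np

# Toy data (made up for illustration): column 0 plays the role of sparse X1.
rng = np.random.default_rng(0)
X = rng.uniform(1, 5, size=(100, 4))
X[:, 0] = 0.0                                        # make X1 all zeros...
X[rng.choice(100, size=5, replace=False), 0] = 4.0   # ...except a handful of rows
true_w = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_w + rng.normal(0, 0.1, size=100)

w = np.zeros(4)      # initial weights
eta = 0.01           # learning rate

for _ in range(200):
    y_hat = X @ w
    # Mean squared loss gradient: dL/dw_j = -(2/n) * sum((y - y_hat) * x_j)
    grad = -(2 / len(y)) * (X.T @ (y - y_hat))
    w -= eta * grad                                  # w_j = w_j - eta * dL/dw_j

print(w)   # the weight for the sparse column moves far less than the dense ones
```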

[Figure: 2D plot of gradient descent converging to the minimum with progressively smaller steps. Source: medium.com]

What we can observe in this 2D graph is that the steps are long at first, and as the path approaches the local minimum, the steps become smaller. Taking smaller steps close to the minimum gives a higher chance of landing on the actual minimum point. The learning rate determines the step size: if the learning rate is large, the steps taken are large, and vice versa.
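Here is a tiny sketch of that behaviour on the one-dimensional loss L(w) = w²; the two learning rates are chosen just for illustration. The small one approaches the minimum steadily, while the large one overshoots it and oscillates.

```python
# Effect of the learning rate on step size, on the toy loss L(w) = w^2
def gradient_descent(eta, steps=10, w=5.0):
    path = [w]
    for _ in range(steps):
        grad = 2 * w            # dL/dw for L(w) = w^2
        w = w - eta * grad      # a larger eta means a larger step
        path.append(round(w, 3))
    return path

print(gradient_descent(eta=0.1))   # small steps: slow but steady approach to the minimum at 0
print(gradient_descent(eta=0.9))   # large steps: overshoots the minimum and oscillates
```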

 

AdaGrad

Now coming to neural networks, we are updating lots of weights. If some of the columns are sparse and some are dense, then some of the dimensions update quickly and some do not. To avoid this problem, we use the AdaGrad optimizer. The idea is twofold: first, we use a different learning rate for each weight; second, each learning rate decreases based on the previous updates of that weight.

AdaGrad equation:

w1(t) = w1(t−1) − η′t · (∂L/∂w1)

where the per-dimension learning rate is

η′t = η / √(Kt + ε)

Here η′t is the learning rate for this particular dimension at step t; every dimension gets its own. I am showing the update for only one dimension; the rest follow the same equation.

The value Kt depends on the previous gradients: it is the sum of the squared gradients of all earlier steps for that dimension, Kt = g1² + g2² + … + gt². Why are we considering previous gradients? Because the update is driven by the gradient: a zero gradient means nothing needed updating, while every non-zero gradient makes Kt grow and therefore reduces the learning rate for that dimension.
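Putting the pieces together, here is a minimal NumPy sketch of the AdaGrad update described above. The learning rate, epsilon, and the toy gradients are assumed values for illustration only.

```python
import numpy as np

def adagrad_update(w, grad, K, eta=0.1, eps=1e-8):
    """One AdaGrad step with a per-dimension learning rate eta / sqrt(K + eps).

    K accumulates the squared gradients of every previous step, so dimensions
    that have been updated often get a smaller effective learning rate.
    """
    K = K + grad ** 2                          # K_t = K_{t-1} + g_t^2 (per dimension)
    w = w - (eta / np.sqrt(K + eps)) * grad    # w = w - eta'_t * g_t
    return w, K

# Usage on made-up gradients: dimension 0 behaves like a sparse column (gradient usually 0).
w, K = np.zeros(2), np.zeros(2)
for g in [np.array([0.0, 1.0]), np.array([0.0, 0.8]), np.array([2.0, 1.1])]:
    w, K = adagrad_update(w, g, K)
print(w, K)
```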

 

AdaDelta

The drawback of the AdaGrad optimizer is that Kt is the raw sum of the squared gradients of every past step. As the number of updates increases, Kt keeps growing. When Kt becomes very large, the denominator √(Kt + ε) becomes very large, so the effective learning rate η′t becomes tiny and the weights converge very slowly. To avoid this, we must not let Kt grow without bound. That is the idea of the AdaDelta optimizer.

The AdaDelta optimizer uses the concept of the Exponentially Weighted Average. The equation is

E[g²]t = γ · E[g²]t−1 + (1 − γ) · gt²

and the update uses this running average in place of Kt:

w1(t) = w1(t−1) − (η / √(E[g²]t + ε)) · (∂L/∂w1)

In an exponentially weighted average, older gradients are discounted by the factor γ at every step, so effectively only the most recent gradients are considered; not all past gradients count equally, and the accumulated value cannot grow without bound. In this way, the AdaDelta optimizer removes the drawback of the AdaGrad optimizer.
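Here is a minimal NumPy sketch of the accumulation rule described above, an exponentially weighted average of squared gradients. The values of γ, η, ε, and the toy gradients are assumptions for illustration, not prescribed by the article.

```python
import numpy as np

def ewa_update(w, grad, Eg2, eta=0.1, gamma=0.9, eps=1e-8):
    """One step using an exponentially weighted average of squared gradients.

    Unlike AdaGrad's ever-growing sum K, E[g^2] decays old gradients by gamma,
    so the denominator cannot grow without bound and learning does not stall.
    """
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2    # E[g^2]_t = gamma*E[g^2]_{t-1} + (1-gamma)*g_t^2
    w = w - (eta / np.sqrt(Eg2 + eps)) * grad
    return w, Eg2

# Usage on made-up gradients
w, Eg2 = np.zeros(2), np.zeros(2)
for g in [np.array([0.5, 1.0]), np.array([0.4, 0.9]), np.array([0.6, 1.1])]:
    w, Eg2 = ewa_update(w, g, Eg2)
print(w, Eg2)
```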

Frequently Asked Questions

  1. What is the significance of the learning rate used in the Gradient descent algorithm?
    In the Gradient descent algorithm, the learning rate is used to determine the size of steps taken to reach the local minimum point of the loss function.
     
  2. What do we understand by Gradient?
A gradient is a measure of the slope of a line. Gradients can point uphill or downhill: uphill gradients take positive values, and downhill gradients take negative values.
     
  3. What are the advantages of the Adagrad Optimizer?
    1. Manual tuning of learning rate is not required
    2. Convergence is faster 
     
  4. What are the types of Gradient Descent?
    1. Batch Gradient Descent
    2. Stochastic Gradient Descent
    3. Mini batch Gradient Descent
     
  5. What is the disadvantage of the AdaDelta optimizer?
    The implementation of AdaDelta Optimizer is computationally more expensive than that of AdaGrad.

Key Takeaways 

Apart from AdaDelta and AdaGrad, various other optimizers are available for neural networks. If you are interested in learning about them in great detail, you must visit here.
