Table of contents
1. Introduction
2. Deep learning Optimizers
3. Gradient descent
4. Types of Gradient Descent
4.1. Batch Gradient Descent
4.2. Stochastic Gradient Descent (SGD)
4.3. Mini batch gradient descent
5. AdaGrad (Adaptive Gradient Descent)
5.1. Working of AdaGrad
5.2. Advantages and Disadvantages of AdaGrad
6. FAQs
7. Key Takeaways
Last Updated: Mar 27, 2024

What is AdaGrad?


Introduction

Deep learning is a sub-field of Machine learning. It is the key to many pioneering technologies like driverless cars and AI personal assistants such as Siri. A deep learning model involves various algorithms that enable the computer to learn from training data and make predictions on unseen data. One essential ingredient of training such models is optimization.

We will look at one of the optimization methods known as Adaptive Gradient descent, also known as AdaGrad.

Deep learning Optimizers

Before diving into this topic, it is necessary to understand deep learning optimizers. Whenever we train a deep learning model, we aim to minimize the loss function. We do this by updating various parameters, such as the weights, during each epoch. An optimizer is the function used to change these parameters so that the loss reaches its minimum. The choice of optimizer affects both the accuracy of the model and the speed of learning. Let us look at one of the most common optimization techniques, known as Gradient descent.

Gradient descent

It is used to minimize the loss function by updating the parameters in each epoch. A gradient can be defined as the slope of a function. Gradient descent simply means to travel down the slope until a local minimum is reached where the slope is zero. We travel down the slope in steps.

                                                                


Below is an equation that tells us the working of Gradient descent.

b = a − 𝜸∇f(a)

Suppose we are at the initial position ‘a’. Gradient descent gives us the direction and size of the step we take from ‘a’ to reach a position ‘b’ that is closer to the local minimum of our loss function. 𝜸 is the learning rate; it gives us the size of the step we take from our current position. ∇f(a) is the gradient of the function f at point a; it tells us the direction of our step.
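To make this concrete, here is a minimal Python sketch of the update b = a − 𝜸∇f(a). The quadratic loss f(x) = x², the starting point 5.0, the learning rate 0.1, and the 50 steps are illustrative choices, not values from this article:

def f(x):
    # A simple example loss function: f(x) = x^2, minimised at x = 0.
    return x ** 2

def grad_f(x):
    # Gradient (slope) of f at x.
    return 2 * x

a = 5.0        # current position 'a'
gamma = 0.1    # learning rate: the size of each step

for step in range(50):
    # b = a - gamma * grad_f(a): move downhill along the negative gradient.
    a = a - gamma * grad_f(a)

print(a, f(a))  # a is now very close to the local minimum at x = 0

With 𝜸 = 0.1, each step shrinks a by a factor of 0.8, so after 50 steps a is roughly 5.0 × 0.8⁵⁰ ≈ 7 × 10⁻⁵, i.e. essentially at the minimum.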

 


Types of Gradient Descent

Gradient descent is of three types:

Batch Gradient Descent

It is the most basic type of gradient descent. It considers the loss of every training example and then takes the average to update the parameters, so one step of gradient descent is taken after each epoch. However, if we have thousands or millions of training examples, which is common in deep learning, training becomes very slow.

Stochastic Gradient Descent (SGD)

In this, we calculate the error of one training example and then use it to take the gradient descent step. Thus, one step of gradient descent is taken after each example.

Mini batch gradient descent

It is a mixture of Batch gradient descent and stochastic gradient descent. We take a batch of examples smaller than the actual dataset. This batch is called a mini-batch. We take one gradient step after going through each mini-batch.
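The three variants above differ only in how many training examples contribute to each parameter update. Below is a rough Python (NumPy) sketch that expresses all three through the batch size; the linear model, the mean-squared-error loss, the learning rate of 0.01, and the batch size of 32 are illustrative assumptions rather than details from this article:

import numpy as np

def mse_gradient(w, X, y):
    # Gradient of the mean-squared-error loss for a linear model y_hat = X @ w.
    return 2 * X.T @ (X @ w - y) / len(y)

def train(X, y, batch_size, lr=0.01, epochs=10):
    # batch_size = len(X) -> Batch GD, 1 -> SGD, anything in between -> Mini-batch GD.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = np.random.permutation(len(X))   # shuffle the examples each epoch
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]
            # One gradient step per batch of examples.
            w -= lr * mse_gradient(w, X[batch], y[batch])
    return w

# Illustrative data: 1,000 examples with 3 features.
X = np.random.randn(1000, 3)
y = X @ np.array([1.0, -2.0, 0.5])

w_batch = train(X, y, batch_size=len(X))   # Batch gradient descent: one step per epoch
w_sgd   = train(X, y, batch_size=1)        # Stochastic gradient descent: one step per example
w_mini  = train(X, y, batch_size=32)       # Mini-batch gradient descent: one step per mini-batch

With a batch size equal to the dataset, the inner loop runs once per epoch; with a batch size of 1 it runs once per example; with 32 it runs once per mini-batch, matching the three definitions above.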

AdaGrad (Adaptive Gradient Descent)

We have learned about gradient descent and its various types. We looked into Stochastic gradient descent, which is excellent when we have a huge dataset. However, our dataset may contain both sparse and dense features. Sparse features have very few non-zero values and require a higher learning rate, while dense features have mostly non-zero values and require a lower learning rate. Stochastic gradient descent uses the same learning rate for every feature. To tackle this problem, we use AdaGrad, which uses a different learning rate for each feature. Let us see how AdaGrad actually works.

Working of AdaGrad

AdaGrad uses the gradients of all the previous steps to calculate the learning rate of a particular feature at each step.

The learning rate in AdaGrad at step t is:

𝜂t = 𝜂 / √(Vt + 𝝐)

Here, 𝜂 is the initial learning rate. 𝝐 is a small positive value; it is added in the denominator to avoid division by zero if Vt becomes zero. Vt is given by:

Vt = g1² + g2² + … + gt²

where gi is the gradient of the loss with respect to the weight at step i.

The gradients of all the previous steps are used to calculate Vt. If a feature is dense, it is updated frequently, so the accumulated sum of squared gradients, and therefore Vt, becomes large, which lowers its learning rate. A sparse feature, on the other hand, is updated less often, so its learning rate stays higher than that of the dense feature. In this way, each feature gets its own learning rate at every iteration.

The equation below shows the update rule for a weight w at the (t+1)-th iteration. 𝜂 is the initial learning rate of the feature, and gt is the gradient at step t.

wt+1 = wt − (𝜂 / √(Vt + 𝝐)) · gt
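Putting Vt and the update rule together, here is a minimal Python (NumPy) sketch of AdaGrad on the same kind of linear-regression data used above; 𝜂 = 0.1, 𝝐 = 1e-8, and the synthetic data are illustrative assumptions, not values prescribed by the article:

import numpy as np

def mse_gradient(w, X, y):
    # Gradient of the mean-squared-error loss for a linear model y_hat = X @ w.
    return 2 * X.T @ (X @ w - y) / len(y)

def adagrad(X, y, eta=0.1, eps=1e-8, epochs=1000):
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)      # Vt: running sum of squared gradients, one entry per feature
    for _ in range(epochs):
        g = mse_gradient(w, X, y)
        v += g ** 2           # Vt = g1^2 + g2^2 + ... + gt^2, grows at every step
        # Per-feature step size eta / sqrt(Vt + eps): features with large accumulated
        # gradients (dense, frequently updated) take smaller steps than sparse ones.
        w -= eta / np.sqrt(v + eps) * g
    return w

X = np.random.randn(1000, 3)
y = X @ np.array([1.0, -2.0, 0.5])
print(adagrad(X, y))   # the learned weights end up close to [1.0, -2.0, 0.5]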

Advantages and Disadvantages of AdaGrad

Advantages

The learning rate is automatically updated; there is no need to manually tune the learning rate for each feature. AdaGrad also gives better results than simple SGD if we have both sparse and dense features.

Disadvantages

A squared term is added to Vt at each iteration. Since it is always positive, the learning rate constantly decreases and can become infinitesimally small. AdaGrad is also less efficient than some other optimization algorithms like AdaDelta and Adam.
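The shrinking-learning-rate problem is easy to see numerically: if we keep accumulating squared gradients of a fixed size, the effective step size 𝜂 / √(Vt + 𝝐) decays roughly like 1/√t. The constant gradient of 1.0 and 𝜂 = 0.1 below are arbitrary illustrative values:

import math

eta, eps = 0.1, 1e-8
v = 0.0
for t in range(1, 10001):
    g = 1.0        # pretend every gradient has the same magnitude
    v += g ** 2    # Vt grows at every step because each added term is positive
    if t in (1, 10, 100, 1000, 10000):
        # Effective learning rate shrinks roughly like 1 / sqrt(t).
        print(t, eta / math.sqrt(v + eps))

After 10,000 steps the step size has already dropped from 0.1 to 0.001, even though the gradient itself never changed.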

 


FAQs

  1. Should we always use AdaGrad?
    In Machine Learning, we cannot say this with certainty for any algorithm; it depends on our data and what we want to achieve from it. However, if we have a large dataset with both sparse and dense features, AdaGrad is recommended.
     
  2. What should be our initial Learning rate?
    It depends on our model, but a learning rate of 0.01 is widely used in practice.
     
  3. How is the problem of the diminishing learning rate solved?
    It is solved by using a different rule for updating the learning rate than the one AdaGrad uses. For example, we can use other optimizers like Adam.

Key Takeaways

This blog gave an overview of AdaGrad. We started with optimizers and learned about gradient descent and its different types. We then saw why each feature needs its own learning rate and how AdaGrad achieves this. To get in-depth knowledge of different optimization algorithms, check out our machine learning course on Coding Ninjas.
