
Introduction

Deep learning is a sub-field of machine learning. It is the key to many pioneering technologies, such as driverless cars and AI personal assistants like Siri. A deep learning model uses algorithms that enable the computer to learn from training data and make predictions on unseen data. A key part of this training process is optimization.

In this blog, we will look at one such optimization method: Adaptive Gradient descent, commonly known as AdaGrad.

Deep learning Optimizers

Before diving into this topic, it is necessary to understand deep learning optimizers. Whenever we train a deep learning model, we aim to minimize the loss function. We do this by updating various parameters, such as the weights, during each epoch. An optimizer is a function used to change these parameters to reach the minimum loss. It affects both the accuracy of the model and its learning speed. Let us look at one of the most common optimization techniques, known as gradient descent.


Gradient descent

It is used to minimize the loss function by updating the parameters in each epoch. A gradient can be defined as the slope of a function. Gradient descent simply means to travel down the slope until a local minimum is reached where the slope is zero. We travel down the slope in steps.

Suppose we are at an initial position 'a'. Gradient descent gives us the direction and size of the step we take from 'a' to reach a position 'b' that is closer to the local minimum of our loss function:

b = a − γ∇f(a)

Here, γ is the learning rate; it gives the size of the step we take from our current position. ∇f(a) is the gradient of the function f at point 'a'; it tells us the direction of our step.
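The update rule above can be sketched in a few lines of Python. The function f(x) = x², whose gradient is 2x, and the learning rate of 0.1 are illustrative choices, not taken from the article:

```python
# Minimal sketch of gradient descent on f(x) = x**2, whose gradient is 2*x.
# The function, learning rate, and step count are illustrative choices.

def gradient_descent(a, learning_rate=0.1, steps=50):
    """Repeatedly apply b = a - learning_rate * grad_f(a)."""
    for _ in range(steps):
        grad = 2 * a                   # gradient of f(x) = x**2 at point a
        a = a - learning_rate * grad   # b = a - γ∇f(a)
    return a

x = gradient_descent(5.0)  # converges toward the minimum at x = 0
```

Each iteration moves the current position against the gradient, so repeated steps approach the minimum at x = 0.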

Batch gradient descent

It is the most basic type of gradient descent. It computes the loss of every training example and then takes the average to update the parameters, so one step of gradient descent is taken after each epoch. However, if we have thousands or millions of training examples, which is common in deep learning, training becomes very slow.

Stochastic Gradient Descent (SGD)

In SGD, we calculate the error of a single training example and use it to take a gradient descent step. Thus, one step of gradient descent is taken after each example.
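A minimal SGD sketch for a 1-D linear model y ≈ w·x with squared error per example. The model, data, and learning rate are illustrative assumptions, not from the article:

```python
import random

# Sketch of stochastic gradient descent: one parameter update per example.
# Model y ≈ w * x with squared-error loss (w * x - y)**2; values are illustrative.

def sgd(data, w=0.0, lr=0.01, epochs=20):
    for _ in range(epochs):
        random.shuffle(data)                 # visit examples in random order
        for x, y in data:                    # one gradient step per example
            grad = 2 * (w * x - y) * x       # d/dw of (w*x - y)**2
            w -= lr * grad
    return w

data = [(x, 3.0 * x) for x in range(1, 6)]   # true weight is 3.0
w = sgd(data)                                # converges near 3.0
```

Because the update uses only one example at a time, the steps are noisy but cheap, which is why SGD scales to huge datasets.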

Mini batch gradient descent

It is a mixture of batch gradient descent and stochastic gradient descent. We take a batch of examples smaller than the full dataset, called a mini-batch, and take one gradient step after processing each mini-batch.
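The same 1-D model can illustrate mini-batch gradient descent; here the gradient is averaged over each mini-batch before a single update. The batch size and data are illustrative choices:

```python
# Sketch of mini-batch gradient descent for the 1-D model y ≈ w * x.
# One averaged gradient step per mini-batch; values are illustrative.

def minibatch_gd(data, w=0.0, lr=0.01, batch_size=2, epochs=50):
    for _ in range(epochs):
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]                       # one mini-batch
            # average the per-example gradients over the mini-batch
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad                                        # one step per batch
    return w

data = [(x, 2.0 * x) for x in range(1, 7)]   # true weight is 2.0
w = minibatch_gd(data)
```

Averaging over a small batch reduces the noise of pure SGD while still taking many more steps per epoch than batch gradient descent.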

AdaGrad (Adaptive Gradient Descent)

We have learned about gradient descent and its main variants. Stochastic gradient descent is excellent when we have a huge dataset. However, our dataset may contain both sparse and dense features. Sparse features have very few non-zero values and require a higher learning rate; dense features have mostly non-zero values and require a lower learning rate. Stochastic gradient descent uses the same learning rate for every feature. To tackle this problem, we use AdaGrad, which maintains a different learning rate for each feature. Let us see how AdaGrad actually works.

Working of AdaGrad

AdaGrad uses the gradients of all the previous steps to calculate the learning rate of a particular feature at each step.

The learning rate in AdaGrad for step t is:

η_t = η / √(V_t + ε)

Here, η is the initial learning rate and ε is a small positive value added to the denominator to avoid division by zero if V_t becomes zero. V_t is the sum of the squares of the gradients from all previous steps:

V_t = g_1² + g_2² + … + g_t²

where g_i is the gradient at step i.

The gradients of all the previous steps are used to calculate V_t. If a dense feature is updated frequently, the collected sum of squared gradients will be large, the value of V_t will be high, and its learning rate will be lowered. A sparse feature, in contrast, is updated rarely, so its learning rate stays higher than that of a dense feature. Each feature thus has its own learning rate at each iteration.

The update rule of a weight w for the (t+1)-th iteration is:

w_{t+1} = w_t − (η / √(V_t + ε)) · g_t

where η is the initial learning rate and g_t is the gradient of the loss with respect to w at step t.
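Putting the pieces together, AdaGrad can be sketched for the same illustrative 1-D model y ≈ w·x; the values of η, ε, and the data are assumptions for demonstration only:

```python
import math

# Minimal AdaGrad sketch for the 1-D model y ≈ w * x with squared-error loss.
# eta, eps, and the data are illustrative values, not taken from the article.

def adagrad(data, w=0.0, eta=0.5, eps=1e-8, epochs=100):
    v = 0.0                                       # running sum of squared gradients
    for _ in range(epochs):
        for x, y in data:
            g = 2 * (w * x - y) * x               # gradient of (w*x - y)**2
            v += g * g                            # V_t accumulates g_t**2
            w -= (eta / math.sqrt(v + eps)) * g   # adaptive per-step learning rate
    return w

data = [(x, 4.0 * x) for x in range(1, 4)]        # true weight is 4.0
w = adagrad(data)
```

In a real model, v would be a per-parameter accumulator (one entry per weight), which is exactly how each feature ends up with its own learning rate.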

Advantages and Disadvantages of AdaGrad

Advantages:

- The learning rate is updated automatically; there is no need to manually tune the learning rate for each feature.
- It gives better results than simple SGD when we have both sparse and dense features.

Disadvantages:

- A squared term is added to V_t at every iteration. Since this term is always positive, the learning rate constantly decreases and can become vanishingly small.
- It is less efficient than some other optimization algorithms, such as AdaDelta and Adam.
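The diminishing learning rate can be seen numerically: with a constant gradient g, the accumulated sum V_t grows linearly with t, so the effective learning rate η/√V_t decays like 1/√t. A small sketch with illustrative values:

```python
import math

# Illustrative check of AdaGrad's shrinking step size: with a constant
# gradient g, the effective learning rate eta / sqrt(V_t) decays as 1/sqrt(t).

eta, g, v = 0.1, 1.0, 0.0
rates = []
for t in range(1, 10001):
    v += g * g                       # V_t = t * g**2 for a constant gradient
    rates.append(eta / math.sqrt(v)) # effective learning rate at step t

# After 10,000 identical gradients, the step size has shrunk 100-fold
# (from 0.1 down to 0.001), and it keeps shrinking forever.
```

This is the behavior that optimizers like AdaDelta and Adam modify, by replacing the ever-growing sum with a decaying average of squared gradients.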

Should we always use AdaGrad? In machine learning, we cannot say this with certainty for any algorithm; it depends on our data and what we want to achieve from it. However, if we have a large dataset with both sparse and dense features, AdaGrad is recommended.

What should our initial learning rate be? It depends on the model, but a learning rate of 0.01 is widely used in practice.

How is the problem of the diminishing learning rate solved? It is solved by using a different rule for updating the learning rate than AdaGrad's. We can use other optimizers, such as Adam.

Key Takeaways

This blog gave an overview of AdaGrad. We started with optimizers, then learned about gradient descent and its different types. We saw why each feature needs its own learning rate and how AdaGrad achieves this. To get in-depth knowledge of different optimization algorithms, check out the machine learning course on Coding Ninjas.