Introduction
Let’s understand the situation with the help of an example. In this example data set, we have four input columns, X1, X2, X3, and X4, and one output column, Y.
In this data set, X1 is a sparse column: most of its values are zero and only a few are non-zero. All the remaining columns are dense, that is, most of their values are non-zero.
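To make this concrete, here is a small, made-up sketch of what such a data set might look like (all the numbers are purely hypothetical, chosen only to show one sparse column next to dense ones):

```python
import numpy as np

# Hypothetical toy data set: X1 is sparse (mostly zero), X2-X4 are dense.
X1 = np.array([0.0, 0.0, 3.1, 0.0, 0.0, 0.0, 0.0, 1.4])   # sparse input column
X2 = np.array([2.3, 1.7, 0.9, 4.2, 3.3, 2.8, 1.1, 0.6])   # dense input column
X3 = np.array([5.0, 4.1, 6.2, 5.5, 4.8, 6.0, 5.2, 4.9])   # dense input column
X4 = np.array([1.2, 0.8, 1.5, 1.1, 0.9, 1.3, 1.0, 1.4])   # dense input column
Y  = np.array([7.1, 5.9, 9.4, 8.8, 7.5, 8.2, 6.3, 6.0])   # output column

# Fraction of non-zero values per column: low for X1, high for the rest.
for name, col in [("X1", X1), ("X2", X2), ("X3", X3), ("X4", X4)]:
    print(name, np.count_nonzero(col) / len(col))
```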
If we want to apply multiple regression to this data set, we have to identify the hyperplane that gives the minimum loss.
The hyperplane is the plane we identify that gives the minimum loss. We start by randomly selecting some weights, and then we update them using the equation below.
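In one common notation (with w0 as an added bias term and t as the iteration index, both assumptions here), the hyperplane and the update rule for w1 can be written as:

```latex
\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4,
\qquad
w_1^{(t+1)} = w_1^{(t)} - \eta \, \frac{\partial L}{\partial w_1}
```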
For understanding purposes, we are updating only the weight w1 here; we have to update all the other weights w2, w3, w4, and so on in the same way. Here η is called the learning rate, and L is the loss function. For a better understanding of this equation, I am dropping a link here that will redirect you to an implementation of the Gradient Descent algorithm. In the case of multiple regression, the loss function is the squared loss. The partial derivative of the loss function with respect to w1 is shown below.
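Taking the squared loss over n training points (one standard way to write it, with x1i denoting the value of X1 in the i-th row), the derivative works out to:

```latex
L = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\qquad
\frac{\partial L}{\partial w_1} = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right) x_{1i}
```

Notice that every term in the derivative is multiplied by x1i, which is exactly why sparse columns cause trouble in the next step.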
Now take a look at the data set again. Most of the values in X1 are zero, and whenever X1 is zero the derivative with respect to w1 is also zero, so w1 does not get updated for that data point. That is the main problem in the case of sparse data. Recall that in the Gradient Descent algorithm we randomly select some initial weights and keep updating them, moving step by step toward the minimum point.
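A minimal, self-contained sketch of this effect (made-up data, plain gradient descent with the squared loss, and a hypothetical learning rate of 0.1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: column 0 is sparse (5 non-zero rows out of 100), column 1 is dense.
X = np.zeros((100, 2))
X[:5, 0] = rng.normal(size=5)
X[:, 1] = rng.normal(size=100)
y = 2.0 * X[:, 0] + 3.0 * X[:, 1]        # hypothetical target with true weights (2, 3)

w = np.zeros(2)
eta = 0.1
for _ in range(50):
    grad = -2.0 / len(y) * X.T @ (y - X @ w)   # d(squared loss)/dw
    w -= eta * grad                             # same learning rate for every weight

print(w)   # w[1] gets close to 3.0 quickly; w[0] moves much more slowly, because its
           # gradient is zero on every row where the sparse feature is zero
```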
(Figure: gradient descent taking progressively smaller steps toward the minimum on a 2D loss curve. Source: medium.com)
What we can observe in this 2D graph is that we initially take long steps, and as we get closer to the local minimum the steps become smaller than the previous ones. Taking smaller steps close to the minimum gives us a much better chance of landing on the actual minimum point. The learning rate determines the size of the steps taken: if the learning rate is large, the steps taken are large, and vice versa.
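One way to see why the steps shrink is that the step taken in each dimension is just the learning rate times the current gradient, and the gradient itself becomes small as we approach the minimum:

```latex
\Delta w_1 = -\eta \, \frac{\partial L}{\partial w_1}
```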
AdaGrad
Now, coming to neural networks, we update a large number of weights. So if some of the input columns are sparse and some are dense, some dimensions get updated quickly while others barely move. To avoid this problem, we use the AdaGrad optimizer. The first idea is that we use a different learning rate for each weight. The second is that each learning rate decreases based on that weight's previous updates.
AdaGrad equation:
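In one common presentation (using η for the initial learning rate, α_t for the accumulated squared gradients, and a small ε added to avoid division by zero; the exact symbols vary between sources), the update for a single weight w is:

```latex
w_{t+1} = w_t - \eta'_t \, \frac{\partial L}{\partial w_t},
\qquad
\eta'_t = \frac{\eta}{\sqrt{\alpha_t + \epsilon}},
\qquad
\alpha_t = \sum_{i=1}^{t} \left( \frac{\partial L}{\partial w_i} \right)^2
```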
Here η′_t is the learning rate for one dimension at iteration t; each dimension gets its own. I am showing the update equation for only one dimension; all the other dimensions follow the same equation.
The accumulated term α_t depends on the previous gradients. Why are we considering previous gradients? Because the updates are driven by the gradients: a zero gradient means that dimension did not need an update, so its learning rate should stay large, while repeated non-zero gradients mean the dimension has already been updated a lot, so its learning rate should be reduced.
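A minimal sketch of AdaGrad on the same kind of made-up sparse/dense data (not a production implementation) shows this per-dimension behaviour: the sparse dimension accumulates very little squared gradient, so it keeps a relatively large effective learning rate, while the dense dimension's rate shrinks as it accumulates history:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: column 0 sparse, column 1 dense, true weights (2, 3).
X = np.zeros((100, 2))
X[:5, 0] = rng.normal(size=5)
X[:, 1] = rng.normal(size=100)
y = 2.0 * X[:, 0] + 3.0 * X[:, 1]

w = np.zeros(2)
eta, eps = 0.5, 1e-8                     # hypothetical hyperparameters
alpha = np.zeros(2)                      # accumulated squared gradients, one per weight

for _ in range(1000):
    grad = -2.0 / len(y) * X.T @ (y - X @ w)     # d(squared loss)/dw
    alpha += grad ** 2                            # per-dimension gradient history
    w -= (eta / np.sqrt(alpha + eps)) * grad      # per-dimension learning rate

print(w)                           # both weights end up close to (2.0, 3.0)
print(eta / np.sqrt(alpha + eps))  # the sparse dimension keeps the larger effective rate
```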