## Introduction

Machine learning is a sub-field of Artificial Intelligence that is changing the world. It affects almost every aspect of our lives. Machine learning algorithms differ from traditional algorithms because no human intervention is needed to improve the model: it learns from data and improves accordingly. Many algorithms make this self-learning possible. In this article, we will explore one of them, called Nesterov Accelerated Gradient (NAG).

## Gradient descent

It is essential to understand gradient descent before we look at the Nesterov Accelerated Gradient algorithm. Gradient descent is an **optimization** algorithm used to train our model. The accuracy of a machine learning model is determined by the cost function: the lower the cost, the better our model is performing. Optimization algorithms are used to reach the minimum point of our cost function. Gradient descent is the most common optimization algorithm. It takes initial parameter values at the start and then changes them iteratively to reach the minimum point of our cost function.

Source: __link__

As we can see above, we take some initial weight, and according to that, we are positioned at some point on our cost function. Now, gradient descent tweaks the weight in each iteration, and we move towards the minimum of our cost function accordingly.
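The iterative update described above can be sketched in a few lines. This is a minimal illustration, not a full training loop: the cost function J(w) = (w - 3)², the starting weight, the learning rate, and the iteration count are all illustrative choices.

```python
# Minimal gradient descent sketch on the 1D cost function J(w) = (w - 3)**2,
# whose minimum lies at w = 3. All numbers here are illustrative.

def grad(w):
    # Derivative of J(w) = (w - 3)**2
    return 2 * (w - 3)

w = 0.0    # initial weight
lr = 0.1   # learning rate
for _ in range(100):
    w = w - lr * grad(w)   # tweak the weight against the gradient each iteration

print(w)   # ends up very close to the minimizer w = 3
```

Each iteration moves the weight a step proportional to the gradient, so the updates naturally shrink as we approach the minimum.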

The size of our steps depends on the **learning rate** of our model. The higher the learning rate, the larger the step size. Choosing the correct learning rate for our model is very important, as a poor choice can cause problems during training.

A low learning rate ensures that we reach the minimum point, but it takes many iterations to train, while a very high learning rate can make us step past the minimum point, a problem commonly known as **overshooting**.
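The effect of the learning rate can be seen on the same toy cost function J(w) = (w - 3)². The three learning rates below are illustrative values chosen to show the three regimes:

```python
# How the learning rate affects convergence on J(w) = (w - 3)**2 (minimum at w = 3).
def run(lr, steps=50):
    w = 0.0
    for _ in range(steps):
        w -= lr * 2 * (w - 3)   # gradient of (w - 3)**2 is 2 * (w - 3)
    return w

print(run(0.01))  # too low: still far from 3 after 50 iterations
print(run(0.5))   # well chosen: lands on the minimum
print(run(1.1))   # too high: overshoots the minimum and diverges
```

With a low rate the steps are tiny and training is slow; with a rate that is too high each step jumps past the minimum and the error grows, which is the overshooting problem described above.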

Source: __link__

### Drawbacks of gradient descent

The main drawback of gradient descent is that the update depends only on the learning rate and the gradient at the current step. On a plateau, or at the **saddle points** of our function, the gradient is close to zero, so the step size becomes very small or even zero. Thus, the update of our parameters is very slow on a gentle slope.

Let us look at an example. The starting point of our model is ‘A’. The loss function decreases rapidly along the path AB because of the higher gradient. But as the gradient decreases from B to C, the learning becomes negligible. The gradient at point ‘C’ is zero; it is the saddle point of our function. Even after many iterations, we remain stuck at ‘C’ and never reach the desired minimum ‘D’.
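The stalling behaviour can be reproduced on a tiny example. The function f(w) = w³ below is an illustrative choice: its gradient 3w² vanishes at w = 0 even though lower values of f exist for w < 0, mimicking the flat region between ‘B’ and ‘D’ above.

```python
# Gradient descent stalling near a flat region: f(w) = w**3 has gradient
# 3 * w**2, which vanishes at w = 0 although f keeps decreasing for w < 0.
# Starting point, learning rate, and step count are illustrative.
w = 1.0
lr = 0.01
for _ in range(10_000):
    w -= lr * 3 * w**2   # step size shrinks as the gradient flattens

print(w)   # still stuck just above the flat point at w = 0
```

Because the update is proportional to the local gradient alone, the steps shrink toward zero near the flat point, and even ten thousand iterations never carry the weight past it.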

This problem is solved by adding **momentum** to our gradient descent.
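As a minimal sketch of the idea, the classical momentum update keeps a velocity term that accumulates an exponentially decayed history of past gradients, so the step does not vanish the moment the local gradient does. Reusing the illustrative flat function f(w) = w³ from above (the coefficients lr and beta are also illustrative):

```python
# Gradient descent with momentum on f(w) = w**3, whose gradient 3 * w**2
# vanishes at the flat point w = 0. Plain gradient descent stalls there;
# the accumulated velocity carries the update through it.
def grad(w):
    return 3 * w**2

w, v = 1.0, 0.0
lr, beta = 0.01, 0.9   # illustrative learning rate and momentum coefficient
steps = 0
while w > -0.5 and steps < 5000:
    v = beta * v + grad(w)   # velocity: decayed sum of past gradients
    w -= lr * v              # step with the velocity, not the raw gradient
    steps += 1

print(w, steps)   # w has moved well past the flat point at 0
```

The velocity built up on the steep part of the curve keeps pushing the weight forward across the plateau, which is exactly the behaviour plain gradient descent lacks. Nesterov Accelerated Gradient refines this idea further by evaluating the gradient at the look-ahead position.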