## Introduction

An optimization algorithm is at the heart of almost every machine learning method. We'll be covering one such algorithm in this post today.

Gradient descent is by far the most prominent optimization approach in machine learning and deep learning today. Nonetheless, it perplexes many newcomers. If you're new to gradient descent, the math might seem a little challenging. In this post, we'll explore gradient descent in detail and help you get a better grasp of this concept.

## Gradient descent: What is it?

Gradient descent is an optimization algorithm used when training a __machine learning__ model. It iteratively adjusts the model's parameters to drive a given (typically convex) cost function toward a local minimum. Gradient descent begins by establishing starting parameter values and then repeatedly updates them, using calculus, to minimize the specified cost function. To fully grasp this notion, it's necessary to first understand gradients.

### Gradient

A gradient is a measure of how much the error changes in relation to a change in the weights. For better intuition, a gradient can be thought of as the slope of a function: a larger gradient corresponds to a steeper slope, which lets the model learn faster. Conversely, if the slope is zero, the model stops learning. In mathematical terms, a gradient is the vector of partial derivatives of a function with respect to its inputs.
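As a minimal sketch of this idea, consider a hypothetical one-parameter function `f(w) = w**2`. Its derivative, `2*w`, plays the role of the gradient: its magnitude tells us how steep the slope is at any point.

```python
# Hypothetical example: for f(w) = w**2, the gradient is df/dw = 2*w.
def f(w):
    return w ** 2

def gradient(w):
    # Analytic derivative of f with respect to w
    return 2 * w

print(gradient(3.0))  # steep slope far from the minimum -> 6.0
print(gradient(0.0))  # zero slope at the minimum -> 0.0, learning stops
```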

Let's go through an example to understand the idea of gradient descent better. Take a look at the three-dimensional graph below, viewing it as the surface of a cost function.

Source: __link__

Our aim is to go from the mountain in the upper right corner (at a high cost) to the sea in the lower-left corner (low cost). Beginning at the peak, we take our first step downhill in the direction indicated by the negative gradient. For each subsequent step, we recalculate the gradient with the new coordinates as the input. We repeat this procedure until we reach the bottom of our graph, or reach a point where we can no longer proceed downhill (local minimum).

Source: __link__
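The walk-downhill procedure above can be sketched in a few lines. This is a simplified one-dimensional version, assuming a hypothetical cost function `cost(w) = (w - 4)**2` whose gradient is `2*(w - 4)`; all names here are illustrative.

```python
# Gradient of the assumed cost function cost(w) = (w - 4)**2
def grad(w):
    return 2 * (w - 4)

w = 10.0             # start at the "peak" (high cost)
learning_rate = 0.1  # size of each downhill step
for _ in range(100):
    # Step in the direction of the negative gradient,
    # recomputing the gradient at the new position each time.
    w -= learning_rate * grad(w)

print(round(w, 3))  # converges toward the minimum at w = 4
```

Each iteration moves `w` a little further downhill, and the steps shrink automatically as the slope flattens near the minimum.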

Let’s define a few more technical terms before we proceed further.

### Learning rate

Referencing the example above, the learning rate is the size of the steps we take. With a high learning rate we cover more ground with each step, but since the slope of the hill changes at every point, we risk overshooting the local minimum. With a low learning rate we move more precisely, but finding the local minimum takes much longer.

Source: __link__
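The trade-off can be seen numerically. This illustrative sketch descends the assumed cost function `cost(w) = w**2` (gradient `2*w`) with three different learning rates; the function name `descend` is hypothetical.

```python
# Run a fixed number of gradient steps on cost(w) = w**2, gradient 2*w.
def descend(learning_rate, steps=20, w=1.0):
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

small = descend(0.01)  # low rate: slow, still far from the minimum at 0
good = descend(0.1)    # moderate rate: converges quickly
large = descend(1.1)   # too high: overshoots, |w| grows every step

print(abs(small), abs(good), abs(large))
```

With the moderate rate the parameter is already very close to zero after 20 steps, while the low rate has barely moved and the high rate diverges.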

### Cost function

A cost function helps us analyze how good or bad our model is at making predictions for a given set of inputs. The slope of this curve indicates how we should adjust our parameters to improve the model's accuracy.
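One common cost function is mean squared error (MSE). The sketch below, assuming a simple linear model `prediction = w * x`, shows how the cost scores a parameter choice against known data; the names are illustrative.

```python
# Mean squared error of predictions w * x against targets y
def mse(w, xs, ys):
    errors = [(w * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # data generated by y = 2 * x

print(mse(2.0, xs, ys))  # perfect fit -> cost of 0.0
print(mse(1.0, xs, ys))  # worse parameter -> positive cost
```

Gradient descent would then use the slope of this cost with respect to `w` to nudge the parameter toward the value that minimizes the error.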