ML FUNDAMENTALS: GRADIENT DESCENT ALGORITHM.

Arjun S
Jan 12, 2021 · 6 min read

Let's get into our first algorithm in machine learning: the GRADIENT DESCENT ALGORITHM.

In the previous articles I addressed what the cost function is and why it should be minimized. In this article I'm going to introduce an algorithm called gradient descent for that purpose.

If you have no idea about the cost function, check out my previous articles to get a base in linear regression and the cost function.

https://arjun-s.medium.com/simple-linear-regression-ground-level-understanding-e278ebf028d3

https://arjun-s.medium.com/ml-fundamentals-what-is-cost-function-d396129cc611

In one sentence: gradient descent is an iterative optimization algorithm that minimizes the cost function (or error) and finds the most suitable parameters for the hypothesis, so that it fits the dataset as well as possible.

Let's see what Wikipedia has to say:

Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient of the function at the current point, because this is the direction of steepest descent.

Uh-oh!!!!

That doesn't look delightful.

Let me break this down with an example.

One popular example goes like this:

Suppose you are at the top of a mountain, and you have to reach a lake which lies at the lowest point of the mountain. The twist is that you are blindfolded, so you have zero visibility of where you are headed. So, what approach will you take to reach the lake? The best way is to check the ground near you and observe where the land tends to descend. This gives you an idea of the direction in which to take your first step. If you keep following the descending path, it is very likely you will reach the lake.

Our ultimate goal is to reach the point of minimum cost (or lowest error). In machine learning problems we usually start by assuming initial values for the parameters of the hypothesis, and then we repeatedly apply the gradient descent algorithm, starting from those assumed values, until the cost function reaches its lowest level. Let me derive the equation for gradient descent to get a deeper understanding. Here I'm taking simple linear regression for the explanation.

We get the final equation as:
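In standard notation (the same form used in the Coursera course referenced at the end), the rule is repeated until convergence, where α is the learning rate, m is the number of training examples, and each sum runs over i = 1 to m:

    θj := θj − α · ∂J(θo, θ1)/∂θj        for j = 0 and j = 1

Working out the derivative of the cost function for simple linear regression, this becomes:

    θo := θo − α · (1/m) · Σ ( h(x(i)) − y(i) )

    θ1 := θ1 − α · (1/m) · Σ ( h(x(i)) − y(i) ) · x(i)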

Don't wonder where the θo and θ1 came from. These are the parameters of our simple linear regression hypothesis (h(x) = θo + θ1·x).

And what is α, the learning rate, in the above equation? And what exactly is the derivative term in the gradient descent equation?

The learning rate determines the size of each "step" as the algorithm moves closer to the minimum, while the derivative determines the "direction" to move in. If you are familiar with calculus, you know that the derivative represents the slope of the tangent line at a certain point of the function. This slope is what tells us what the next parameter values should be, and thus the "direction" in which we move.

Oh! You might be wondering where the x(i) for θo has gone. It comes out of the differentiation: when the cost function is differentiated with respect to θo, the inner derivative of the hypothesis is just 1 (θo is not multiplied by x in the hypothesis), so there is no x(i) factor left to write.
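To make that step explicit, assuming the squared-error cost from the previous article, J(θo, θ1) = (1/2m) · Σ ( h(x(i)) − y(i) )², the chain rule gives:

    ∂J/∂θo = (1/m) · Σ ( h(x(i)) − y(i) ) · ∂h(x(i))/∂θo = (1/m) · Σ ( h(x(i)) − y(i) ) · 1

    ∂J/∂θ1 = (1/m) · Σ ( h(x(i)) − y(i) ) · ∂h(x(i))/∂θ1 = (1/m) · Σ ( h(x(i)) − y(i) ) · x(i)

because h(x) = θo + θ1·x, whose derivative with respect to θo is 1 and with respect to θ1 is x.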

Look at the figure below. Notice the change that occurs when multiple parameters are involved (multivariable linear regression will be discussed in upcoming articles).
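For reference, the same pattern generalizes (a sketch, writing xj(i) for the j-th feature of the i-th example and taking xo(i) = 1): with n features,

    θj := θj − α · (1/m) · Σ ( h(x(i)) − y(i) ) · xj(i)        for every j = 0, 1, …, n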

POINT TO NOTE!!!

One thing we should note is that the parameters must be updated simultaneously. For example, in simple linear regression there are two parameters, θo and θ1, and these should be updated at the same time; in multiple linear regression, all the parameters (one per feature) should likewise be updated simultaneously. Since gradient descent takes the derivative of the cost function at the current parameter values, each update relies on those current values. If we were to update each parameter one at a time, we would end up using the new first parameter to compute the second parameter's update, giving an incorrect result.
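As a rough sketch of what that means in code (the helper functions dJ_dtheta0 and dJ_dtheta1 are made-up names standing for the two derivative terms above):

```python
import numpy as np

# Tiny illustrative dataset and the two partial derivatives of the cost.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
alpha = 0.1

def dJ_dtheta0(t0, t1):
    return np.mean((t0 + t1 * x) - y)          # dJ/d(theta0)

def dJ_dtheta1(t0, t1):
    return np.mean(((t0 + t1 * x) - y) * x)    # dJ/d(theta1)

theta0, theta1 = 0.0, 0.0

# Correct: evaluate BOTH gradients at the current (theta0, theta1),
# then assign both parameters together.
temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
theta0, theta1 = temp0, temp1

# Incorrect: theta0 would be overwritten first, so theta1's gradient would
# then be computed from a mixture of the new theta0 and the old theta1.
# theta0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
# theta1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
```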

Consider the graph below, with the cost function on the y-axis and the parameter on the x-axis. It shows that regardless of the sign of the slope, the parameter value converges to the minimum. A univariate linear expression is used so that the graph is easy to draw; as the number of parameters increases, the dimensionality of the graph also increases and it becomes difficult to represent.

We know that the algorithm repeats until the cost function converges to a minimum. In the figure above, since it is a univariate function, there is only a single extremum and the algorithm will always converge to that minimum. But that is not always the case: with multiple features the cost function may not be a single convex curve like the one above, and it can have more than one minimum. There is therefore a possibility of ending up in any of those minima, and which one the cost function reaches depends on the initial assumption of the parameter values. The graph below depicts this (the data below has two features, and you can clearly see how the dimensionality has changed compared to the figure above).

The learning rate should also be tuned so that the algorithm converges in a reasonable time. If the learning rate is too small, the algorithm can be painfully slow. If it is too big, the algorithm can overshoot the minimum, fail to converge, or even diverge. With that said, gradient descent can still converge to a local minimum with a fixed learning rate, because the gradient itself shrinks as we approach the minimum, so the steps naturally get smaller.
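A quick toy illustration (not from the article; just gradient descent on the one-dimensional function f(x) = x², whose derivative is 2x) shows all three behaviours:

```python
# Gradient descent on f(x) = x**2 with a fixed learning rate alpha.
def descend(alpha, x=5.0, steps=20):
    for _ in range(steps):
        x = x - alpha * 2 * x   # step against the gradient f'(x) = 2x
    return x

print(descend(0.001))  # too small: after 20 steps x has barely moved from 5
print(descend(0.1))    # reasonable: x ends up close to the minimum at 0
print(descend(1.5))    # too large: every step overshoots and |x| blows up
```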

This method looks at every example in the training set in every iteration of the algorithm, and it is called batch gradient descent. That is not always practical: if there are thousands of training examples, every iteration costs more time and computational power. There are two other variants, Stochastic Gradient Descent and mini-batch Gradient Descent, which address this issue.
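To tie it together, here is a minimal sketch of batch gradient descent for simple linear regression (the function name, dataset, learning rate and iteration count are illustrative choices, not taken from the article):

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.05, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x with batch gradient descent."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0              # initial guesses for the parameters
    for _ in range(iterations):
        error = (theta0 + theta1 * x) - y  # uses ALL m examples: "batch"
        grad0 = np.sum(error) / m          # dJ/d(theta0)
        grad1 = np.sum(error * x) / m      # dJ/d(theta1)
        theta0 -= alpha * grad0            # simultaneous update of
        theta1 -= alpha * grad1            # both parameters
    return theta0, theta1

# Usage on a tiny synthetic dataset that roughly follows y = 2 + 3x;
# the fitted parameters should come out close to (2, 3).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.0, 7.9, 11.2, 13.8])
print(batch_gradient_descent(x, y))
```

Stochastic gradient descent would instead compute each update from a single training example, and mini-batch gradient descent from a small random subset, which is what makes them cheaper per iteration on large datasets.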

OK, I hope you got an understanding of gradient descent. Buckle up, we have a lot more to explore.

REFERENCES:

Machine Learning: Coursera — Cost Function

Medium articles.
