Understanding Gradient Descent

Source: Deep Learning on Medium

The best line is chosen by computing the error, or deviation (difference), between the approximated values (the values predicted by the chosen line) and the actual values. In other words, we calculate the sum of squared errors (the sum of the squared differences between the approximated and real values). The line that produces the smallest squared-error value is then chosen as the best fit.

Now let’s consider a real machine learning scenario: we give each data point as input to the machine learning algorithm and, as a result, it predicts some output. We then evaluate the error of each output by taking the difference between the actual output and the predicted output. The term loss function refers to the error computed for a single training example. The squared average of the losses (i.e., the sum of squared errors divided by the number of data points) is called the cost function.
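The cost function described above can be sketched in a few lines of Python; the function and variable names here are illustrative, not from the article:

```python
def cost(y_actual, y_predicted):
    """Mean squared error: sum of squared errors / number of data points."""
    n = len(y_actual)
    return sum((ya - yp) ** 2 for ya, yp in zip(y_actual, y_predicted)) / n

# Example: three data points, two predictions off by 0.5 each.
print(cost([1.0, 2.0, 3.0], [1.5, 2.0, 2.5]))  # approximately 0.167
```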

Figure 2

Once we compute the squared average (i.e., the cost function), our job is to find the line parameters that produce the minimum cost (i.e., the slope m and intercept c values that make the error minimal).

Getting Deeper into the Concept

We know that y = slope*x + intercept (i.e., y = mx + c), and we also know that the error equals,

E = (1/n)*[(Y₁ - Y₁′)² + (Y₂ - Y₂′)² + … + (Yₙ - Yₙ′)²]

Rewriting the equation we get,

E = (1/n)*[(Y₁ - (mx₁ + c))² + (Y₂ - (mx₂ + c))² + … + (Yₙ - (mxₙ + c))²]

Now the algorithm iteratively tries different values for m and c and looks for the pair for which E is minimum. This method of finding the best values of m and c is called “least squares”.
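The brute-force search described above can be sketched as a grid search over candidate (m, c) pairs; the data and the grid resolution below are made-up choices for the example:

```python
# Data generated from the line y = 2x + 1, so the exact fit is in the grid.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

def error(m, c):
    """Sum of squared errors / n for the line y = m*x + c."""
    n = len(xs)
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys)) / n

# Try every (m, c) on a 0.0..3.0 grid in steps of 0.1 and keep the best.
best = min(
    ((m / 10, c / 10) for m in range(31) for c in range(31)),
    key=lambda mc: error(*mc),
)
print(best)  # (2.0, 1.0)
```

Note the cost: this evaluates E for 31 × 31 = 961 candidate pairs, which is exactly the inefficiency that gradient descent avoids.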

Figure 3: 3D Visualization of the Scenario

We can converge to the minimum squared-error value by this method, but we need to compute E for a whole lot of m and c values. Even though this computation is performed by the computer, it is not an efficient way to find the minimum. Here comes the importance of an optimization technique called gradient descent.

Gradient Descent

Gradient descent is an optimization algorithm that greatly reduces the computation needed to minimize the squared error. The algorithm takes big steps when it is far from the minimum and small steps when it is near the minimum. This is achieved by making use of something called the learning rate.

Figure 4: Gradient Descent

Note: The size of each step taken toward the minimum is called the step size; it is obtained by multiplying the gradient at the current point by the learning rate.

Mathematics behind gradient descent!

As we have discussed, the equation for the sum of squared error is equal to,

E = (1/n)*[(Y₁ - (mx₁ + c))² + (Y₂ - (mx₂ + c))² + … + (Yₙ - (mxₙ + c))²]

We want the values of the intercept and slope that give us the minimum sum of squared residuals.


A derivative in calculus is calculated as the slope of the graph at a particular point.

Using the chain rule, we take the derivative of the function with respect to the intercept as well as the slope.
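The two partial derivatives the text later refers to as equation 1 (w.r.t. the intercept c) and equation 2 (w.r.t. the slope m) follow from applying the chain rule to E; a sketch, with the formulas spelled out in the comments:

```python
# dE/dc = (-2/n) * Σ (Yᵢ - (m*xᵢ + c))        -- equation 1 (intercept)
# dE/dm = (-2/n) * Σ xᵢ * (Yᵢ - (m*xᵢ + c))   -- equation 2 (slope)
def gradients(m, c, xs, ys):
    """Partial derivatives of the mean squared error E w.r.t. m and c."""
    n = len(xs)
    residuals = [y - (m * x + c) for x, y in zip(xs, ys)]
    dE_dc = (-2 / n) * sum(residuals)
    dE_dm = (-2 / n) * sum(x * r for x, r in zip(xs, residuals))
    return dE_dm, dE_dc
```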

Now let’s consider the derivative with respect to the intercept (equation 1). We are given the x and y values, so we feed them directly into the equation. Initially, we assume the intercept is 0 and the slope is 1, and by making use of these values we compute the slope (gradient) of the cost curve at that point.

Now we make use of the learning rate (which is predetermined): we multiply the result (i.e., the derivative w.r.t. the intercept) by the learning rate to obtain the step size.

By making use of the step size, we calculate the new intercept:

New intercept = old intercept – step size

We then feed the new intercept value into equation 1 (the derivative with respect to the intercept) and recompute the intercept again. This process repeats until the step size is very close to zero. Because the gradient shrinks as we approach the minimum, the step size shrinks with it: the algorithm takes bigger steps when it is far from the minimum and smaller steps when it is near it.

Similarly, the same process is repeated with the derivative w.r.t. the slope (equation 2).
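Putting the pieces above together, the whole procedure can be sketched end to end: start with intercept 0 and slope 1, repeatedly step both parameters against their gradients, and stop when the step sizes are close to zero. The data, learning rate, and stopping threshold below are illustrative choices, not values from the article:

```python
# Data generated from the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
n = len(xs)

m, c = 1.0, 0.0          # initial slope and intercept, as in the text
learning_rate = 0.01     # predetermined learning rate

for _ in range(20000):
    residuals = [y - (m * x + c) for x, y in zip(xs, ys)]
    dE_dc = (-2 / n) * sum(residuals)              # equation 1
    dE_dm = (-2 / n) * sum(x * r for x, r in zip(xs, residuals))  # equation 2
    step_m = learning_rate * dE_dm                 # step size = rate * gradient
    step_c = learning_rate * dE_dc
    m, c = m - step_m, c - step_c                  # new value = old - step size
    if abs(step_m) < 1e-9 and abs(step_c) < 1e-9:
        break                                      # steps near zero: converged

print(round(m, 3), round(c, 3))  # 2.0 1.0
```

Notice that the loop never evaluates a grid of candidate lines; each iteration uses only the gradient at the current point, which is what makes this far cheaper than the brute-force least-squares search.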

This was a basic overview of gradient descent. Since our primary focus is on deep learning with Keras, all these functionalities are already built into it. But if you want to know more about gradient descent, visit the links in the references.


  1. Understanding the Mathematics Behind Gradient Descent
  2. Gradient Descent, Step-by-Step
  3. https://youtu.be/PaFPbb66DxQ