Source: Deep Learning on Medium

The best line is chosen by computing the *error* or *deviations* (the differences) that the approximated values (the values predicted by the chosen line) make from the actual values. In other words, we calculate the sum of squared errors (the sum of squared differences between the approximated values and the real values). **Then the line that produces the least squared-error value is chosen as the best fit.**

Now let’s consider the real machine learning scenario: here we give each data point as input to the machine learning algorithm and, as a result, it predicts some output. We then evaluate the *error* of each output by taking the difference between the actual output and the predicted output. We use the term **loss function** for the error computed on a single training example. The average of the squared losses (i.e. the sum of squared errors divided by the number of data points) is called the **cost function**.

*Once we compute the average of the squared errors (i.e. the cost function), our job is to find the line parameters that produce the minimum cost (i.e. we need the slope (m) and intercept (c) values that make the error minimum).*
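The cost function described above can be sketched in a few lines of Python. This is a minimal illustration, not any library's official API; the function name `cost` and the sample data (points lying exactly on y = 2x + 1) are assumptions chosen for clarity.

```python
# Mean squared error of a candidate line y = m*x + c against data points.
def cost(m, c, xs, ys):
    """Average of squared differences between actual and predicted y."""
    n = len(xs)
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys)) / n

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]              # these points lie exactly on y = 2x + 1

print(cost(2.0, 1.0, xs, ys))  # the perfect line gives cost 0.0
print(cost(1.0, 0.0, xs, ys))  # a worse line gives a larger cost
```

Lower cost means a better-fitting line, which is exactly the criterion used to pick the best fit.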

## Getting Deeper into the concept

We know that **y = slope·x + intercept** (i.e. y = mx + c), and we also know that the error equals

E = (1/n)·[(Y₁ − Y₁′)² + (Y₂ − Y₂′)² + … + (Yₙ − Yₙ′)²]

Rewriting the equation we get,

E = (1/n)·[(Y₁ − (mx₁ + c))² + (Y₂ − (mx₂ + c))² + … + (Yₙ − (mxₙ + c))²]

Now the algorithm iteratively tries different values for **m** and **c** and attempts to find the values of **m** and **c** for which E is minimum. This method of finding the best values for m and c is called “*least squares*”.

We can converge to the minimum squared-error value by this method, but we need to compute a whole lot of m and c values to get there. Even though this computation is performed by the computer, it is not an efficient way to find the minimum error value. **Here comes the importance of an optimization technique called gradient descent.**
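To make the inefficiency concrete, here is a hedged sketch of the brute-force search described above: try many (m, c) pairs on a grid and keep the pair with the lowest error. The grid ranges, step size, and sample data are illustrative assumptions.

```python
# Exhaustive grid search over candidate (m, c) pairs.
def error(m, c, xs, ys):
    n = len(xs)
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys)) / n

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]               # generated from y = 2x + 1

best = None
for i in range(-50, 51):        # m from -5.0 to 5.0 in steps of 0.1
    for j in range(-50, 51):    # c from -5.0 to 5.0 in steps of 0.1
        m, c = i / 10, j / 10
        e = error(m, c, xs, ys)
        if best is None or e < best[0]:
            best = (e, m, c)

print(best)  # finds (0.0, 2.0, 1.0), but only after 101 * 101 evaluations
```

Even this coarse grid requires over ten thousand error evaluations, and a finer grid or more parameters makes the count explode. Gradient descent avoids this by using the slope of the error surface to walk directly toward the minimum.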

## Gradient Descent

Gradient descent is an optimization algorithm that helps us avoid a lot of computation when minimizing the squared error. The algorithm takes big steps when it is far from the minimum and little steps when it is near the minimum. This is achieved by making use of something called the *learning rate*.

Note: The size of the steps taken to reach the minimum (the bottom of the curve) is governed by the Learning Rate.
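The behavior in the note above can be seen on a toy function. This sketch uses f(x) = x² (an assumed stand-in, not the regression cost itself): each step moves by learning rate times gradient, so the steps shrink automatically as we approach the minimum at x = 0.

```python
# Gradient descent on f(x) = x**2, whose derivative is 2*x.
learning_rate = 0.1
x = 8.0                      # start far from the minimum at x = 0
for step in range(5):
    gradient = 2 * x         # slope of f at the current point
    move = learning_rate * gradient
    x = x - move             # step downhill, opposite the gradient
    print(f"step {step}: moved {move:.3f}, now at x = {x:.3f}")
```

Notice that the printed step sizes decrease on every iteration: far from the minimum the gradient is steep, so the moves are big; near the minimum the gradient flattens out, so the moves become small.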

## Mathematics behind gradient descent!

As we have discussed, the equation for the squared error is

E = (1/n)·[(Y₁ − (mx₁ + c))² + (Y₂ − (mx₂ + c))² + … + (Yₙ − (mxₙ + c))²]

We want the values for the **intercept** and **slope** that give us the minimum sum of squared residuals.

Note: A *derivative* in calculus is calculated as the *slope* of the graph at a particular point.

Using the chain rule, we take the derivative of the function with respect to the intercept as well as the slope,
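These chain-rule derivatives of E = (1/n)·Σ(yᵢ − (mxᵢ + c))² work out to dE/dm = (−2/n)·Σ xᵢ(yᵢ − (mxᵢ + c)) and dE/dc = (−2/n)·Σ(yᵢ − (mxᵢ + c)), which can be sketched directly in code. The function name `gradients` and the sample data are assumptions for illustration.

```python
# Partial derivatives of the mean-squared-error cost with respect to
# the slope m and the intercept c.
def gradients(m, c, xs, ys):
    n = len(xs)
    dm = (-2 / n) * sum(x * (y - (m * x + c)) for x, y in zip(xs, ys))
    dc = (-2 / n) * sum(y - (m * x + c) for x, y in zip(xs, ys))
    return dm, dc

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                    # exactly y = 2x + 1

print(gradients(2.0, 1.0, xs, ys))   # zero gradients at the best-fit line
print(gradients(0.0, 0.0, xs, ys))   # nonzero gradients away from it
```

At the best-fit line both partial derivatives are zero, which is precisely the condition gradient descent converges toward: it repeatedly steps m and c opposite these gradients until they vanish.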