Demystifying Backpropagation

Source: Deep Learning on Medium

Now, in order to perform backpropagation, we need to calculate the derivative (rate of change) of our error function (j).

Since we cannot change the input (x) or the expected output (y), we will minimize our error by changing the weight (w).

Our expression for the rate of change in (j) with respect to (w) will be:

dj/dw = d/dw (ŷ - y)²

Let’s calculate our first derivative.

To calculate it, we have to use the chain rule of differentiation.

Chain rule: d/dw f(g(w)) = f'(g(w)) . g'(w)

The chain rule tells us how to take the derivative of a function inside another function, and the power rule tells us to bring down our exponent 2 and multiply by it.

Note: Here we take the derivative of the outside function and multiply it by the derivative of the inside function (ŷ - y). Since y = 0.5 is a constant, its derivative with respect to w is 0, so we are left with dŷ/dw.

dj/dw = 2(ŷ - y) . d(ŷ)/dw

Since ŷ = x.w and x = 1.5 is a constant, the product rule gives d(ŷ)/dw = x = 1.5.

dj/dw = 2(ŷ - y) . 1.5

Substituting ŷ = 1.5w and y = 0.5 gives 2(1.5w - 0.5) . 1.5 = 4.5w - 1.5. Now that we have our equation, dj/dw = 4.5w - 1.5, let’s perform gradient descent.
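Before running gradient descent, the derivative can be sanity-checked numerically. The sketch below is only an illustration: it assumes the error function is j = (ŷ - y)² with ŷ = x.w, x = 1.5 and y = 0.5 as in this article, and the function names and the finite-difference step `eps` are hypothetical choices, not part of the original post.

```python
X = 1.5  # input used in the article
Y = 0.5  # expected output used in the article


def error(w):
    """Squared error j = (y_hat - y)^2 with y_hat = x * w."""
    return (X * w - Y) ** 2


def analytic_gradient(w):
    """dj/dw = 2 * (y_hat - y) * x, obtained with the chain rule."""
    return 2 * (X * w - Y) * X


def numeric_gradient(w, eps=1e-6):
    """Central finite difference; eps is an illustrative choice."""
    return (error(w + eps) - error(w - eps)) / (2 * eps)


w = 0.8
# All three values should agree up to floating-point rounding (about 2.1).
print(analytic_gradient(w))
print(4.5 * w - 1.5)
print(numeric_gradient(w))
```

If the three printed values agree, the simplified form 4.5w - 1.5 is consistent with the chain-rule expression.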

We are going to subtract the rate of change of our error function with respect to the weight from our old weight, after multiplying it by our learning rate (lr) = 0.1.

Note: The learning rate defines how large a step we take.

Our gradient descent update rule will be:

w(new) := w(old)-lr(dj/dw)
w(new) := w(old) - 0.1(4.5w(old) - 1.5)
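As a rough sketch, a single update with this formula could be written as follows. The function names are hypothetical. The second function is just the same rule with lr = 0.1 expanded algebraically: w - 0.1(4.5w - 1.5) = 0.55w + 0.15.

```python
LR = 0.1  # learning rate from the article


def step(w_old, lr=LR):
    """One gradient descent update using dj/dw = 4.5*w - 1.5."""
    return w_old - lr * (4.5 * w_old - 1.5)


def step_simplified(w_old):
    """The same update for lr = 0.1 after expanding: 0.55*w + 0.15."""
    return 0.55 * w_old + 0.15


print(round(step(0.8), 6))             # 0.59
print(round(step_simplified(0.8), 6))  # 0.59
print(round(step(1 / 3), 6))           # 0.333333 (the weight no longer moves)
```

The expanded form also shows where the updates settle: solving w = 0.55w + 0.15 gives w = 1/3, and each step shrinks the distance to that value by a factor of 0.55.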

Now let’s calculate our new weight with that formula:

|old weight | new weight |
| 0.8 | 0.59 |
| 0.59 | 0.4745 |
| 0.4745 | 0.410975 |
| 0.410975 | 0.37603625 |
| 0.37603625 | 0.3568199375 |
| 0.3568199375 | 0.346250965625 |

If we keep repeating the update, the weight converges to about 0.333333.
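The updates in the table can be reproduced with a short loop. This is a minimal sketch under the article's setup (starting weight 0.8, lr = 0.1, dj/dw = 4.5w - 1.5). The `train` helper and the choice of 50 steps for the long run are assumptions made for illustration.

```python
def train(w, lr=0.1, steps=5):
    """Run gradient descent and return a list of (old weight, new weight) pairs."""
    history = []
    for _ in range(steps):
        w_new = w - lr * (4.5 * w - 1.5)  # w(new) := w(old) - lr * dj/dw
        history.append((w, w_new))
        w = w_new
    return history


# The first few updates, printed in the same layout as the table.
for old, new in train(0.8, steps=5):
    print(f"| {old:.10f} | {new:.10f} |")

# Running many more steps shows where the weight settles.
final = train(0.8, steps=50)[-1][1]
print(round(final, 6))        # 0.333333
print(round(1.5 * final, 6))  # 0.5
```

Each step moves the weight only part of the way, so reaching 0.333333 takes many more updates than the handful printed here.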

We have successfully trained our neural network.

The optimal weight for our network is approximately 0.333 (exactly 1/3).

1.5 * 0.333 ≈ 0.5

|input(x) | expected output(y) | output(ŷ) |
| 1.5 | 0.5 | 0.5 |

Final Words

This is how backpropagation works under the hood to find the optimal weights in artificial neural networks.

We have seen the process for a very simple network without any activation function or bias. In real-world networks the process is the same but more complex, since we have to propagate backward through many hidden layers with activation functions and biases.

For interviews and for expressing your understanding, this much explanation is sufficient.
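To give a feel for how an activation function and a bias change the picture, here is a small sketch of a single neuron with a sigmoid activation. It goes beyond the network used in this article: the sigmoid, the starting bias of 0.0, and the 1000 training steps are all assumptions chosen for illustration. The chain rule simply gains one more factor, the derivative of the activation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w, b):
    """Return the pre-activation z = x*w + b and the output y_hat = sigmoid(z)."""
    z = x * w + b
    return z, sigmoid(z)


def gradients(x, y, w, b):
    """Gradients of j = (y_hat - y)^2 with respect to w and b via the chain rule."""
    _, y_hat = forward(x, w, b)
    dj_dyhat = 2 * (y_hat - y)       # derivative of the outside function
    dyhat_dz = y_hat * (1 - y_hat)   # derivative of the sigmoid
    dj_dz = dj_dyhat * dyhat_dz
    return dj_dz * x, dj_dz          # dz/dw = x, dz/db = 1


x, y = 1.5, 0.5             # input and expected output from the article
w, b, lr = 0.8, 0.0, 0.1    # starting bias 0.0 is an assumption
for _ in range(1000):       # number of steps is an illustrative choice
    dw, db = gradients(x, y, w, b)
    w -= lr * dw
    b -= lr * db

_, y_hat = forward(x, w, b)
print(round(y_hat, 3))  # close to the expected output 0.5
```

With hidden layers, the same multiplication of local derivatives is repeated layer by layer, moving from the output back toward the input.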

Not your cup of tea?

If this explanation does not click for you, or you want to learn more, you can follow this tutorial series by 3Blue1Brown:

He also explains the step-by-step process with nice visual representations.

Or, this video series by Welch Labs:

Or, if you want to learn it from a different perspective, here is a nice explanation from Brandon: