Backpropagation super simplified!

Source: Deep Learning on Medium

Feed Forward

Going forward in this neural network, we will compute the values for each layer.

Input:

x = z¹ = a¹

H1 Layer:

z² = w¹x + b¹ = w¹a¹ + b¹

a² = f(z²) = f(w¹a¹ + b¹)

H2 Layer:

z³ = w²a² + b²

a³ = f(z³) = f(w²a² + b²)

Output:

z⁴ = w³a³

o = f(z⁴) = f(w³a³) = f(w³(f(w²(f(w¹a¹ + b¹)+b²))))

This is known as feed forward propagation. We must do this again and again by altering our weights and biases to get close to our desired output.

Gradient Descent and Backpropagation

But, the question is how to alter the weights. Well, that’s what is the recipe is for this algorithm. For that, we need to use the Gradient Descent, and propagate backward (thus comes the name backpropagation algorithm), and alter the weights as desired.

Recall that our desired output was y, and current output is o. To evaluate the difference in our prediction, we introduce cost functions, also called the loss functions. It could be as simple as the MSE ( Mean Squared Error) or little complex as the Cross-entropy function. Here, let’s call it C, so, our cost function will be C(o,y).

Now that we have our difference in the form of a function. We can introduce calculus to play with it. We need some function that could help us in decreasing the difference between our predicted value and the actual output, which we have quantified in the form of a cost function. This is where we need gradient descent algorithm, which is used to minimize a function by continuously moving in the direction of steepest descent, which is equal to the negative of the gradient.

Let’s dig deeper into this gradient descent. The gradient of a function means, how much a function change with respect to a change in particular quantity. It is computed by taking the partial derivative of the function with respect to the particular quantity. In this case, we want to see how much our cost function changes with respect to a change in our weights and biases, thus giving an idea of how to alter the weights to get minimum possible error in our prediction.

Going back to our example:

Since, C is a function in o, which in turn is a function in w_1, we can make use of chain rule to get the above partial derivative.

Remember, o = f(z⁴) = f(w³a³) = f(w³(f(w²(f(w¹a¹ + b¹)+b²))))

(Similarly, we can compute the gradient of C w.r.t other weights.)

Here, we need to take note of few things:

The term ∂C/∂w^n is also known as the local gradient.

Similarly for bias:

Alright, now it’s the time to get the fruit for our hard work. We are ready with the amount by which we have to change the weights, i.e., ∆w_n. So, to modify our weights and biases, we need to:

Here, we introduced ϵ to have the control on the influence of the gradient descent. Now the model does the forward propagation with the new weights and biases, and then the back propagation again and again. It goes on until it minimizes our cost function to the lowest possible value according to our constraints.

Little Complex Neural Network

At the moment, you must be feeling quit incomplete, as you could argue that the model was utterly simple, and the things would be very complicated once there are many nodes int layers.

Well, not really. Things still remain the same, and the level of toughness still remains the same. All you have to do now is to get yourself familiar with writing complex indices and our good ole sigma (∑) notation. Keeping that in mind, we will go over the expression again, but for the following neural networks, which you would agree is fairly complex.