Source: Deep Learning on Medium
A Quick Tour to Cost Function, Gradient Descent, and Back-Propagation
Gradient descent is an algorithm used to optimize a convex function, or in machine learning terms, to minimize the cost function. Back-propagation, in turn, is the method for computing the gradients of that cost function with respect to the network's weights and biases, which gradient descent then uses to update them and drive the cost down. In this blog, we will discuss the Cost Function, Gradient Descent, and Back-Propagation. Let's begin.
- What is cost Function?
- Intuition Behind Gradient Descent
- Backpropagation in Neural Networks
- Code for Computing Gradient Descent and Backpropagation
What is the Cost Function and How is it Calculated?
I am assuming you have some knowledge of artificial neural network structure and a bit of linear algebra. A cost function, in simple words, is the sum of squared differences between the model's output and the desired output. Let's say we are applying our neural network to millions of images; those images obviously contain pixel values. Assume the model produces some predicted labels along with their corresponding actual labels.
The cost function for the above data would be:
C = (0.05 − 0)² + (0.03 − 0)² + (0.78 − 1)² + (0.92 − 0)² + (0.22 − 0)²
C ≈ 0.95
The smaller the cost function, the better our model performs on the training data. A low training cost does not by itself guarantee the same performance on data the model has never seen before, but it is the quantity we directly optimize.
The cost function takes all the inputs, sometimes millions of parameters, and produces a single value that tells us how much improvement our model needs. It acts as a guide, telling the model that it is performing poorly and that its weights and biases need modification. But as we all know, telling the model how it is performing is not enough; we also have to give it a method for minimizing the error, and those methods are gradient descent and back-propagation.
Wrapping the cost function in a simple formula:
C = Σ (y − y′)²
Here y is the predicted output and y′ is the actual output.
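As a quick check, the example cost above can be computed in a few lines of Python (the numbers are the example predictions and labels used earlier):

```python
# Sum of squared differences between predicted and actual labels.
predicted = [0.05, 0.03, 0.78, 0.92, 0.22]
actual = [0, 0, 1, 0, 0]

cost = sum((y - y_true) ** 2 for y, y_true in zip(predicted, actual))
print(round(cost, 2))  # 0.95
```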
The Intuition behind Gradient Descent
Computing the loss and then reducing it is one of the most important jobs in a neural network. We reduce the loss function using a very intuitive algorithm known as Gradient Descent, which repeatedly steps the parameters in the direction that decreases the error; stated mathematically, it optimizes a convex function.
Let's see an overview of how Gradient Descent works:
Steps for Gradient Descent
- Start with a random θ
- Update θ in the direction of decreasing gradient (slope): θ := θ − η ∇C(θ)
- Recompute the gradient and repeat
Here η is the learning rate. We repeat step 2 until we reach a local minimum: a point from which no small step in any direction decreases the cost further.
Let's talk about the learning rate:
The learning rate is simply the size of the step taken toward the local minimum, and the step size matters a lot. A large step makes progress fast because it covers more ground, but since the slope is constantly changing, there is a high risk of overshooting the minimum. A smaller step size is slower but safer.
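The overshooting risk is easy to demonstrate on the simplest convex function, f(x) = x², whose gradient is 2x (the function and learning rates here are illustrative, not from the post):

```python
# Gradient descent on f(x) = x**2 with two learning rates,
# showing why an overly large step overshoots and diverges.
def descend(lr, steps=20, x=1.0):
    for _ in range(steps):
        x = x - lr * 2 * x  # step against the gradient 2*x
    return x

print(descend(lr=0.1))  # shrinks toward the minimum at x = 0
print(descend(lr=1.1))  # each step overshoots; x grows without bound
```

With lr = 0.1 each step multiplies x by 0.8, so it decays; with lr = 1.1 each step multiplies x by −1.2, so it bounces across the minimum with growing amplitude.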
How to calculate Gradients:-
Did you ever wonder how we compute these gradients, and most importantly, what gradients are? Let me clear this up: the gradient is the vector of partial derivatives of the cost function. It points in the direction of steepest increase of the cost, so moving against it takes us toward a local minimum.
Let us consider a cost function f(w, b), where w is the weight and b is the bias. The gradient of this cost function is:
∇f(w, b) = [∂f/∂w, ∂f/∂b]
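To make this concrete, here is a small sketch for a hypothetical cost f(w, b): the mean squared error of a linear model w*x + b on a tiny toy dataset (the data and function are illustrative assumptions, not from the post):

```python
# Toy data generated by y = 2x, so the cost is minimized at w=2, b=0.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def gradient(w, b):
    # Partial derivatives of mean((w*x + b - y)**2) w.r.t. w and b.
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    df_dw = sum(2 * e * x for e, x in zip(errors, xs)) / len(xs)
    df_db = sum(2 * e for e in errors) / len(xs)
    return [df_dw, df_db]

print(gradient(2.0, 0.0))  # [0.0, 0.0]: the gradient vanishes at the minimum
print(gradient(0.0, 0.0))  # both partials negative: step w and b upward
```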
For a multivariate problem, see the figure for visualization. θ₀ is the global minimum in the above diagram. The formula for updating our parameters is:
θ := θ − η ∇f(θ)
Code for Computing Gradient Descent and Back-propagation
Below is a code illustration of the Gradient Descent algorithm in Python.
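Here is a minimal sketch, assuming the same illustrative setup as before: gradient descent fitting w and b of a linear model y = w*x + b to toy data (the dataset, learning rate, and iteration count are assumptions for demonstration):

```python
# Toy data generated by y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0  # arbitrary initial parameters
lr = 0.05        # learning rate (eta)

for _ in range(2000):
    # Partial derivatives of the mean squared error cost.
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    dw = sum(2 * e * x for e, x in zip(errors, xs)) / len(xs)
    db = sum(2 * e for e in errors) / len(xs)
    # Step against the gradient.
    w -= lr * dw
    b -= lr * db

print(round(w, 2), round(b, 2))  # approaches w = 2, b = 1
```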
Back-propagation in Neural Networks
Let's revise all the steps again:
Step 1: Find the error
Step 2: Minimize the error
Step 3: Update the weights if the error is large
Step 4: Test our model
We find the amount of error using the cost function, and with its help we compute gradients that lead us to a local minimum; in multivariate problems, though, our real goal is the global minimum.
Now we use back-propagation, which applies gradient descent to update the weights or parameters at every node. Initially we take random weights, and with those weights our model will surely perform badly, but they give us a starting point: we compute the loss, update the weights, and then check the model's accuracy again. Back-propagation is an extremely powerful technique and the standard way to train neural networks.
Here x1 and x2 are the input values; we have taken only one hidden layer, and y1 and y2 are the outputs. Biases b1 and b2 are added, and the weights are initially completely random. After computing the loss, we minimize the error by updating the weights; each weight can be updated positively or negatively, depending on the cost function and gradients as discussed above.
First the network propagates in the forward direction, known as forward propagation, computing the difference between the predicted labels and the actual labels to obtain the error. Then the network propagates backward, known as back-propagation, to update the weights and biases so that the predicted labels match the actual labels with high accuracy. We repeat these two steps until the model is trained well enough on the training set to give good results on testing or unseen data as well.
Below is a code sketch for implementing back-propagation in Python.
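This is a minimal sketch of the network described above: two inputs (x1, x2), one hidden layer, two outputs (y1, y2), with biases b1 and b2. The sigmoid activation, layer sizes, input values, targets, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.05, 0.10])     # inputs x1, x2
target = np.array([0.0, 1.0])  # desired outputs y1, y2

W1 = rng.normal(size=(2, 2)); b1 = np.zeros(2)  # random initial weights
W2 = rng.normal(size=(2, 2)); b2 = np.zeros(2)
lr = 0.5

for _ in range(5000):
    # Forward propagation.
    h = sigmoid(W1 @ x + b1)   # hidden activations
    y = sigmoid(W2 @ h + b2)   # predicted outputs
    # Backward propagation of the squared-error cost C = sum((y - t)**2).
    delta2 = 2 * (y - target) * y * (1 - y)  # dC/d(output pre-activation)
    delta1 = (W2.T @ delta2) * h * (1 - h)   # chain rule back to the hidden layer
    # Gradient-descent updates for weights and biases.
    W2 -= lr * np.outer(delta2, h); b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x); b1 -= lr * delta1

print(sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2))  # close to [0, 1]
```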
We have learned about the computations that run neural networks and deep learning. These algorithms are extremely powerful and can fairly be called the backbone of the ideas behind neural networks and deep learning. For more blogs on analytics and new technologies, do read Analytics Steps.