This loss function is known as the Mean Squared Error (MSE), and it is one of the most widely used loss functions. There are many others, such as Cross-Entropy, Hinge, and MAE. But have you ever wondered what the cost function is, and what the difference between loss and cost functions is? Well, the difference is that the loss function is computed for a single training example, whereas the cost function is the average loss over the entire training dataset.
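As a quick sketch of this distinction (the function names and toy values here are my own, not from any particular library):

```python
# Mean Squared Error: loss for one example vs. cost over the whole dataset.
def mse_loss(y_true, y_pred):
    """Loss for a single training example."""
    return (y_true - y_pred) ** 2

def mse_cost(y_true_list, y_pred_list):
    """Cost: the average loss over the entire training set."""
    n = len(y_true_list)
    return sum(mse_loss(t, p) for t, p in zip(y_true_list, y_pred_list)) / n

print(mse_loss(1.0, 0.7))               # loss for one example
print(mse_cost([1.0, 0.0], [0.7, 0.2]))  # cost over two examples
```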

Congratulations! We are done with forward propagation. However, the prediction we just made may not be very accurate (consider an expected output of 1, but the model predicted 0.7). How can we make a better prediction? Well, we cannot change the input value, but we can change the weight! Voilà. Now you know the secret of deep learning.

Back Propagation
We cannot change the input. However, we can increase the weight, and multiplying a larger weight by the input gives a larger predicted output, say 0.8. Keep repeating this process (adjusting the weights) and we will get better results. Propagating in the opposite direction in order to adjust the weights is known as backpropagation! But how exactly can this be done? Well, it is done using optimization algorithms. There are different optimizers such as Gradient Descent, Stochastic Gradient Descent, the Adam optimizer, etc.

Gradient Descent
Gradient Descent is one of the optimization algorithms that aim to reduce the loss by adjusting the weights. Of course, changing the weights manually is impossible (we have tens or even hundreds of weights in a single neural network). So how can we automate this process? And how do we tell the algorithm which weight to adjust and when to stop?

Let us start adjusting the weights, check how that affects the loss, and plot all the results (see the plot below). As you can see, at a specific point (the red line) the loss is at its minimum. To the left of the red line, we have to increase the weight to decrease the loss, whereas to the right of the red line, we need to decrease the weight in order to reduce the loss. The main questions remain: how do we know, for a given point, whether it is to the left or to the right of the red line (so that we know whether to increase or decrease the weight)? And by how much should we increase or decrease the weight in order to get closer to the minimum loss? Once we answer these questions, we can reduce the loss and get better accuracy.
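You can reproduce that weight-vs-loss curve numerically. Here is a toy sketch (the input, target, and weight grid are made-up values) for a one-neuron model `y = w * x`:

```python
# Sweep candidate weights for y = w * x and record the loss at each one.
x, y_true = 2.0, 1.0                         # toy input and target
weights = [i / 10 for i in range(-10, 21)]   # candidate weights -1.0 .. 2.0
losses = [(y_true - w * x) ** 2 for w in weights]

# The weight with the lowest loss is the "red line" of the plot.
best = weights[losses.index(min(losses))]
print(best)
```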

Please note that in a simple 2D case like this it is easy to reach the minimum point quickly. However, most deep learning models deal with much higher dimensions.

Luckily, there is a tool in math that answers these questions: derivatives (the thing you ignored in high school 🙂 ). Using the derivative of the loss with respect to the weight, we can calculate the instantaneous rate of change, i.e., the slope of the tangent line to the loss curve at a given point.

If you are not familiar with derivatives and the previous sentence sounded like gibberish, just think of the derivative as a way to measure the slope and direction of a line that touches the graph at a specific point. Based on the slope's direction at a given point, we can tell whether the point lies to the left or to the right of the red line.
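To make that concrete, here is a sketch using a numerical (finite-difference) slope instead of an analytic derivative, on the same toy model as before:

```python
# Approximate dLoss/dw numerically: the SIGN of the slope tells us which
# side of the minimum we are on, i.e., whether to increase or decrease w.
def loss(w, x=2.0, y_true=1.0):
    return (y_true - w * x) ** 2

def slope(w, eps=1e-6):
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(slope(0.2))   # negative slope: left of the minimum, so increase w
print(slope(0.9))   # positive slope: right of the minimum, so decrease w
```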

In the above figure, you can see that the right side of tangent line number 1 points up, which means it has a positive slope, whereas the rest of the lines point down, i.e., negative slopes. Also, note that the gradient of slope number 2 is bigger than the gradient of slope number 6. That is how we know by how much to update the weight: the bigger the gradient, the further the point is from the minimum.

Learning Rate
After determining whether the point lies to the left or to the right of the red line, we need to increase or decrease the weight in order to reduce the loss. But, say for a given point to the left of the red line, by how much should we increase the weight? Note that if you increase the weight too much, the point may pass the minimum loss and land on the other side. Then you have to decrease the weight again, and so on. Adjusting the weight randomly like this is not efficient. Instead, we introduce a variable called the “learning rate”, denoted by “η”, to control how much the weights are adjusted. Generally, you start with a small learning rate to avoid overshooting the minimum loss. Why is it called the learning rate? Well, the process of reducing the loss and making better predictions is essentially how the model learns, and the learning rate is what controls how fast the model learns.
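A toy demonstration of why the learning rate matters (the loss function and η values here are made up for illustration):

```python
# Toy loss: loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
def step(w, lr):
    return w - lr * 2 * (w - 3)   # one gradient-descent update

w_small, w_large = 0.0, 0.0
for _ in range(20):
    w_small = step(w_small, lr=0.1)   # small η: creeps toward the minimum at 3
    w_large = step(w_large, lr=1.1)   # large η: overshoots past 3 every step

print(w_small)   # close to 3
print(w_large)   # bounces further and further away from 3
```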

Adjusting the Weights
Finally, we multiply the slope by the learning rate and subtract this amount from the old weight in order to get the new weight. You will understand it better by looking at the equation below.
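That update rule, new weight = old weight − η · slope, can be sketched as code. Here it is applied repeatedly to the same toy one-weight model `y = w * x` used earlier (values are made up):

```python
# Repeated weight updates: w_new = w_old - learning_rate * slope.
x, y_true = 2.0, 1.0
w, lr = 0.0, 0.1

for _ in range(50):
    y_pred = w * x
    grad = -2 * x * (y_true - y_pred)   # dLoss/dw for loss = (y_true - y_pred)^2
    w = w - lr * grad                   # the weight-update equation

print(w)   # approaches 0.5, where the loss is minimal
```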

Stochastic Gradient Descent
While standard gradient descent computes the gradient over the entire dataset at each iteration (which is why it is also called batch gradient descent), SGD uses a single training example at each iteration, so each update is much cheaper and SGD typically reaches convergence much faster. Mini-batch gradient descent is a middle ground that uses a small batch of training examples at each iteration.
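The contrast can be sketched on a toy dataset (the data, learning rate, and iteration counts are arbitrary choices for illustration):

```python
import random

# Toy dataset for y ≈ w * x, where the true weight is 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
lr = 0.05

def grad(w, x, y):
    return -2 * x * (y - w * x)   # dLoss/dw for a single example

# Batch gradient descent: average the gradient over ALL examples per update.
w_gd = 0.0
for _ in range(100):
    w_gd -= lr * sum(grad(w_gd, x, y) for x, y in data) / len(data)

# Stochastic gradient descent: one randomly chosen example per update.
random.seed(0)
w_sgd = 0.0
for _ in range(100):
    x, y = random.choice(data)
    w_sgd -= lr * grad(w_sgd, x, y)

print(w_gd, w_sgd)   # both approach the true weight 2
```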

Overfitting
When a neural network is very deep, it has a great many weights and biases, and such networks tend to overfit their training data. In other words, the model becomes very accurate on the specific data it was trained on but fails to generalize: it scores high on the training data while scoring very low on the test data. Dropout is one of the solutions.

Dropout
A simple yet efficient way to avoid overfitting is to use dropout. Each layer is assigned a dropout ratio, meaning that a corresponding fraction of its neurons is deactivated. These neurons are chosen randomly and turned off during that specific iteration. In the next iteration, another randomly picked set of neurons is deactivated, and so on. This helps the model generalize rather than memorize specific features.
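Here is a minimal sketch of the idea (this is the common “inverted dropout” variant, where surviving activations are rescaled during training so nothing needs to change at test time; the function and values are my own illustration):

```python
import random

# Inverted dropout: randomly zero out activations during training and
# rescale the survivors so their expected value stays the same.
def dropout(activations, rate, training=True):
    if not training or rate == 0.0:
        return list(activations)   # at test time, no neurons are dropped
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)
layer_out = [0.5, 1.2, -0.3, 0.8, 0.1]
print(dropout(layer_out, rate=0.4))                  # a new random subset each call
print(dropout(layer_out, rate=0.4, training=False))  # unchanged at test time
```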

I hope this post was helpful. Please let me know if you have any questions!

Resources