Source: Deep Learning on Medium
A lucid answer to the Vanishing Gradient Problem!
“Can you explain the difference between VGGNet and ResNet?” is a popular interview question in the field of AI and Machine Learning. While the answer exists on the internet, I haven’t been able to find one that is clear, concise and to the point. We will begin with what VGGNet is, what problem it encountered, and how ResNet came in to solve it.
VGG stands for Visual Geometry Group, the group of researchers at Oxford who developed this architecture. The VGG architecture consists of blocks, where each block is composed of 2D Convolution layers followed by a Max Pooling layer. The two best-known flavors of VGGNet are VGG16 and VGG19, where 16 and 19 are the number of weight layers (convolutional plus fully connected) in each of them respectively.
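To make the layer counts concrete, here is a small sketch that writes down the two configurations as listed in the VGG paper and counts their layers. The names `VGG16_CFG`, `VGG19_CFG`, `count_weight_layers` and `split_into_blocks` are illustrative choices of mine, not part of any library. Max pooling layers carry no weights, so they are not counted.

```python
# Layer configurations following the VGG paper: an integer is the number of
# output channels of a 3x3 convolution, "M" marks a 2x2 max pooling layer.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]
VGG19_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
             512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]

NUM_FC_LAYERS = 3  # both networks end with three fully connected layers


def count_weight_layers(cfg):
    """Count layers that carry weights: convolutions plus fully connected."""
    num_conv = sum(1 for item in cfg if item != "M")
    return num_conv + NUM_FC_LAYERS


def split_into_blocks(cfg):
    """Group the configuration into blocks, each closed by a max pool."""
    blocks, current = [], []
    for item in cfg:
        if item == "M":
            blocks.append(current)
            current = []
        else:
            current.append(item)
    return blocks


print(count_weight_layers(VGG16_CFG))      # 16
print(count_weight_layers(VGG19_CFG))      # 19
print(len(split_into_blocks(VGG16_CFG)))   # 5
```

Both networks have five blocks; VGG19 simply has one more convolution in each of the last three.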
In a Convolutional Neural Network (CNN), as the number of layers increases, so does the ability of the model to fit more complex functions. In principle, therefore, more layers should be better (not to be confused with a plain fully connected network, which does not necessarily perform significantly better as hidden layers are added). So now you can ask: why not use VGG20, or VGG50, or VGG100, and so on?
Well, there is a problem.
The weights of a neural network are updated using gradients computed by the backpropagation algorithm. Each update makes a small change to each weight in such a way that the loss of the model decreases. How does this happen? Each weight takes a step in the direction along which the loss decreases. This direction is the negative of the gradient of the loss with respect to that weight.
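As a minimal sketch of such an update, consider a single weight and a toy loss. The loss function, the starting value and the learning rate below are assumptions made purely for illustration.

```python
def loss(w):
    """Toy loss with its minimum at w = 3 (chosen only for illustration)."""
    return (w - 3.0) ** 2


def grad(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


learning_rate = 0.1  # hypothetical step size
w = 0.0              # hypothetical initial weight
losses = [loss(w)]

for _ in range(20):
    # Step against the gradient, i.e. in the direction that lowers the loss.
    w = w - learning_rate * grad(w)
    losses.append(loss(w))

print(w, losses[0], losses[-1])
```

Each step moves `w` a little closer to 3, and the recorded loss shrinks with every update.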
Using the chain rule, we can find this gradient for each weight. It is equal to (local gradient) x (gradient flowing from ahead), as shown in Fig. 2.
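Here is a small sketch of that product for a single node, checked against a numerical derivative. The node y = 3x and the downstream loss L = y^2 are made-up examples, not taken from any figure.

```python
def forward(x):
    y = 3.0 * x      # the node under consideration: y = 3x
    loss = y ** 2    # everything "ahead" of the node: L = y^2
    return y, loss


x = 2.0
y, loss_value = forward(x)

upstream = 2.0 * y         # gradient flowing from ahead: dL/dy
local = 3.0                # local gradient of the node: dy/dx
grad_x = local * upstream  # chain rule: dL/dx = dy/dx * dL/dy

# Independent check with a central finite difference.
eps = 1e-6
numeric = (forward(x + eps)[1] - forward(x - eps)[1]) / (2 * eps)

print(grad_x, numeric)
```

The two values agree up to floating point error, which is what the chain rule promises.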
Here comes the problem. As this gradient keeps flowing backward to the initial layers, it keeps getting multiplied by each local gradient. When these local gradients are smaller than 1 in magnitude, as they often are, the gradient becomes smaller and smaller, making the updates to the initial layers very small and increasing the training time considerably.
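To get a feel for how quickly this shrinkage happens, the sketch below multiplies the local gradient of a sigmoid activation once per layer. It is a deliberate simplification: it ignores the weights and assumes every layer sees the same pre-activation value `z`, where the sigmoid derivative is at its maximum of 0.25 for z = 0.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Local gradient of the sigmoid; it never exceeds 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)


def gradient_at_first_layer(num_layers, z=0.0):
    """Multiply one local gradient per layer, starting from a gradient of 1."""
    grad = 1.0
    for _ in range(num_layers):
        grad *= sigmoid_grad(z)
    return grad


for depth in (5, 16, 19, 50):
    print(depth, gradient_at_first_layer(depth))
```

Even in this best case for the sigmoid, the gradient reaching the first layer falls off exponentially with depth.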
We can solve our problem if the local gradient somehow became 1.
Voila! Enter ResNet.
How can the local gradient be 1, i.e., which function has a derivative that is always 1? The identity function!
So, as this gradient is backpropagated along an identity path, it does not decrease in value because the local gradient along that path is 1.
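The effect can be sketched on a scalar toy network. A plain layer computes f(x), with local gradient f'(x), while a layer with an identity path computes x + f(x), with local gradient 1 + f'(x). The choice f(x) = 0.5 * tanh(x), the depth of 30 and the input value are assumptions for illustration only; this is not the actual ResNet block.

```python
import math


def f(x, w=0.5):
    """Toy layer function; w is a hypothetical fixed weight."""
    return w * math.tanh(x)


def f_grad(x, w=0.5):
    """Derivative of the toy layer with respect to its input."""
    return w * (1.0 - math.tanh(x) ** 2)


def input_gradient(x, num_layers, residual):
    """Gradient of the final output with respect to the first input."""
    grad = 1.0
    for _ in range(num_layers):
        if residual:
            local = 1.0 + f_grad(x)  # identity path contributes the 1
            x = x + f(x)
        else:
            local = f_grad(x)        # plain layer: at most 0.5 here
            x = f(x)
        grad *= local                # chain rule across layers
    return grad


plain = input_gradient(1.0, 30, residual=False)
skip = input_gradient(1.0, 30, residual=True)
print(plain, skip)
```

In this toy setup the plain gradient is vanishingly small after 30 layers, while the gradient with the identity path never drops below 1. Note that the identity path only prevents shrinking; it does not by itself bound the gradient from above.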
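It should now be clear how the ResNet architecture, shown below, counters the vanishing gradient problem. ResNet stands for Residual Network: each block adds its input x to its output through an identity skip connection, computing F(x) + x, so the gradient always has a path whose local gradient is 1.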
These skip connections act as gradient superhighways, allowing the gradient to flow unhindered. And now you can understand why ResNet comes in flavors like ResNet50, ResNet101 and ResNet152.
I hope that this article was of benefit to you.
A. Karpathy, CS231n: Convolutional Neural Networks for Visual Recognition.
K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” ICLR, 2015.
K. He, X. Zhang, S. Ren and J. Sun, “Deep Residual Learning for Image Recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778.
draw.io for diagrams.