ResNet: The Most Popular Network in the Computer-Vision Era

Source: Deep Learning on Medium

Classifying images with a computer algorithm seems challenging. Astonishingly, recent work in computer vision achieves a 1.3% top-5 error on the ImageNet dataset. In 2020, the state of the art in image classification moved to EfficientNet, published by the Google Research team. However, ResNet performed well in image classification for a long period, and many researchers have used it as a backbone network to improve their own results. This article will help you understand what ResNet is and the intuition that motivated it.


Degradation Problem

Deep neural networks suffer from many difficulties during training. Computer-vision researchers have proposed solutions to some of them, such as addressing vanishing/exploding gradients with Batch Normalization. The ResNet paper introduces another challenging problem, named the “Degradation Problem.” Before reading on, let’s first think about the question below.

More layers, better accuracy?

It seems quite intuitive that adding layers to a network enlarges the range of functions it can express. If every added layer is an identity mapping, the new network outputs the same value as the original network. Thus, it is tempting to conclude that adding layers to a well-trained network can only keep or raise its classification accuracy. Unfortunately, that is not what happens in practice.
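The construction argument above can be sketched in a few lines of plain Python. This is only a toy illustration: the "network" is a list of functions, and every name in it (`run`, `scale_and_shift`, `identity`, and so on) is made up for this example rather than taken from the paper.

```python
# Toy illustration: a "network" is a list of layer functions applied in order.
# All names here are hypothetical and not taken from the ResNet paper.

def run(layers, x):
    """Apply each layer in sequence to the input x."""
    for layer in layers:
        x = layer(x)
    return x

def scale_and_shift(x):
    """Stand-in for a trained layer: an arbitrary fixed transformation."""
    return [2.0 * v + 1.0 for v in x]

def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in x]

def identity(x):
    """An added layer that passes its input through unchanged."""
    return x

shallow_net = [scale_and_shift, relu]
# The deeper network is the shallow one plus three identity layers.
deeper_net = shallow_net + [identity, identity, identity]

sample = [-1.5, 0.0, 2.0]
shallow_out = run(shallow_net, sample)
deeper_out = run(deeper_net, sample)
print(shallow_out == deeper_out)  # the extra identity layers change nothing
```

So a deeper network that matches the shallower one always exists by construction. The open question is whether training actually finds it.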

When you measure the accuracy of plain networks (those without shortcut connections, as used before ResNet), accuracy saturates and then degrades rapidly as depth increases. This is the Degradation Problem. It is not caused by overfitting: the training error itself goes up as layers are added. The authors conjecture that plain networks have difficulty approximating identity mappings with stacked nonlinear layers; thus, adding layers no longer guarantees that the deeper network can in practice reach everything the shallower one could express. The motivation of ResNet is to build a network in which identity mappings are easy to learn.


To make identity mappings easy to learn, the authors use a method named Shortcut-Connection. The main intuition is that, rather than having a stack of layers learn the desired mapping directly, the layers learn a residual function F(x) and the block outputs F(x) + x. This makes an identity mapping easier to learn: if the layer weights are all driven to 0, the block produces an identity mapping instead of a zero mapping. Moreover, the addition is differentiable, so the whole network remains end-to-end trainable.
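A minimal sketch of this zero-weight argument, in plain Python on small vectors. It is a simplification, not the paper's implementation: the block uses two tiny fully connected layers instead of convolutions, it leaves out the ReLU that the paper applies after the addition, and the helper names (`matvec`, `residual_branch`, `plain_block`, `residual_block`) are hypothetical.

```python
# Toy comparison of a plain block and a residual block in pure Python.
# Helper names are hypothetical; real ResNet blocks use convolutions.

def matvec(weights, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in x]

def residual_branch(w1, w2, x):
    """F(x): two linear layers with a ReLU in between."""
    return matvec(w2, relu(matvec(w1, x)))

def plain_block(w1, w2, x):
    """Plain block: the output is F(x) alone."""
    return residual_branch(w1, w2, x)

def residual_block(w1, w2, x):
    """Residual block: the output is F(x) + x (shortcut connection)."""
    fx = residual_branch(w1, w2, x)
    return [f + v for f, v in zip(fx, x)]

# With all weights set to zero, F(x) is the zero vector.
zeros = [[0.0] * 3 for _ in range(3)]
x = [0.5, -1.0, 2.0]

plain_out = plain_block(zeros, zeros, x)        # zero mapping
residual_out = residual_block(zeros, zeros, x)  # identity mapping
print(plain_out, residual_out)
```

With zero weights the plain block discards its input, while the residual block passes it through unchanged, which is the starting point the shortcut is meant to provide.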

Another consideration for Shortcut-Connection is whether to replace the identity with a projection. Since the dimensions can differ between the two ends of a shortcut, the authors compare three options: A) zero-padding for the increased dimensions, B) projection shortcuts only where the dimensions change, and C) projections for all shortcuts. The table below evaluates each case. (The A, B, and C after ResNet-34 mean that option A, B, or C was applied to ResNet-34.)

Focus on the second row-box

The results reveal that using projections instead of identity shortcuts has no serious impact on performance. Projections also change the number of parameters, which makes the comparison with plain networks harder. Thus, the authors simply used identity mapping in the network.
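To make options A and B concrete, here is a rough sketch on plain vectors. It is an illustration under simplifying assumptions: the real network applies these shortcuts to feature maps, the projection matrix would be learned rather than written by hand, and the function names and example values are hypothetical.

```python
# Toy versions of shortcut options A and B for a dimension increase.
# Function names and the example projection matrix are hypothetical.

def zero_pad_shortcut(x, out_dim):
    """Option A: keep x and append zeros for the extra dimensions (no parameters)."""
    return list(x) + [0.0] * (out_dim - len(x))

def projection_shortcut(x, proj):
    """Option B: map x to the new dimension with a matrix (learned in practice)."""
    return [sum(w * v for w, v in zip(row, x)) for row in proj]

def add_shortcut(fx, shortcut):
    """Element-wise F(x) + shortcut(x); both must have the same length."""
    if len(fx) != len(shortcut):
        raise ValueError("dimension mismatch between F(x) and the shortcut")
    return [f + s for f, s in zip(fx, shortcut)]

x = [1.0, 2.0]             # 2-dimensional input
fx = [0.5, 0.5, 0.5, 0.5]  # pretend F(x) has 4 dimensions
proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]  # 4x2 example matrix

out_a = add_shortcut(fx, zero_pad_shortcut(x, 4))
out_b = add_shortcut(fx, projection_shortcut(x, proj))
print(out_a)
print(out_b)
```

Option A adds no parameters at all, while option B adds one matrix per dimension change. Option C would apply a projection like this on every shortcut, including those where the dimensions already match.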

Overall Backbone

For the detailed structure of the network, refer to the paper.



They compared two networks: the plain network and ResNet. The two networks use the same layers; however, only ResNet has Shortcut-Connections. They experimented on two datasets: ImageNet and CIFAR-10. The graphs below show the results of the experiments.

(The thin curves denote training errors, and the bold curves denote validation errors)

Performance of Plain-Network on ImageNet

As you can see from the graph, the training error increased as the number of layers increased. This means that the plain network suffers from the degradation problem. How about ResNet?

Performance of ResNet on ImageNet

No more degradation problem. As the number of layers increases, the training error decreases.

Result of experiment on ImageNet

The authors added more layers to ResNet to build deeper models. As expected, increasing the number of layers improved the performance. The trend was similar when the experiment was run on CIFAR-10.

Result of experiment on CIFAR-10

However, we can observe that with 1202 layers, performance drops significantly. The paper argues that this is due to overfitting. Even with this significant drop, the network still outperforms the original methods.


ResNet was motivated by the degradation problem. Following an intuitive approach, the authors designed the network so that identity mappings are easy to approximate. The experiments show that ResNet addresses the degradation problem very well, but that it works poorly for extremely deep networks.

I appreciate any feedback on my articles, and I welcome you to mail me for any discussion. If anything is wrong or misunderstood, please tell me. 🙂

Contact me: