What’s the “Res” in “Resnet”? [Part II]


We really apologize for the delay; we know you have been anxiously waiting for the second part of this series. In the last blog, we saw that degradation leads to poor accuracy in deeper neural networks. In this blog, let us see how the authors eliminate this problem.

Paper: Deep Residual Learning for Image Recognition

The Solution

The degradation problem is addressed by the deep residual learning framework. So what is it all about?

The Deep Residual Learning technique basically proposes the following:

“Instead of hoping that a few stacked layers learn a desired unreferenced mapping x -> y, denoted by h(x), let a residual function f(x) be defined such that f(x) = h(x) - x, which can be rewritten as h(x) = f(x) + x”.

Residual function depiction from the original paper

The authors’ hypothesis is that it is easier to optimize the residual f(x) than the original unreferenced mapping h(x). In the extreme case, if the identity mapping were optimal, it would be much easier to drive the residual f(x) to zero than to fit an identity mapping with a stack of nonlinear layers. In this way, the thought experiment from the previous blog, where a 100-layer network should be able to match the accuracy of its 50-layer shallow counterpart, becomes achievable in practice.
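To make this concrete, here is a minimal toy sketch in PyTorch (ours, not from the paper): if the residual branch outputs zero, the whole block collapses to an identity mapping, which is exactly the "easy" solution the authors describe. The class name and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        # Zero-initialise the last layer so the residual branch outputs f(x) = 0,
        # making the whole block an identity: h(x) = f(x) + x = x.
        nn.init.zeros_(self.fc2.weight)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x):
        residual = self.fc2(torch.relu(self.fc1(x)))  # f(x)
        return residual + x                           # h(x) = f(x) + x

x = torch.randn(4, 8)
block = ToyResidualBlock(8)
print(torch.allclose(block(x), x))  # True: zero residual means identity mapping
```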

Types of residual connections:

1. When input and output dimensions are the same:

Here, the block computes y = F(x, {Wi}) + x, where F(x, {Wi}) represents the residual mapping to be learned. The operation F + x is performed by a shortcut connection and element-wise addition (see the sketch after this list).

2. When input and output dimensions are not the same:

Here, the shortcut becomes y = F(x, {Wi}) + Ws·x, where the projection Ws is used only to match dimensions. The paper shows that the identity mapping is sufficient for addressing the degradation problem, so Ws is needed only when dimensions differ; both cases are sketched below.
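The two cases can be sketched in a few lines of PyTorch. This is our minimal interpretation of the paper's basic building block, not the authors' released code: the class name, layer configuration, and use of a 1x1 convolution as the projection Ws are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Residual branch F(x, {Wi}): two 3x3 conv layers with batch norm.
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        if stride != 1 or in_channels != out_channels:
            # Case 2: dimensions differ, so project the shortcut with Ws (a 1x1 conv).
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            # Case 1: dimensions match, so the shortcut is a parameter-free identity.
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))  # y = F(x, {Wi}) + (Ws·)x

x = torch.randn(1, 64, 32, 32)
same_dim = BasicBlock(64, 64)        # identity shortcut
downsample = BasicBlock(64, 128, 2)  # projection shortcut
print(same_dim(x).shape, downsample(x).shape)
```

Note that with matching dimensions the shortcut adds no parameters at all, which is why identity shortcuts are preferred wherever possible.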

The above snapshot shows the top-1 error percentage for experiments carried out by the authors with 18 and 34 layers, where the 18-layer model is actually a subset of the 34-layer model. The authors argue that the higher error of the 34-layer plain net is not due to vanishing gradients: the networks are trained with batch normalization, which ensures that the forward-propagated signals have non-zero variances, and they also verify that the backward-propagated gradients exhibit healthy norms. So neither the forward nor the backward signals vanish.

On the other hand, ResNet not only outperforms the plain network, but its 34-layer model also achieves a lower error rate than the 18-layer one. This ResNet has the same parameter count as the plain model, because only parameter-free zero-padding shortcuts are used for matching dimensions. Let’s have a look at the results from more experiments conducted by the authors:

The authors compared three options for the shortcut connections in ResNet:

a) zero-padding shortcuts are used for increasing dimensions, and all shortcuts are parameter-free.

b) projection shortcuts are used for increasing dimensions, and other shortcuts are identity.

c) all shortcuts are projections.

Their experiments showed that there was hardly any difference in error between the three options, with c) doing marginally better than a) and b) because it introduces slightly more parameters, which yields a slightly better error rate. The close error rates of a), b) and c) indicate that it is residual learning itself, rather than the type of shortcut, that tackles the degradation problem. A sketch of the parameter-free shortcut from option a) is given below.
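For completeness, here is a rough sketch (our own, assuming NCHW tensors) of the parameter-free zero-padding shortcut from option a): the input is spatially downsampled by simple striding and the extra channels are filled with zeros, so no parameters are learned.

```python
import torch
import torch.nn.functional as F

def zero_padding_shortcut(x, out_channels, stride=2):
    # Downsample spatially by simple striding (no learned parameters).
    x = x[:, :, ::stride, ::stride]
    # Zero-pad the channel dimension up to out_channels.
    extra = out_channels - x.shape[1]
    return F.pad(x, (0, 0, 0, 0, 0, extra))  # pad pairs: (W, W, H, H, C_front, C_back)

x = torch.randn(1, 64, 32, 32)
print(zero_padding_shortcut(x, 128).shape)  # torch.Size([1, 128, 16, 16])
```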

While creating deeper models, one interesting observation was that ResNet-152 (11.3 billion FLOPs) is still less complex than VGG-16/19 (15.3/19.6 billion FLOPs). However, for extremely deep networks (such as the authors' 1202-layer experiment on CIFAR-10), accuracy dropped again, which the authors attribute to overfitting rather than to the degradation problem itself.

Another detail of the paper that is often overlooked is the use of augmentation at test time. Normally, we apply transformations to the training dataset to get more data, but augmenting the test images as well makes the evaluation less sensitive to the exact crop and framing of each image. For comparison studies, the authors adopt standard 10-crop testing: ten crops are taken from each test image (the four corners and the centre, plus their horizontal flips), and the network's scores are averaged over all of them; for their best results they additionally average scores over multiple image scales.
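As a rough illustration, here is what 10-crop testing can look like with PyTorch and torchvision. The `model` variable stands for any trained classifier and is an assumption of this sketch, not code from the paper.

```python
import torch
from torchvision import transforms

# Build ten 224x224 crops per test image: four corners + centre, plus horizontal flips.
ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])

def predict_with_tta(model, pil_image):
    crops = ten_crop(pil_image)    # shape: (10, 3, 224, 224)
    with torch.no_grad():
        scores = model(crops)      # shape: (10, num_classes)
    return scores.mean(dim=0)      # average the scores over all ten crops
```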

Conclusion:

ResNet was a breakthrough not just as a new model but as a new concept: residual learning is not limited to one particular use case and can be extended to different domains in future explorations.

We hope that you were able to understand how ResNet works and how it mitigates the degradation problem.

If you liked the blog, don’t forget to follow us and leave us a clap (We don’t mind more than one clap 😝)

Follow us on Twitter, Instagram, LinkedIn, Facebook and GitHub for future updates!