Understanding and implementation of Residual Networks(ResNets)

Source: Deep Learning on Medium

Residual learning framework to ease the training of networks that are substantially deeper than those used previously.

This article is primarily based on the research paper “Deep Residual Learning for Image Recognition” published by Microsoft Research [Link to the research paper] and on the Convolutional Neural Networks course by Andrew Ng.

The problem of very deep neural networks:

In recent years, neural networks have become deeper, with state-of-the-art networks going from just a few layers (e.g., AlexNet) to over a hundred layers.

  • One of the major benefits of a very deep network is that it can represent very complex functions.
  • However, a huge barrier to training them is vanishing gradients: very deep networks often have a gradient signal that goes to zero quickly, thus making gradient descent prohibitively slow.
  • More specifically, during gradient descent, as we backpropagate from the final layer to the first, we multiply by the weight matrix at each step. If the gradients are small, this large number of multiplications can shrink the gradient exponentially quickly to zero (or, in rare cases, grow it exponentially so that it “explodes” to very large values).
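To see how quickly repeated multiplication shrinks the gradient, here is a toy calculation. This is a deliberately simplified model (real backpropagation multiplies full Jacobians, and the per-layer factor of 0.9 is an arbitrary assumption), but the exponential behaviour is the point:

```python
# Toy model of backprop: suppose each layer scales the gradient
# by a constant factor g < 1 on the way back. The gradient that
# reaches the first layer then shrinks exponentially with depth.
def gradient_at_first_layer(depth, per_layer_factor):
    grad = 1.0
    for _ in range(depth):
        grad *= per_layer_factor
    return grad

shallow = gradient_at_first_layer(20, 0.9)   # ~0.12: still usable
deep = gradient_at_first_layer(100, 0.9)     # ~2.7e-5: effectively vanished
```

With a factor just above 1 instead, the same loop grows without bound, which is the “exploding” case.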

Regular networks like VGG-16 are called “plain” networks.

In plain networks, as the number of layers increases from 20 to 56 (as shown below), the training error gets worse: even after thousands of iterations, the 56-layer network has higher training error than the 20-layer one.

Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks. The deeper network has higher training error, and thus higher test error.

In theory, adding depth should only help, but in practice the deeper network has higher training error, and thus higher test error.

Why is this happening and how could we fix it?

When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly.

Using deeper networks degrades the performance of the model. The Microsoft Research paper addresses this problem with a deep residual learning framework.

Solution: Residual Block / Identity block

The idea is that instead of letting the stacked layers learn the underlying mapping H(x) directly, we let them fit the residual mapping F(x) := H(x) − x, which gives H(x) = F(x) + x.

The approach is to add a shortcut (or skip connection) that lets information flow more easily from one layer to a later layer: the input is carried along with the normal CNN flow and added back in a couple of layers downstream, bypassing the layers in between.
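As a rough sketch of the forward pass, here is a residual block in plain NumPy. For brevity the two weight layers of F are fully connected rather than convolutional, and all names and sizes are illustrative, not from the paper:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """Compute H(x) = relu(F(x) + x), where F(x) is a two-layer transform.

    The skip connection adds the input x back before the final
    activation, so the weight layers only have to learn the
    residual F(x) = H(x) - x instead of the full mapping H(x).
    """
    f = np.dot(relu(np.dot(x, W1)), W2)  # main path: F(x)
    return relu(f + x)                   # shortcut: add x, then activate

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))          # batch of 4 examples, 8 features
W1 = rng.standard_normal((8, 8)) * 0.1
W2 = rng.standard_normal((8, 8)) * 0.1
out = residual_block(x, W1, W2)          # same shape as x: (4, 8)
```

Note that the addition requires F(x) and x to have the same dimensions; the convolutional block described later handles the case where they differ.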

A Residual Block:

Residual learning: a building block

Two takeaways from the residual block:

  1. Adding new layers should not hurt the model’s performance: if the layers are not useful, regularisation can drive their weights towards zero, and the skip connection then lets the block collapse to the identity function.
  2. If the new layers are useful, their weights (or kernels) will be non-zero even in the presence of regularisation, and model performance can improve.

Therefore, thanks to the skip connection (residual connection), adding new layers should not decrease the model’s performance, and it may improve it slightly.

By stacking these ResNet blocks on top of each other, you can form a very deep network. Having ResNet blocks with the shortcut also makes it very easy for one of the blocks to learn an identity function. This means that you can stack on additional ResNet blocks with little risk of harming training set performance.
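A quick NumPy sketch of why learning the identity is easy (again with fully connected layers standing in for convolutions, as an illustration only): if a block’s weights are all zero, then F(x) = 0 and the block outputs relu(x), which is exactly x for the non-negative activations that follow a ReLU. So a stack of “useless” blocks passes the signal through unchanged:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    # relu(F(x) + x): if the weights are all zeros, F(x) = 0 and the
    # block returns relu(x) -- the identity for non-negative inputs.
    return relu(np.dot(relu(np.dot(x, W1)), W2) + x)

# Non-negative input, as if it came out of a previous ReLU layer.
x = np.abs(np.random.default_rng(1).standard_normal((2, 6)))
zeros = np.zeros((6, 6))

out = x
for _ in range(10):                  # stack 10 zero-weight blocks
    out = residual_block(out, zeros, zeros)

print(np.allclose(out, x))           # True: x survives the whole stack
```

A plain network has no such easy fallback: its layers would have to learn the identity mapping through their weights, which the paper argues is hard for the optimizer.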

Two main types of blocks are used in a ResNet, depending mainly on whether the input and output dimensions are the same or different.

1. The identity block — the same one we saw above. It is the standard block used in ResNets, and corresponds to the case where the input activation has the same dimensions as the output activation.

Identity block. Skip connection “skips over” 2 layers

2. The convolutional block — used when the input and output dimensions don’t match. It differs from the identity block in that there is a CONV2D layer in the shortcut path.
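The shortcut CONV2D is a 1×1 convolution, which acts only on the channel axis, so on a flattened channel vector it behaves like a plain matrix multiply. The following NumPy sketch (illustrative names and sizes, not the paper’s implementation) shows how the projection shortcut lets the two paths be added even when the block changes the number of channels:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def conv_block(x, W_main, W_shortcut):
    """Residual block whose shortcut changes the channel dimension.

    x has shape (batch, channels_in). A 1x1 convolution mixes
    channels at each spatial position independently, so per
    position it reduces to a matrix multiply on the channel axis.
    """
    main = np.dot(x, W_main)          # main path: channels_in -> channels_out
    shortcut = np.dot(x, W_shortcut)  # projection shortcut to the same size
    return relu(main + shortcut)      # dimensions now match, so we can add

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))                 # 64 input channels
W_main = rng.standard_normal((64, 256)) * 0.05
W_shortcut = rng.standard_normal((64, 256)) * 0.05
out = conv_block(x, W_main, W_shortcut)          # shape (4, 256)
```

In the real block the main path is of course several convolutions with BatchNorm, not a single multiply; only the add-after-projection structure is the point here.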

Convolutional block

Results from the paper:

Architectures for ImageNet. Building blocks are shown in brackets, with the numbers of blocks stacked.
Training on ImageNet. Thin curves denote training error, and bold curves denote validation error of the center crops. Left: Plain networks of 18 and 34 layers. Right: ResNets of 18 and 34 layers.
Top-1 error (%, 10-crop testing) on ImageNet validation

The 34-layer ResNet performed better than both the 18-layer ResNet and its 34-layer plain counterpart. So the degradation problem was addressed: with residual learning, the deeper network achieves lower error than the shallower networks, both plain and residual.

For deeper networks (50 layers and above), the authors introduced bottleneck architectures to keep the computational cost economical.
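The economy comes from the 1×1 layers: they shrink the channel count before the expensive 3×3 convolution and expand it back afterwards. A back-of-the-envelope comparison of weight counts (ignoring biases and BatchNorm; 256 channels with a 64-channel bottleneck, the sizes used in the paper’s deeper models):

```python
# Weights in a k x k convolution from c_in to c_out channels.
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

# Two plain 3x3 convolutions at 256 channels:
plain = conv_params(3, 256, 256) + conv_params(3, 256, 256)

# Bottleneck: 1x1 down to 64, 3x3 at 64, 1x1 back up to 256:
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))

print(plain, bottleneck)   # 1179648 69632
```

The bottleneck uses roughly 17× fewer weights for a block of comparable depth, which is what makes 50-, 101- and 152-layer ResNets affordable.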

Was ResNet Successful? — Yes.

  • Won 1st place in the ILSVRC 2015 classification competition with a top-5 error rate of 3.57% (an ensemble model).
  • Won 1st place in the ILSVRC and COCO 2015 competitions on the ImageNet detection, ImageNet localisation, COCO detection and COCO segmentation tasks.
  • Replacing the VGG-16 layers in Faster R-CNN with ResNet-101 produced a relative improvement of 28%.
  • Efficiently trained networks with 100 layers, and even explored networks with over 1000 layers.

Building your first ResNet model (50 layers)

You now have the necessary blocks to build a very deep ResNet. The following figure describes in detail the architecture of this neural network. “ID BLOCK” in the diagram stands for “Identity block,” and “ID BLOCK x3” means you should stack 3 identity blocks together.

The details of the above ResNet-50 model are:

  • Zero-padding pads the input with a pad of (3,3).
  • Stage 1: a 2D convolution with 64 filters of shape (7,7) and a stride of (2,2), named “conv1”. BatchNorm is applied to the channels axis of the input. MaxPooling uses a (3,3) window and a (2,2) stride.
  • Stage 2: the convolutional block uses three sets of filters of sizes [64, 64, 256], with f=3 and s=1; the block is named “a”. The 2 identity blocks use three sets of filters of sizes [64, 64, 256] with f=3; the blocks are named “b” and “c”.
  • Stage 3: the convolutional block uses three sets of filters of sizes [128, 128, 512], with f=3 and s=2; the block is named “a”. The 3 identity blocks use three sets of filters of sizes [128, 128, 512] with f=3; the blocks are named “b”, “c” and “d”.
  • Stage 4: the convolutional block uses three sets of filters of sizes [256, 256, 1024], with f=3 and s=2; the block is named “a”. The 5 identity blocks use three sets of filters of sizes [256, 256, 1024] with f=3; the blocks are named “b”, “c”, “d”, “e” and “f”.
  • Stage 5: the convolutional block uses three sets of filters of sizes [512, 512, 2048], with f=3 and s=2; the block is named “a”. The 2 identity blocks use three sets of filters of sizes [512, 512, 2048] with f=3; the blocks are named “b” and “c”.
  • The 2D average pooling uses a window of shape (2,2); its name is “avg_pool”.
  • The flatten layer doesn’t have any hyperparameters or name.
  • The fully connected (Dense) layer reduces its input to the number of classes using a softmax activation. Its name should be ‘fc‘ + str(classes).
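As a sanity check on the name “ResNet-50”, the stage layout above can be tallied in a few lines of Python. Each block (identity or convolutional) contains 3 weight layers, and each stage opens with one convolutional block:

```python
# Stage layout restated from the bullets above: filter sizes and the
# total number of blocks (1 convolutional block + the identity blocks).
stages = {
    "stage2": {"filters": [64, 64, 256],    "blocks": 3},  # conv + 2 identity
    "stage3": {"filters": [128, 128, 512],  "blocks": 4},  # conv + 3 identity
    "stage4": {"filters": [256, 256, 1024], "blocks": 6},  # conv + 5 identity
    "stage5": {"filters": [512, 512, 2048], "blocks": 3},  # conv + 2 identity
}

block_layers = sum(3 * s["blocks"] for s in stages.values())  # 48 layers
total = 1 + block_layers + 1   # conv1 + residual stages + final Dense layer
print(total)                   # 50
```

Only layers with weights are counted; pooling, flatten and the shortcut additions contribute no weight layers, which is the usual convention behind ResNet naming.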

Summary:

  • Very deep plain networks are hard to train in practice because of vanishing gradients.
  • The skip-connections help to address the Vanishing Gradient problem. They also make it easy for a ResNet block to learn an identity function.
  • There are two main types of ResNet blocks: the identity block and the convolutional block.
  • Very deep Residual Networks are built by stacking these blocks together.

References: