Review: ResNet — Winner of ILSVRC 2015 (Image Classification, Localization, Detection)

In this story, ResNet [1] is reviewed. ResNet can train very deep networks of up to 152 layers by learning residual functions instead of learning the desired mapping directly.

ResNet introduces the skip connection (or shortcut connection), which passes the input of a block on to its output without any modification. Skip connections make much deeper networks trainable, and ResNet became the winner of ILSVRC 2015 in image classification, detection, and localization, as well as the winner of MS COCO 2015 detection and segmentation.

ILSVRC 2015 Image Classification Ranking

ImageNet is a dataset of over 15 million labeled high-resolution images in around 22,000 categories. ILSVRC uses a subset of ImageNet with around 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images and 100,000 testing images.

What is covered

  1. Problems of Plain Network (Vanishing/Exploding Gradient)
  2. Skip / Shortcut Connection in Residual Network (ResNet)
  3. ResNet Architecture
  4. Bottleneck Design
  5. Ablation Study
  6. Comparison with State-of-the-art Approaches (Image Classification)
  7. Comparison with State-of-the-art Approaches (Object Detection)

1. Problems of Plain Network

Conventional deep learning networks such as AlexNet, ZFNet and VGGNet usually have conv layers followed by fully connected (FC) layers for the classification task, without any skip / shortcut connection. We call them plain networks here. When a plain network gets deeper (more layers are added), the problem of vanishing/exploding gradients occurs.

Vanishing / Exploding Gradients

During backpropagation, the partial derivative of the error function with respect to each weight is computed with the chain rule in each training iteration. For the “front” layers in an n-layer network, this has the effect of multiplying n small / large numbers together to compute the gradient.

When the network is deep, the product of n small numbers approaches zero (vanished).

When the network is deep, the product of n large numbers becomes too large (exploded).
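As a rough numerical illustration of the two cases above, here is a toy sketch (not taken from the paper). It assumes every layer contributes the same per-layer factor to the chain-rule product; the function name `chained_gradient` and the factors 0.5 and 1.5 are illustrative assumptions.

```python
def chained_gradient(per_layer_factor, n_layers):
    """Multiply the same per-layer derivative factor n_layers times.

    Toy stand-in for the chain rule: a real network has a different
    factor at every layer.
    """
    gradient = 1.0
    for _ in range(n_layers):
        gradient *= per_layer_factor
    return gradient


for n in (5, 20, 56):
    small = chained_gradient(0.5, n)  # factors below 1 shrink the product
    large = chained_gradient(1.5, n)  # factors above 1 blow it up
    print(f"n={n:2d}  0.5^n={small:.3e}  1.5^n={large:.3e}")
```

Even with these mild factors, the product at 56 layers is many orders of magnitude away from 1 in either direction.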

We expect a deeper network to give more accurate predictions. However, the example below shows that the 20-layer plain network has lower training error and test error than the 56-layer plain network: a degradation problem occurs due to vanishing / exploding gradients.

Plain Networks for CIFAR-10 Dataset

2. Skip / Shortcut Connection in Residual Network (ResNet)

To solve the problem of vanishing/exploding gradients, a skip / shortcut connection is used to add the input x to the output after a few weight layers, as below:

A Building Block of Residual Network

Hence, the output is H(x) = F(x) + x. The weight layers actually learn a kind of residual mapping: F(x) = H(x) - x.

Even if the gradient vanishes through the weight layers, we still have the identity x to carry the gradient back to earlier layers.
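To make the role of the identity term concrete, here is a minimal scalar sketch. It assumes a hypothetical residual function F(x) = w * x, so that a plain layer has derivative w while a residual block H(x) = F(x) + x has derivative w + 1. The function names and the weight value 0.01 are illustrative assumptions, not anything from the paper.

```python
def plain_layer_grad(w):
    # Plain layer y = w * x has derivative dy/dx = w.
    return w


def residual_block_grad(w):
    # Residual block y = F(x) + x with F(x) = w * x has dy/dx = w + 1.
    return w + 1.0


def backprop_through(grad_fn, weights):
    """Chain the per-layer derivatives from the output back to the input."""
    grad = 1.0
    for w in weights:
        grad *= grad_fn(w)
    return grad


weights = [0.01] * 20  # 20 layers whose weight-path derivative is tiny
plain = backprop_through(plain_layer_grad, weights)
residual = backprop_through(residual_block_grad, weights)
print(f"plain: {plain:.3e}, residual: {residual:.3f}")
```

In this toy setting the plain chain collapses toward zero, while the "+ 1" from the shortcut keeps the residual chain close to 1.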

3. ResNet Architecture

34-layer ResNet with Skip / Shortcut Connection (Top), 34-layer Plain Network (Middle), 19-layer VGG-19 (Bottom)

The above figure shows the ResNet architecture.

  1. VGG-19 [2] (bottom) is a state-of-the-art approach in ILSVRC 2014.
  2. The 34-layer plain network (middle) is treated as a deeper counterpart of VGG-19, i.e. with more conv layers.
  3. The 34-layer residual network (ResNet) (top) is the plain one with skip / shortcut connections added.

For ResNet, there are 3 options for the skip / shortcut connections when the input dimensions are smaller than the output dimensions.

(A) Shortcuts perform identity mapping, with extra zero padding for increasing dimensions. Thus, there are no extra parameters.

(B) Projection shortcuts are used only for increasing dimensions; the other shortcuts are identity. Extra parameters are needed.

(C) All shortcuts are projections. More extra parameters are needed than in (B).
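The difference between the zero-padding and projection shortcuts can be sketched on the channel vector at a single spatial position. This is a simplified illustration that ignores the spatial downsampling (stride) performed where dimensions change; the function names and the 64 to 128 channel example are assumptions made for illustration.

```python
def zero_pad_shortcut(x, out_channels):
    """Option (A): identity plus zero padding, no learned parameters."""
    return list(x) + [0.0] * (out_channels - len(x))


def projection_shortcut(x, weight):
    """Options (B)/(C): a 1x1 conv, which at one spatial position is a
    matrix-vector product with a learned (out_channels x in_channels) weight."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def projection_params(in_channels, out_channels):
    # Weight count of the 1x1 conv (bias ignored).
    return in_channels * out_channels


x = [1.0, 2.0]
print(zero_pad_shortcut(x, 4))
print(projection_shortcut(x, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]))
print(projection_params(64, 128))  # extra weights for one projection shortcut
```

Option (C) pays this parameter cost on every shortcut, while option (B) pays it only where the dimensions increase.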

4. Bottleneck Design

Since the network is very deep now, the time complexity is high. A bottleneck design is used to reduce the complexity as follows:

The Basic Block (Left) and The Proposed Bottleneck Design (Right)

1×1 conv layers are added to the start and end of each block, as in the figure (right). This is a technique suggested in Network In Network and GoogLeNet (Inception-v1). It turns out that 1×1 conv can reduce the number of connections (parameters) without degrading the performance of the network much. (Please visit my review if interested.)
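A quick weight count shows why the bottleneck helps. The sketch below uses the channel widths from the paper's building-block figure (64-d for the basic block, 256-d for the bottleneck) and ignores biases and batch normalization parameters; the helper name `conv_params` is an illustrative assumption.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weight count of one conv layer (biases and batch norm ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


# Basic block: two 3x3 convs on 64 channels.
basic_on_64 = 2 * conv_params(3, 64, 64)

# The same basic design applied directly to 256 channels.
basic_on_256 = 2 * conv_params(3, 256, 256)

# Bottleneck: 1x1 reduces 256 -> 64, 3x3 works on 64, 1x1 restores 64 -> 256.
bottleneck = (
    conv_params(1, 256, 64)
    + conv_params(3, 64, 64)
    + conv_params(1, 64, 256)
)

print(basic_on_64, basic_on_256, bottleneck)  # 73728 1179648 69632
```

The bottleneck handles 256-d features with roughly the same number of weights as a basic block on 64-d features, and far fewer than two 3×3 convs applied directly on 256 channels.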

With the bottleneck design, the 34-layer ResNet becomes the 50-layer ResNet. There are also deeper networks with the bottleneck design: ResNet-101 and ResNet-152. The overall architecture for all networks is as below:

The overall architecture for all networks
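The layer counts in the network names can be checked with simple arithmetic. The sketch below assumes the per-stage block counts from the paper's architecture table and counts only weighted layers: the convs inside the blocks, plus the first conv layer and the final FC layer. The names `ARCHITECTURES` and `depth` are illustrative.

```python
# (blocks per stage, conv layers per block): 2 for basic, 3 for bottleneck.
ARCHITECTURES = {
    "ResNet-18": ([2, 2, 2, 2], 2),
    "ResNet-34": ([3, 4, 6, 3], 2),
    "ResNet-50": ([3, 4, 6, 3], 3),
    "ResNet-101": ([3, 4, 23, 3], 3),
    "ResNet-152": ([3, 8, 36, 3], 3),
}


def depth(blocks_per_stage, convs_per_block):
    # +2 accounts for the first conv layer and the final FC layer.
    return sum(blocks_per_stage) * convs_per_block + 2


for name, (blocks, convs) in ARCHITECTURES.items():
    print(name, depth(blocks, convs))
```

ResNet-34 and ResNet-50 share the same block counts; replacing each 2-layer basic block with a 3-layer bottleneck block is what takes the depth from 34 to 50.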

Note that VGG-16/19 has 15.3/19.6 billion FLOPs. ResNet-152, at 11.3 billion FLOPs, still has lower complexity than VGG-16/19!

5. Ablation Study

5.1 Plain Network VS ResNet

Validation Error: 18-Layer and 34-Layer Plain Network (Left), 18-Layer and 34-Layer ResNet (right)
Top-1 Error Using 10-Crop Testing

When the plain network is used, 18-layer is better than 34-layer, due to the vanishing gradient problem.

When ResNet is used, 34-layer is better than 18-layer: the vanishing gradient problem has been solved by the skip connections.

If we compare the 18-layer plain network and the 18-layer ResNet, there is not much difference. This is because the vanishing gradient problem does not appear in shallow networks.

5.2 Other Settings

Batch Normalization (from Inception-v2) is used after each conv. 10-crop testing is used. A fully convolutional form, with the scores averaged over multiple scales {224, 256, 384, 480, 640}, is also adopted. An ensemble of 6 models is used. These techniques were used in previous deep learning frameworks. If interested, please also feel free to read my reviews.
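As a sketch of what 10-crop testing and score averaging involve: the standard 10 crops are the four corners and the center of the image, each also flipped horizontally, and the class scores of all crops are averaged. The same averaging applies across scales or ensemble members. The function names, the 256×256 image size and the 224 crop size below are illustrative assumptions, not the paper's code.

```python
def ten_crop_boxes(width, height, crop):
    """Return 10 (left, top, flipped) crops: 4 corners + center, each
    taken once as-is and once horizontally flipped."""
    right, bottom = width - crop, height - crop
    positions = [
        (0, 0), (right, 0), (0, bottom), (right, bottom),  # corners
        (right // 2, bottom // 2),                         # center
    ]
    return [(left, top, flip) for flip in (False, True) for left, top in positions]


def average_scores(score_lists):
    """Average class scores element-wise across crops, scales, or models."""
    n = len(score_lists)
    return [sum(column) / n for column in zip(*score_lists)]


boxes = ten_crop_boxes(256, 256, 224)
print(len(boxes), boxes[4])

# Two hypothetical score vectors over 2 classes, e.g. from two crops.
print(average_scores([[0.25, 0.75], [0.75, 0.25]]))
```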

6. Comparison with State-of-the-art Approaches (Image Classification)

6.1 ILSVRC

10-Crop Testing Results

Comparing ResNet-34 A, B, and C, B is slightly better than A, and C is marginally better than B because extra parameters are introduced. All obtain a top-5 error rate of around 7%. (Indeed, later on, the authors reformulate the residual network by treating batch normalization as a pre-activation function, with some other small changes, and show that A is better. Nevertheless, the results are similar. I will cover this later on.)

By increasing the network depth to 152 layers, a 5.71% top-5 error rate is obtained, which is much better than VGG-16, GoogLeNet (Inception-v1), and PReLU-Net.
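For readers unfamiliar with the metric, the top-k error is the fraction of images whose true label is not among the k highest-scoring classes. A minimal sketch follows; the function name and the tiny 3-class example data are made up for illustration.

```python
def top_k_error(scores_per_image, labels, k=5):
    """Fraction of images whose true label is not among the k top scores."""
    misses = 0
    for scores, label in zip(scores_per_image, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Hypothetical scores for 2 images over 3 classes.
scores = [[0.1, 0.5, 0.4], [0.7, 0.2, 0.1]]
labels = [2, 0]
print(top_k_error(scores, labels, k=1))  # 0.5: the first image is misclassified
print(top_k_error(scores, labels, k=2))  # 0.0: both labels are in the top 2
```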

10-Crop Testing + Fully Conv with Multiple Scale Results

With 10-Crop Testing + Fully Conv with Multiple Scales, ResNet-152 can obtain a 4.49% error rate.

10-Crop Testing + Fully Conv with Multiple Scale + 6-Model Ensemble Results

With the 6-model ensemble added, the error rate is 3.57%.

6.2 CIFAR-10

CIFAR-10 Results

With skip connections, we can go deeper. However, when the number of layers goes from 110 to 1202, the error rate increases from 6.43% to 7.93%. Nevertheless, ResNet-1202 has no optimization difficulty, i.e. it can still converge.

7. Comparison with State-of-the-art Approaches (Object Detection)

PASCAL VOC 2007/2012 mAP (%)

By adopting ResNet-101 in Faster R-CNN [3–4], ResNet obtains better performance than VGG-16 by a large margin.

And ResNet finally won 1st place in ImageNet Detection, ImageNet Localization, COCO Detection and COCO Segmentation!


References

  1. [2016 CVPR] [ResNet]
    Deep Residual Learning for Image Recognition
  2. [2015 ICLR] [VGGNet]
    Very Deep Convolutional Networks for Large-Scale Image Recognition
  3. [2015 NIPS] [Faster R-CNN]
    Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
  4. [2017 TPAMI] [Faster R-CNN]
    Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

My Reviews

  1. Review: Faster R-CNN (Object Detection)
  2. Review: Batch Normalization (Inception-v2 / BN-Inception) -The 2nd to Surpass Human-Level Performance in ILSVRC 2015 (Image Classification)
  3. Review: PReLU-Net, The First to Surpass Human-Level Performance in ILSVRC 2015 (Image Classification)
  4. Review: GoogLeNet (Inception v1) — Winner of ILSVRC 2014 (Image Classification)
  5. Review: VGGNet — 1st Runner-Up (Image Classification), Winner (Localization) in ILSVRC 2014

Source: Deep Learning on Medium