ResNet with Identity Mapping — Over 1000 Layers Reached (Image Classification)



In this story, an improved ResNet [1] by Microsoft is reviewed. With Identity Mapping, the architecture can reach over 1000 layers without an increase in error.

In the previous version of ResNet [2], when the network goes from 110 layers to 1202 layers, ResNet-1202 still converges, but the error rate degrades from 6.43% to 7.93% (this result can be seen in [2]). It is left as an open question, without any explanation, in [2].

The following figure shows the results of ResNet with Identity Mapping. With 1001 layers, the previous ResNet [2] only gets 7.61% error, while the new ResNet with Identity Mapping [1] gets 4.92% on the CIFAR-10 dataset.

(a) Previous ResNet [2] (7.61%) (b) New ResNet with Identity Mapping [1] (4.92%) for CIFAR-10 Dataset

But why does it work better when the shortcut connection path is kept clean (by moving the ReLU layer from the shortcut connection path to the conv layer path, as in the figure)? This is well explained in the paper, and a series of ablation studies is done to support the importance of identity mapping.

The result is even better than Inception-v3 [3]. (If interested, please also read my Inception-v3 review.) With such a good result, it was published as a 2016 ECCV paper with more than 1000 citations at the time I was writing this story. (SH Tsang @ Medium)


What are covered

  1. Explanations of the Importance of Identity Mapping
  2. Ablation Study
  3. Comparison with State-of-the-art Approaches

1. Explanations of the Importance of Identity Mapping

Forward propagation, backpropagation and gradient updates are what often make deep learning seem like a black box. I think the explanation here is excellent.

1.1 Feed Forward

In ResNet with Identity Mapping, it is essential to keep the shortcut connection path clean from input to output, without any conv layers, BN or ReLU on it.

Let x_l be the input to the l-th residual unit and F(·) be the residual function representing the conv layers, BN and ReLU. Then we can formulate:

One particular layer:

x_{l+1} = x_l + F(x_l, W_l)

L layers deeper, from the l-th layer:

x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)

We can see that the input signal x_l is still kept, as an additive term, in any deeper feature x_L!
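
To make this concrete, here is a minimal runnable sketch (NumPy; the residual_branch function and its toy weights are my own stand-ins for F, not the actual conv/BN/ReLU stack) showing that with identity shortcuts the input x_l survives additively inside any deeper feature x_L:

```python
# Toy demonstration: with identity shortcuts, x_L = x_l + sum of residuals.
import numpy as np

rng = np.random.default_rng(0)
Ws = [rng.normal(scale=0.1, size=(8, 8)) for _ in range(5)]  # toy weights, one per unit

def residual_branch(x, W):
    # stand-in for F(.), the conv/BN/ReLU branch of a residual unit
    return np.maximum(W @ x, 0.0)

x_l = rng.normal(size=8)   # input to the l-th unit
x = x_l.copy()
residual_sum = np.zeros_like(x)
for W in Ws:               # stack L - l residual units with identity shortcuts
    F_x = residual_branch(x, W)
    residual_sum += F_x
    x = x + F_x            # x_{i+1} = x_i + F(x_i, W_i)

# the original signal x_l is still present, additively, in the deep feature x_L
assert np.allclose(x, x_l + residual_sum)
```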


1.2 Backpropagation

During backpropagation, the gradient can be decomposed into two additive terms:

Gradient decomposed into two additive terms:

\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \cdot \frac{\partial x_L}{\partial x_l} = \frac{\partial E}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i) \right)

Inside the brackets, the left term is always “1”, no matter how deep the network is. And the right term cannot always be −1 for all samples in a mini-batch, so the gradient is not canceled out to zero. Thus, the gradient does not vanish!!
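
As a quick sanity check, the toy PyTorch snippet below stacks 1000 units with identity shortcuts (using an arbitrary tanh stand-in for the residual branch, not the paper's conv/BN/ReLU stack) and confirms with autograd that the gradient reaching the input stays well away from zero:

```python
import torch

x_l = torch.tensor(1.5, requires_grad=True)
x = x_l
for _ in range(1000):               # 1000 residual units, identity shortcuts
    x = x + 1e-3 * torch.tanh(x)    # toy residual branch F(x)

x.backward()                        # take E = x_L, so dE/dx_L = 1
print(x_l.grad)                     # roughly 1.x: the "1" term keeps it from vanishing
```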


1.3 Backpropagation When Identity Mapping Is Violated

On the other hand, what if the shortcut is not an identity but a simple scaling, h(x_l) = λ_l·x_l?

One particular layer (with a scaling λ_l on the shortcut instead of identity):

x_{l+1} = \lambda_l x_l + F(x_l, W_l)

L layers from the l-th layer (\hat{F} absorbs the scalars into the residual functions):

x_L = \left( \prod_{i=l}^{L-1} \lambda_i \right) x_l + \sum_{i=l}^{L-1} \hat{F}(x_i, W_i)

Gradient decomposed into two additive terms:

\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \left( \prod_{i=l}^{L-1} \lambda_i + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \hat{F}(x_i, W_i) \right)

Similarly, the first term of the gradient is now the product of the λ_i from layer l to L−1.

If λ>1, this product becomes exponentially large and the gradient exploding problem occurs. As we should remember, when gradients explode, the loss cannot converge.

If λ<1, this product becomes exponentially small and the gradient vanishing problem occurs. The weights can then only receive tiny updates, so the loss stays on a plateau and ends up converging to a large value.

Thus, this is why the shortcut connection path needs to be kept clean from input to output, without any conv layers, BN or ReLU on it.
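
A back-of-the-envelope calculation makes the point (the layer count and λ values below are illustrative only):

```python
L = 100                     # number of residual units the gradient passes through
for lam in (1.1, 1.0, 0.9):
    print(lam, lam ** L)
# 1.1 -> ~1.4e4  (exploding)
# 1.0 -> 1.0     (identity: signal preserved)
# 0.9 -> ~2.7e-5 (vanishing)
```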


2. Ablation Study

2.1 Various types of shortcut connections

A 110-layer ResNet (54 two-layer residual units, plus the first conv layer and the final fc layer: 54×2+2 = 110) with various types of shortcut connections is tested on the CIFAR-10 dataset as below:

Performance of Various Types of Shortcut Connections

Original: That is the previous version of ResNet in [2], with 6.61% error.

Constant Scaling: λ=0.5 on the shortcut, suffering from the vanishing problem mentioned above; the best case, where the residual function F is also scaled by 0.5, gets 12.35% error.

Exclusive Gating & Shortcut-only Gating: Both attempt to add complexity to the shortcut path while still trying to keep it close to “1”. But neither gets better results.

1×1 Conv Shortcut: This is similar to option C in the previous ResNet [2]. In the previous ResNet, option C was found to be better. But now it turns out this is no longer the case when there are many residual units (i.e., when the network is very deep).

Dropout Shortcut: Applying dropout (with a ratio of 0.5) on the shortcut is statistically equivalent to a constant scaling of λ=0.5.
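
To make these variants concrete, here is a minimal PyTorch sketch of how the shortcut path differs for the identity, constant-scaling and 1×1-conv cases; the class name, channel handling and layer choices are my own simplification, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ToyResidualUnit(nn.Module):
    """Toy two-layer residual unit with a selectable shortcut type."""
    def __init__(self, channels, shortcut="identity"):
        super().__init__()
        self.branch = nn.Sequential(           # stand-in for F(x): conv-BN-ReLU-conv-BN
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.shortcut = shortcut
        if shortcut == "conv1x1":
            self.proj = nn.Conv2d(channels, channels, 1)   # 1x1 conv on the shortcut

    def forward(self, x):
        f = self.branch(x)
        if self.shortcut == "identity":        # clean shortcut: the best setting
            s = x
        elif self.shortcut == "constant":      # constant scaling with lambda = 0.5
            s, f = 0.5 * x, 0.5 * f
        elif self.shortcut == "conv1x1":       # option-C-like 1x1 conv shortcut
            s = self.proj(x)
        else:
            raise ValueError(self.shortcut)
        return torch.relu(s + f)               # these ablations keep ReLU after the addition
```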


2.2 Various Usages of Activation

The following results are obtained by playing around with the positions of BN and ReLU:

Performance of Various Usages of Activation

The Previous ResNet & BN After Addition: Both place layers (ReLU, or BN and ReLU) after the addition, so the shortcut path is not kept clean and the identity mapping is violated.

ReLU Before Addition: The output of the residual function after ReLU must be non-negative, which makes the forward-propagated signal monotonically increasing, whereas the residual function should ideally be able to take negative values as well.

ReLU-only Pre-Activation: ReLU is not used in conjunction with BN, so it cannot fully enjoy the benefits of BN.

Full Pre-Activation: The shortcut path is clean and ReLU is used in conjunction with BN, which makes it the best setting.
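
As a sketch, such a full pre-activation residual unit could look like the following in PyTorch (the class name is mine, and details such as striding and projection shortcuts for changing dimensions are omitted):

```python
import torch.nn as nn

class PreActResidualUnit(nn.Module):
    """Full pre-activation unit: BN-ReLU-conv-BN-ReLU-conv, plus a clean identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return x + self.branch(x)   # nothing after the addition: the shortcut stays clean
```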


2.3 Twofold Advantages of Pre-activation

2.3.1 Ease of Optimization

Previous ResNet structure (Baseline) vs Pre-activation Unit

Using the previous ResNet structure (Baseline) gives worse results when going very deep (1001 layers) due to the wrong position of the ReLU layer, while using the pre-activation unit consistently gives better results as the network goes deeper, from 110 to 1001 layers.

2.3.2 Reducing Overfitting

Training Error vs Iterations

The pre-activation unit has a regularization effect: slightly higher training loss at convergence, but lower test error.


3. Comparison with State-of-the-art Approaches

3.1 CIFAR-10 & CIFAR-100

CIFAR-10 & CIFAR-100 Results

For CIFAR-10, ResNet-1001 with the proposed pre-activation unit (4.62%) is even better than ResNet-1202 with the previous version of ResNet (7.93%), while having about 200 fewer layers.

For CIFAR-100, ResNet-1001 with the proposed pre-activation unit (22.71%) is even better than ResNet-1001 with the previous version of ResNet (27.82%).

For both CIFAR-10 and CIFAR-100, going from ResNet-164 to ResNet-1001 does not increase the error with the proposed pre-activation unit, whereas it does with the previous ResNet [2].

On CIFAR-10, ResNet-1001 takes about 27 hours to train with 2 GPUs.

3.2 ILSVRC

ILSVRC Image Classification Results

With only scale augmentation, the previous version of ResNet-200 (6.0%) performs worse than the previous version of ResNet-152 (5.5%, the ILSVRC 2015 winning architecture) even though it is deeper, due to the wrong position of the ReLU.

And the proposed ResNet-200 with pre-activation (5.3%) gets better results than the previous ResNet-200 (6.0%).

With both scale and aspect ratio augmentation, the proposed ResNet-200 with Pre-Activation (4.8%) is better than Inception-v3 [3] by Google (5.6%).

Concurrently, Google also has Inception-ResNet-v2, which obtains 4.9% error; with the pre-activation unit, the error would be expected to be reduced further.

On ILSVRC, ResNet-200 takes about 3 weeks to train on 8 GPUs.


Having reviewed ResNet and ResNet with Identity Mapping, as well as Inception-v1, Inception-v2 and Inception-v3, I will also write a review of Inception-v4. Please stay tuned!

