ResNet and ResNeXt from Microsoft and Facebook Research respectively

Understanding and Implementing Architectures of ResNet and ResNeXt for state-of-the-art Image Classification: From Microsoft to Facebook [Part 2]

In this second and final part of the blog post, we will explore the optimal functions to use in the skip connections of ResNet blocks, discuss the ResNeXt architecture, and implement it in PyTorch.

About the series:

This is Part 2 of a two-part series of blog posts exploring residual networks.

  • Understanding and implementing ResNet Architecture [Part-1]
  • Understanding and implementing ResNeXt Architecture [Part-2]

For people who have understood Part 1, this should be a fairly simple read. I will follow the same approach as in Part 1.

  1. Brief discussion on Identity mappings in Deep Residual Networks (link to paper) [An important case study]
  2. ResNeXt Architecture Review (link to paper)
  3. Experimental studies on ResNeXt
  4. ResNeXt Implementation in PyTorch

Brief discussion on Identity mappings in Deep Residual Networks

This paper gives a theoretical understanding of why the vanishing gradient problem is not present in residual networks, and examines the role of skip connections by replacing the identity mapping (x) with different functions.

Residual Network Equation

F is the residual function (a stack of non-linear layers) and f is a ReLU activation function.

They found that when both f(y_l) and h(x_l) are identity mappings, the signal can be propagated directly from one unit to any other unit, in both the forward and backward directions. The error rate is also lowest when both are identity mappings. Let's look at each case individually.
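For reference, the residual unit that the paper analyses can be written as below. This is my transcription of the paper's notation, where W_l stands for the weights of unit l. When h and f are both identity mappings, unrolling the recursion from unit l to any deeper unit L gives the second line, which is the direct propagation mentioned above.

```latex
\begin{aligned}
y_l &= h(x_l) + F(x_l, W_l), \qquad x_{l+1} = f(y_l) \\
x_L &= x_l + \sum_{i=l}^{L-1} F(x_i, W_i)
      \qquad \text{(when } h(x_l) = x_l \text{ and } f(y_l) = y_l \text{)}
\end{aligned}
```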

1. Finding Optimal h(x_{l}) function

Optimal function for Skip-connections in Residual networks
Backprop of the ResNet module

Case-1, Lambda = 0: This is a plain network. Since w2, w1 and w0 all lie between -1 and 1, the gradient shrinks as the network depth increases. This is the vanishing gradient problem.

Case-2, Lambda > 1: In this case, the backprop value grows with every layer and leads to exploding gradients.

Case-3, Lambda < 1: For shallow networks this might not be a problem. But for very deep networks, weight + lambda is still less than 1 in most cases, so it runs into the same problem as Case-1.

Case-4, Lambda = 1: In this case, every weight is incremented by 1. This avoids multiplying by very large numbers as in Case-2 and by very small numbers as in Case-1, and acts as a good barrier.
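One way to see the four cases numerically: in the paper's derivation, with a shortcut h(x) = Lambda * x, the part of the gradient that travels purely along the shortcuts across n units is scaled by Lambda to the power n. The sketch below only evaluates that scaling factor. The function name `shortcut_gradient_scale` is my own and is not from the paper, and the part of the gradient that passes through the weight layers is ignored.

```python
def shortcut_gradient_scale(lam, num_units):
    """Scale applied to the gradient that flows only through the shortcuts.

    Assumes every unit uses the same shortcut h(x) = lam * x, so crossing
    `num_units` units multiplies the gradient by lam ** num_units.
    """
    scale = 1.0
    for _ in range(num_units):
        scale *= lam
    return scale


if __name__ == "__main__":
    # Lambda = 0 (plain network), < 1, = 1 (identity), > 1
    for lam in (0.0, 0.9, 1.0, 1.1):
        print(lam, shortcut_gradient_scale(lam, 100))
```

Only Lambda = 1 leaves the shortcut gradient unchanged at any depth. A value below 1 shrinks it exponentially, and a value above 1 blows it up.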

The paper also experiments with placing other operations on the skip connection, such as scaling, gating, 1×1 convolution and dropout, and finds that network performance degrades in each case. Below are the experimental networks they tried, out of which only the first one (a) gave the minimum error rate.

Different skip connections in Residual networks.
Deep Residual networks results

2. Finding Optimal f(y_{l}) function

Different Residual blocks

The above 5 architectures were studied on ResNet-110 and ResNet-164, with the following results. In both networks, pre-activation outperformed all the other variants. So a simple addition with an identity mapping, instead of a ReLU, is the more appropriate choice for f. Placing the ReLU and BN layers inside the residual branch helped the network optimize faster and regularize better (lower test error), thus reducing over-fitting.

Residual networks Error metrics


So identity shortcut connections (Case-4, Lambda = 1) and an identity after-addition activation are essential for smooth information propagation. The ablation experiments are consistent with the derivations discussed above.
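As a rough illustration of what pre-activation means in code, here is a minimal PyTorch sketch of a residual unit with two 3×3 convolutions. The class name `PreActBlock` and the fixed channel count are my own choices for illustration, and this is not the paper's reference implementation. BN and ReLU come before each convolution, and nothing is applied after the addition, so both h and f are identity mappings.

```python
import torch
import torch.nn as nn


class PreActBlock(nn.Module):
    """Sketch of a pre-activation residual unit: (BN -> ReLU -> conv) x 2 + identity."""

    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        # Residual branch F: activation comes before each weight layer.
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        # h(x) = x and f = identity: no BN or ReLU after the addition.
        return x + out


if __name__ == "__main__":
    block = PreActBlock(16)
    x = torch.randn(2, 16, 8, 8)
    print(block(x).shape)
```

Because nothing sits on the shortcut or after the addition, the input x reaches the output untouched whenever the residual branch outputs zero.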

ResNeXt Architecture Review

ResNeXt won 2nd place in the ILSVRC 2016 classification task and also showed performance improvements over its ResNet counterpart on COCO detection and the ImageNet-5K set.

This is a very simple paper to read which introduces a new term called “cardinality”. The paper simply explains this term, makes use of it in ResNet networks, and runs various ablation studies.

The paper makes several attempts to describe the complexity of Inception networks and why the ResNeXt architecture is simple by comparison. I'm not going to do this here, as it would require the reader to understand Inception networks. I will just talk about the architecture here.

ResNet (left) and ResNeXt (right) Architecture.
  • The above diagram distinguishes between a simple ResNet block and a ResNeXt block.
  • It follows a split-transform-aggregate strategy.
  • The number of paths inside the ResNeXt block is defined as cardinality. In the above diagram C=32
  • All the paths contain the same topology.
  • Having high cardinality, rather than high depth and width, helps in decreasing validation error.
  • ResNeXt tries to embed more subspaces than its ResNet counterpart.
  • The two architectures have different widths. Layer-1 in the ResNet block is one conv layer of width 64, while layer-1 in the ResNeXt block is 32 different conv layers of width 4 each (32*4 total width). Despite the larger overall width in ResNeXt, both blocks have roughly the same number of parameters (~70k): ResNet has 256*64 + 3*3*64*64 + 64*256, and ResNeXt has C*(256*d + 3*3*d*d + d*256) with C=32 and d=4.
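The two parameter counts in the last bullet can be checked with a few lines of plain Python. The function names are mine, and biases and batch-norm parameters are left out, as they are in the formulas above.

```python
def resnet_bottleneck_params(in_ch=256, width=64):
    """1x1 reduce + 3x3 conv + 1x1 expand, convolution weights only."""
    return in_ch * width + 3 * 3 * width * width + width * in_ch


def resnext_bottleneck_params(in_ch=256, cardinality=32, d=4):
    """`cardinality` parallel paths, each a bottleneck of width d."""
    return cardinality * (in_ch * d + 3 * 3 * d * d + d * in_ch)


if __name__ == "__main__":
    print(resnet_bottleneck_params())   # 69632
    print(resnext_bottleneck_params())  # 70144
```

Both land at roughly 70k, which is why the two blocks can be compared at equal complexity.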

Below is the difference in architecture between ResNet and ResNeXt

ResNet vs ResNeXt Architecture.

So resnext_32*4d denotes a network whose bottleneck blocks [one block in the above diagram] have a cardinality of 32 and a width of 4 per path. Later we will look at the resnext_32*4d and resnext_64*4d implementations in PyTorch.
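To make the 32*4d block concrete, here is a minimal PyTorch sketch. It relies on the equivalence shown in the ResNeXt paper between the 32 parallel paths and a single grouped convolution, which in PyTorch is the `groups` argument of `nn.Conv2d`. The class name `ResNeXtBottleneck` and the helper `conv_params` are my own, and this is not Cadene's code or any official implementation.

```python
import torch
import torch.nn as nn


class ResNeXtBottleneck(nn.Module):
    """Sketch of one ResNeXt block in its grouped-convolution form."""

    def __init__(self, channels=256, cardinality=32, width=4):
        super().__init__()
        inner = cardinality * width  # 32 * 4 = 128 channels in total
        self.reduce = nn.Conv2d(channels, inner, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(inner)
        # groups=cardinality splits the 3x3 conv into 32 independent paths of width 4.
        self.grouped = nn.Conv2d(inner, inner, kernel_size=3, padding=1,
                                 groups=cardinality, bias=False)
        self.bn2 = nn.BatchNorm2d(inner)
        self.expand = nn.Conv2d(inner, channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.reduce(x)))
        out = torch.relu(self.bn2(self.grouped(out)))
        out = self.bn3(self.expand(out))
        return torch.relu(x + out)  # aggregate with the identity shortcut


def conv_params(module):
    """Count convolution weights only (no batch-norm parameters)."""
    return sum(m.weight.numel() for m in module.modules() if isinstance(m, nn.Conv2d))


if __name__ == "__main__":
    block = ResNeXtBottleneck()
    print(conv_params(block))
    print(block(torch.randn(1, 256, 8, 8)).shape)
```

The convolution weight count of this sketch matches the C*(256*d + 3*3*d*d + d*256) formula with C=32 and d=4.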


1. Cardinality vs width: With C increasing from 1 to 32, we can clearly see a decrease in top-1 % error rate. Therefore, increasing C while decreasing the width improved the performance of the model.

cardinality vs width
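To get a feel for how cardinality is traded against width at a fixed budget, the sketch below evaluates the block parameter formula for a few (C, d) pairs. The pairs are the ones I recall from the cardinality/width table in the ResNeXt paper, so treat them as an assumption and check them against the paper. The function name is mine.

```python
def resnext_block_params(cardinality, d, in_ch=256):
    """Convolution weights of one ResNeXt block with `cardinality` paths of width d."""
    return cardinality * (in_ch * d + 3 * 3 * d * d + d * in_ch)


# (C, d) settings assumed from the paper's cardinality vs width table.
SETTINGS = [(1, 64), (2, 40), (4, 24), (8, 14), (32, 4)]

if __name__ == "__main__":
    for c, d in SETTINGS:
        print(f"C={c:2d} d={d:2d} params={resnext_block_params(c, d)}")
```

Every setting stays close to 70k parameters, so the change in error rate comes from how the capacity is arranged rather than from a bigger block.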

2. Increasing Cardinality vs Deeper/Wider: Three cases were studied: 1) increasing the number of layers from 101 to 200, 2) going wider by increasing the bottleneck width, and 3) increasing cardinality by doubling C.

They observed that increasing C gave the largest performance improvement. Below are the results.

cardinality vs Deeper/wider networks


An ensemble of different ResNeXt architectures gave a top-5 error rate of 3.03%, thus winning second place in the ILSVRC competition.

The architecture is simple in design compared to Inception modules.

Implementation in PyTorch

At the time of writing, ResNeXt is not officially available in PyTorch. Cadene has implemented it and also made the pre-trained weights available.


I have taken his code and made it easy to experiment with different transfer learning techniques.


I also wrote a blog post explaining how to use this repo. You can find the ResNeXt implementations here. Both ResNeXt-32*4d and ResNeXt-64*4d are available along with ImageNet pre-trained weights.

Almost any Image Classification Problem using PyTorch


Source: Deep Learning on Medium