# Understanding and Implementing Architectures of ResNet and ResNeXt for state-of-the-art Image Classification: From Microsoft to Facebook [Part 2]


#### In part 2 of this two-part blog post we will explore the optimal functions used in the skip-connections of ResNet blocks, discuss the ResNeXt architecture, and implement it in PyTorch.

This is Part 2 of a two-part series of blog posts exploring residual networks.

• Understanding and implementing ResNet Architecture [Part-1]
• Understanding and implementing ResNeXt Architecture [Part-2]

For people who have read Part 1, this will be a fairly simple read. I will follow the same approach as in Part 1.

1. Brief discussion on Identity mappings in Deep Residual Networks (link to paper) [An important case study]
2. ResNeXt Architecture Review (link to paper)
3. Experimental studies on ResNeXt
4. ResNeXt Implementation in PyTorch

### Brief discussion on Identity mappings in Deep Residual Networks

This paper gives a theoretical understanding of why the vanishing gradient problem is not present in residual networks, and examines the role of skip connections by replacing the identity mapping (x) with different functions.

Here F is a stack of non-linear layers and f is a ReLU activation function.

They found that when both f(y_l) and h(x_l) are identity mappings, the signal can be propagated directly from one unit to any other unit, in both the forward and backward directions. Both also achieve the minimum error rate when they are identity mappings. Let's look at each case individually.
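
Restated from the paper's notation, the residual unit and the consequence of making both mappings identities can be written as:

```latex
% Residual unit: h is the shortcut function, f the after-addition activation
y_l = h(x_l) + \mathcal{F}(x_l, \mathcal{W}_l), \qquad x_{l+1} = f(y_l)

% With h(x_l) = x_l and f = identity, any deeper unit L is the sum of a
% shallower unit and the residual functions in between:
x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i)

% Backward pass: the additive 1 carries the gradient directly back to any
% shallower unit, so the signal cannot vanish:
\frac{\partial E}{\partial x_l}
  = \frac{\partial E}{\partial x_L}
    \left( 1 + \frac{\partial}{\partial x_l}
      \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i) \right)
```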

#### 1. Finding Optimal h(x_{l}) function

Case-1, Lambda = 0: This reduces to a plain network. Since the weights w2, w1, w0 all lie in (-1, 1), the gradient vanishes as the network depth increases. This clearly shows the vanishing gradient problem.

Case-2, Lambda > 1: The backpropagated signal is multiplied by a factor greater than 1 at every block, leading to exploding gradients.

Case-3, Lambda < 1: For shallow networks this might not be a problem, but for very deep networks the product of the weight and lambda is still less than 1 in most cases, and we run into the same problem as Case-1.

Case-4, Lambda = 1: The gradient gains an additive term of 1 at every block. This eliminates the problem of multiplying by very large numbers as in Case-2 and by very small numbers as in Case-1, and acts as a good safeguard.
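
The four cases can be sketched numerically. In this toy illustration (mine, not from the paper, and ignoring the residual branch F), a constant shortcut scaling of lambda multiplies the backward signal by lambda raised to the depth:

```python
def shortcut_gradient_factor(lam: float, depth: int) -> float:
    """Factor the backward signal picks up after `depth` blocks whose
    shortcut scales by a constant `lam` (residual branch ignored)."""
    factor = 1.0
    for _ in range(depth):
        factor *= lam
    return factor

print(shortcut_gradient_factor(0.9, 100))  # ~2.7e-05 -> vanishes (Case-3)
print(shortcut_gradient_factor(1.1, 100))  # ~13780.6 -> explodes (Case-2)
print(shortcut_gradient_factor(1.0, 100))  # 1.0      -> stable   (Case-4)
```

Only the identity (Lambda = 1) keeps the signal at the same scale regardless of depth.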

The paper also experimented with adding dropout and convolution layers to the skip-connection, and found that network performance degraded. Below are the 5 experimental networks they tried, of which only the first one (a), the identity shortcut, gave the minimum error rate.

#### 2. Finding Optimal f(y_{l}) function

The above 5 architectures were studied on ResNet-110 and ResNet-164, with the following results. In both networks, pre-activation outperformed all the other variants. So a simple addition with identity mapping, rather than a ReLU f(x) function, is more appropriate. Moving the ReLU and BN layers into the residual branch helped the network optimize quickly and regularize better (lower test error), thus reducing over-fitting.
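
As a sketch, the pre-activation ordering (BN → ReLU → conv, with nothing after the addition) can be written in PyTorch as follows. `PreActBlock` and the equal-channel, stride-1 assumption are mine for illustration, not the paper's code:

```python
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation residual block: BN and ReLU come *before* each conv,
    and the shortcut addition has no activation after it."""
    def __init__(self, channels: int):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return x + out  # identity shortcut, identity after-addition
```

Because both the shortcut and the after-addition function are identities, stacking these blocks leaves a clean additive path from input to output.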

#### Conclusions

So having identity shortcut connections (Case-4 above) and identity after-addition activation is essential for making information propagation smooth. The ablation experiments are consistent with the derivations discussed above.

### ResNeXt Architecture Review

ResNeXt won 2nd place in the ILSVRC 2016 classification task and also showed performance improvements on COCO detection and the ImageNet-5K set over its ResNet counterpart.

This is a very simple paper to read. It introduces a new term, “cardinality”, explains it, applies it to ResNet networks, and performs various ablation studies.

The paper makes several attempts to describe the complexity of Inception networks and why the ResNeXt architecture is simpler. I'm not going to do that here, as it would require the reader to understand Inception networks; I will just talk about the architecture.

• The above diagram distinguishes between a simple ResNet block and a ResNeXt block.
• It follows a split-transform-aggregate strategy.
• The number of paths inside the ResNeXt block is defined as cardinality. In the above diagram C=32
• All the paths contain the same topology.
• Instead of having high depth and width, having high cardinality helps in decreasing validation error.
• ResNeXt tries to embed more subspaces than its ResNet counterpart.
• The two architectures have different widths. Layer-1 in ResNet has one conv layer of width 64, while layer-1 in ResNeXt has 32 different conv layers of width 4 (32*4 width). Despite the larger overall width in ResNeXt, both architectures have the same number of parameters (~70k): ResNet 256*64 + 3*3*64*64 + 64*256, ResNeXt C*(256*d + 3*3*d*d + d*256) with C=32 and d=4.
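
The parameter arithmetic in the last bullet can be checked directly (plain arithmetic over the block widths quoted above):

```python
# ResNet bottleneck block: 1x1 (256->64), 3x3 (64->64), 1x1 (64->256)
resnet_params = 256 * 64 + 3 * 3 * 64 * 64 + 64 * 256

# ResNeXt block: C parallel paths of width d, each with
# 1x1 (256->d), 3x3 (d->d), 1x1 (d->256)
C, d = 32, 4
resnext_params = C * (256 * d + 3 * 3 * d * d + d * 256)

print(resnet_params, resnext_params)  # 69632 70144 -- both ~70k
```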

Below is the difference in architecture between ResNet and ResNeXt.

So resnext_32*4d represents a ResNeXt network whose bottleneck blocks [one block in the above diagram] have cardinality C=32 and bottleneck width d=4. Later we will look at the resnext_32*4d and resnext_64*4d implementations in PyTorch.

#### Studies:

1. Cardinality vs width: as C increases from 1 to 32, we can clearly see a decrease in the top-1 error rate. Increasing C while decreasing the width therefore improves the performance of the model.

2. Increasing cardinality vs going deeper/wider: three cases were studied: 1) increasing the number of layers from 101 to 200; 2) going wider by increasing the bottleneck width; 3) increasing cardinality by doubling C.

They observed that increasing C gave the best performance improvement. Below are the results.

#### Conclusions:

An ensemble of different ResNeXt architectures gave a top-5 error rate of 3.03%, winning second place in the ILSVRC competition.

The architecture is simple in design compared to Inception modules.

#### Implementation in PyTorch

ResNeXt is not officially available in PyTorch, but Cadene has implemented it and made the pre-trained weights available as well.
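
As a minimal sketch (mine, not Cadene's code), a ResNeXt bottleneck block can be written with a grouped 3×3 convolution, which the paper shows is equivalent to the C parallel split-transform-aggregate paths:

```python
import torch
import torch.nn as nn

class ResNeXtBottleneck(nn.Module):
    """Sketch of a ResNeXt bottleneck block. The `groups` argument of the
    3x3 conv realizes the C parallel paths of width d in a single layer.
    Assumes equal input/output channels and stride 1 (identity shortcut)."""
    def __init__(self, in_channels: int = 256, cardinality: int = 32, width: int = 4):
        super().__init__()
        d = cardinality * width  # 32 * 4 = 128 internal channels
        self.conv1 = nn.Conv2d(in_channels, d, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(d)
        self.conv2 = nn.Conv2d(d, d, 3, padding=1, groups=cardinality, bias=False)
        self.bn2 = nn.BatchNorm2d(d)
        self.conv3 = nn.Conv2d(d, in_channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + x)  # identity shortcut
```

For the pre-trained networks themselves, Cadene's `pretrainedmodels` package exposes the resnext101_32x4d and resnext101_64x4d variants discussed above.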