Wide Residual Nets: “Why deeper isn’t always better…”

Source: Deep Learning on Medium

In this article, we address one of the biggest questions in Deep Learning: “Is deeper always better? And when is it deep enough?”

Yes, going deep has its benefits, yet is it always the best solution?

The answer is a big “NO”. Going deep has its limitations because there are parts of the problem that will remain untouched. If we want to improve and push forward Deep Learning for Computer Vision we will definitely need to do better than just going deep.

If we trace the state-of-the-art architectures in DL, we see a trend: most, if not all, take the word “Deep” in Deep Learning a bit too seriously, on the common belief that all we need to improve accuracy and representational power is to go deeper. We can see this in the winner of the 2015 ILSVRC (ImageNet competition), a family of neural networks for computer vision recognition tasks named ResNet (Residual Network)[1], with depths ranging from 18 to 152 layers, the best among them being, of course, the 152-layer ResNet-152.

This architecture, more than 100 layers deep, set a new state-of-the-art accuracy of 94%.

FIG.1

The main idea of ResNet is the skip connection: the input “x” is processed by a residual block “F(x)”, made of 2x(Conv-BN-ReLU), and the result is then added back to the main flow “x”.
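To make the skip connection concrete, here is a minimal sketch in plain Python. The names `residual_block` and `toy_f` are illustrative only, and `toy_f` is a toy stand-in for the real 2x(Conv-BN-ReLU) branch; the point is simply that the block returns x + F(x).

```python
from typing import Callable, List

Vector = List[float]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]


def residual_block(x: Vector, f: Callable[[Vector], Vector]) -> Vector:
    """Return x + F(x): the branch output added back to the skip path."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must keep the shape of x so the two can be added")
    return [xi + fi for xi, fi in zip(x, fx)]


def toy_f(v: Vector) -> Vector:
    """Toy stand-in for F (assumption): scale by 0.5, then ReLU."""
    return relu([0.5 * a for a in v])


x = [1.0, -2.0, 3.0]
print(residual_block(x, toy_f))  # [1.5, -2.0, 4.5]
```

Note that if F outputs zeros, the block reduces to the identity, which is why adding such blocks should not make a network worse in principle.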

In the following year, another paper titled Identity Mappings in Deep Residual Networks[2] was released. Its authors saw that the ResNet family showed greater accuracy improvements and better convergence behaviour than other deep architectures. Interestingly enough, they developed an even deeper architecture with a staggering 1001 layers.

ResNet-1001 layer deep network

FIG.2

One of the main contributions of this paper is the order of operations inside the residual block: in the proposed block, the activations (Batch Norm and ReLU) come first in the “F(x)” flow, as a “pre-activation” of the weight layers (convolutions), which is the opposite of the original, conventional “post-activation” design.
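The two orderings can be written down as plain lists of operation labels. This is only a sketch of the ordering as I understand it from papers [1] and [2], not an implementation of either network; the label names and the helper `ops_after_add` are illustrative.

```python
from typing import List

# Operation order inside one residual block, written as plain labels.
# "add" marks the point where F(x) is summed with the skip path x.
POST_ACTIVATION: List[str] = ["conv", "bn", "relu", "conv", "bn", "add", "relu"]
PRE_ACTIVATION: List[str] = ["bn", "relu", "conv", "bn", "relu", "conv", "add"]


def ops_after_add(block: List[str]) -> List[str]:
    """Return the operations applied to x + F(x) before the next block."""
    return block[block.index("add") + 1:]


print(ops_after_add(POST_ACTIVATION))  # ['relu']
print(ops_after_add(PRE_ACTIVATION))   # []
```

In the pre-activation ordering nothing is applied after the addition, so the skip path stays a pure identity from block to block.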

Interestingly enough, it works amazingly well for them: their 1001-layer ResNet generalizes better than the original 152-layer ResNet. Let’s pause for a second and think about this: the number of layers was increased more than 6-fold (from 152 to 1001) to improve results in a regime where the original ResNet starts to overfit, which is great, but…

Furthermore, one passage in paper [2] got me worried, even though it is a really great paper written by people I respect, such as Kaiming He (the inventor of He normal weight initialization), who are in turn respected by the people I look up to.

Right after explaining the benefits of their proposed architecture the authors went on to say:

“ These results suggest that there is much room to exploit the dimension of network depth, a key to the success of modern deep learning. ”

Let us understand some other features of this architecture before we come to any conclusions.

What the conversation between them must have been like.

With the proposed architecture, the authors addressed the vanishing gradient problem: gradients keep flowing even when they are extremely small. That’s really great news; let me tell you why:

The vanishing gradient problem is a difficulty found in training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, each of the neural network’s weights (knowledge, as I like to call them) receives an update based on the partial derivative of the error function (by how much they missed). The problem is that in some cases the gradient will be vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from further training.
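A toy scalar example shows the effect. This is a simplification I am assuming for illustration (one number per layer instead of a Jacobian): in a plain chain the backpropagated gradient is the product of the local derivatives, while for a residual layer y = x + f(x) each factor becomes 1 + f'(x).

```python
import math
from typing import Sequence


def plain_chain_gradient(local_derivs: Sequence[float]) -> float:
    """Gradient through stacked plain layers: product of the local derivatives."""
    return math.prod(local_derivs)


def residual_chain_gradient(local_derivs: Sequence[float]) -> float:
    """Gradient through stacked residual layers y = x + f(x).

    Each layer contributes dy/dx = 1 + f'(x), so the identity term keeps
    the product from collapsing when f'(x) is small.
    """
    return math.prod(1.0 + d for d in local_derivs)


small = [0.01] * 10  # ten layers, each with a tiny local derivative
print(plain_chain_gradient(small))     # vanishingly small
print(residual_chain_gradient(small))  # stays close to 1
```

With ten layers whose local derivative is 0.01, the plain product is on the order of 1e-20, whereas the residual product stays slightly above 1.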

Furthermore, the authors went on to experiment with different activation arrangements; if you would like to know more, check the paper [2] after you read this article.

The bottom line is that pre-activation works amazingly well: a simple change in the order of activations gave birth to a colossal 1001-layer network.

What authoring this paper must have felt like…

That’s truly remarkable, but the question still stands: are we really going to solve everything by going deeper and deeper? Or is there another option?

GOING WIDE AND JUST THE RIGHT AMOUNT OF DEPTH

Nowadays it’s common sense that: “Deeper isn’t always better.”

Enter Wide Residual Networks (WRNs)[3]. This too is a small, simple change to your typical ResNet, which we will get to shortly.

In the abstract of the WRNs paper[3], the authors mention how incredible ResNets are for the simple fact of being able to scale up to thousands of layers and still show improving performance.

Then things get interesting. They mention a big pitfall of going deeper and deeper: each fraction of a percent of improved accuracy costs nearly doubling the number of layers. As a natural consequence of this large increase in depth, the network develops a problem of diminishing feature reuse (the direct cousin of the vanishing gradient) during training, which makes it slow to train.

Diminishing feature reuse[4] during forward propagation (also known as loss in information flow) refers to the analogous problem to vanishing gradients in the forward direction. The features of the input instance, or those computed by earlier layers, are “washed out” through repeated multiplication or convolution with (randomly initialized) weight matrices, making it hard for later layers to identify and learn “meaningful” gradient directions.

Recently, several new architectures attempt to circumvent this problem through direct identity mappings between layers[2], which allow the network to pass on features unimpededly from earlier layers to later layers.

Long training time is a serious concern as networks become very deep. The forward and backward passes scale linearly with the depth of the network. Even on modern computers with multiple state-of-the-art GPUs, architectures like the 152-layer ResNet require several weeks to converge on the ImageNet dataset.

Deep nets running against Wide ones. Credits: google.com

Let’s pause for a second and think about this. We can clearly see from the section above that the authors of the Identity Mappings in Deep ResNets[2] paper made a significant contribution by addressing both the vanishing gradient and its cousin, diminishing feature reuse, through identity mappings between layers. However, after reading the Wide ResNets[3] paper, I believe that the 1000-layer set-up is not optimal. Let me convince you why “deeper is not always better”.

Width vs Depth in ResNets

FIG. 3 The proposed architectures are ( C ) and ( D )

The authors of WRNs[3] propose an architecture where we decrease the depth and increase the width of the residual blocks.

Normal residual blocks aim to make the network as thin as possible, reducing the number of parameters so that depth can be increased. As you can see in Fig. 3(b), there is even a bottleneck architecture that first reduces the feature dimension with a 1×1 Conv, applies a 3×3 Conv at the reduced dimension, and then expands it back with a 1×1 Conv before adding the result to the main flow; this set-up makes the residual block even thinner.

However, the very set-up that lets identity mappings train very deep networks, by letting features flow unimpeded from earlier layers to later ones and gradients flow through the network no matter how small, is at the same time its Achilles’ heel. As the authors of WRNs[3] put it, to quote them directly with a small change at the end:

“As gradient flows through the network there is nothing to force it to go through residual block weights and it can avoid learning anything during training, so it is possible that there is either only a few blocks that learn useful representations, or many blocks share very little information with small contribution to the final goal. This problem was formulated as diminishing feature reuse in [4]. The authors of [4] tried to address this problem with the idea of randomly disabling residual blocks during training. This method can be viewed as a special case of dropout [4], where each residual block has an identity scalar weight on which dropout is applied. The effectiveness of this approach proves the hypothesis presented in the beginning of the article.”
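The idea of randomly disabling residual blocks mentioned in the quote can be sketched as follows. This is a hedged illustration in plain Python, not the code of [4]: the function name, the `survival_prob` parameter and the toy branch are my own, and the test-time scaling by the survival probability reflects my understanding of the stochastic depth method.

```python
import random
from typing import Callable, List, Optional

Vector = List[float]


def stochastic_depth_block(
    x: Vector,
    f: Callable[[Vector], Vector],
    survival_prob: float,
    training: bool,
    rng: Optional[random.Random] = None,
) -> Vector:
    """Residual block whose branch F is randomly dropped during training."""
    if not 0.0 < survival_prob <= 1.0:
        raise ValueError("survival_prob must be in (0, 1]")
    if training:
        source = rng if rng is not None else random
        if source.random() >= survival_prob:
            return list(x)  # block disabled: only the identity path is used
        return [xi + fi for xi, fi in zip(x, f(x))]
    # At test time every block is kept, with F scaled by its survival probability.
    return [xi + survival_prob * fi for xi, fi in zip(x, f(x))]


def toy_branch(v: Vector) -> Vector:
    """Toy stand-in for the residual branch F (assumption)."""
    return [2.0 * a for a in v]


rng = random.Random(0)
x = [1.0, 2.0]
active = sum(
    stochastic_depth_block(x, toy_branch, 0.8, True, rng) != x for _ in range(1000)
)
print(f"branch active in {active} of 1000 training passes")
```

Because a dropped block is just the identity, the network is effectively shallower on each training pass, which is the "special case of dropout" reading given in the quote.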

According to the authors, and to the code implementation linked at the end of this article (please go through it and see for yourself), widening the residual blocks in a ResNet provides a much more doable, replicable and effective way of improving the performance of residual networks than increasing their depth. Now, like everything in the world, it has to be done in moderation, or else we will just swap a depth-focused mentality for a width-focused one without changing our problems. But if we keep a good balance between both, I believe we have a great shot at actually solving intelligence, simply because “intelligence is wide and deep”.

Going back, the presented wider ResNet architecture is significantly better than plain deep ResNets, having 50-fold (50x) fewer layers and being more than 2x faster.

Yes, exactly what you just read.

For instance, the wide 16-layer network has the same accuracy as a 1000-layer thin network and a comparable number of parameters, while being several times faster to train. The experiments conducted by the authors of [3] thus seem to indicate that the main power of deep residual networks is in the residual blocks, and that the effect of depth is supplementary.

Deep learning community: “Deeper will always be better”.

Of course, increasing depth has served us well, but it is time for a little change. Balance is key: yin and yang.

The authors also noted in their experiments that one can train even better wide residual networks that have twice as many parameters (and more), which suggests that to further improve performance by increasing depth of thin networks one needs to add thousands of layers in this case.

Wide ResNets

There are two types of residual blocks:

  • basic — with two consecutive 3 × 3 convolutions with batch normalization and ReLU preceding convolution: conv3 × 3-conv3 × 3 Fig.3(a)
  • bottleneck — with one 3 × 3 convolution surrounded by dimensionality reducing and expanding 1 × 1 convolution layers: conv1 × 1-conv3 × 3-conv1 × 1 Fig.3(b)
Fig. 4 The structure of wide resnets credits to [3]

The width of the network is determined by the factor k; in the original architecture, k=1. Groups of convolutions are shown in brackets, where N is the number of blocks in a group, and downsampling is performed by the first layers in groups conv3 and conv4. The final classification layers are omitted in the paper for clarity. So, go ahead and just look at this as a normal ConvNet that ends with a flattening layer followed by fully connected layers.

In this architecture the authors chose the pre-activation set-up from [2] for the residual blocks, as it was shown to train much faster and achieve better results.

There are essentially three simple ways to increase representational power of residual blocks:

  • to add more convolutional layers per block
  • to widen the convolutional layers by adding more feature planes
  • to increase filter sizes in convolutional layers

Width

Here the authors introduce several parameters:

  • l: deepening factor (number of convolutions in a block)
  • k: widening factor (multiplies the number of features in a block)
  • d: total number of blocks

While the number of parameters increases linearly with l (the deepening factor) and d (the number of ResNet blocks), number of parameters and computational complexity are quadratic in k. However, it is more computationally effective to widen the layers than have thousands of small kernels as GPU is much more efficient in parallel computations on large tensors, so we are interested in an optimal d to k ratio. — WRNs paper[3]
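The scaling claims in the quote can be checked with a little arithmetic. The sketch below assumes square kernels, equal input and output channels inside a block, and ignores biases and batch-norm parameters; the function names and the base width of 16 are illustrative choices of mine.

```python
def block_params(base_width: int, k: int, l: int = 2, kernel: int = 3) -> int:
    """Weights in one block of l convolutions with base_width * k channels each.

    Assumption: each convolution maps channels -> channels, so it holds
    kernel * kernel * channels * channels weights.
    """
    channels = base_width * k
    return l * kernel * kernel * channels * channels


def group_params(base_width: int, k: int, d: int, l: int = 2) -> int:
    """Weights in d identical blocks: linear in d and l, quadratic in k."""
    return d * block_params(base_width, k, l)


print(block_params(16, 1))                        # 4608
print(block_params(16, 2) / block_params(16, 1))  # 4.0
print(group_params(16, 1, d=4) / group_params(16, 1, d=2))  # 2.0
```

Doubling k quadruples the weight count, while doubling d or l only doubles it, which is the linear versus quadratic behaviour described above.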

This means we pay for wider residual blocks with more parameters per layer, but a wide ResNet less than 50 layers deep outperforms a jaw-dropping 1000-layer ResNet. Furthermore, this is indicative of something really worth exploring…

  • WRN-n-k denotes a residual network that has a total number of convolutional layers n and a widening factor k (for example, a network with 40 layers and k = 2 times wider than the original would be denoted as WRN-40-2).
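The WRN-n-k naming can be turned into a small helper. This is a sketch under assumptions that are not stated in this article: for basic blocks I take the depth to be n = 6N + 4, with N blocks in each of the three groups, and base widths of 16, 32 and 64 channels, which is my reading of the paper [3]. The function name `wrn_config` is hypothetical.

```python
from typing import Dict, List, Union


def wrn_config(depth: int, k: int) -> Dict[str, Union[str, int, List[int]]]:
    """Derive blocks per group and channel widths for a WRN-depth-k network.

    Assumes basic blocks, depth = 6N + 4, and base widths 16/32/64.
    """
    if depth < 10 or (depth - 4) % 6 != 0:
        raise ValueError("depth must be of the form 6N + 4 with N >= 1")
    if k < 1:
        raise ValueError("widening factor k must be at least 1")
    return {
        "name": f"WRN-{depth}-{k}",
        "blocks_per_group": (depth - 4) // 6,
        # conv1 keeps 16 channels; groups conv2..conv4 are widened by k
        "widths": [16, 16 * k, 32 * k, 64 * k],
    }


print(wrn_config(40, 2))
print(wrn_config(28, 10))
```

Under these assumptions WRN-28-10 has 4 blocks per group with 160, 320 and 640 channels, while WRN-40-2 has 6 blocks per group with 32, 64 and 128 channels.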
Fig. 5 test error(%) over 5 runs

Let us note one thing here, pre-act-Resnet is the 1001-layer deep network.

As the authors increased the widening parameter k, they had to decrease the total number of layers. To find an optimal ratio they experimented with k from 2 to 12 and depth from 16 to 40. The results are presented in Fig. 5. As can be seen, all networks with 40, 22 and 16 layers see consistent gains when width is increased by 1 to 12 times. On the other hand, when keeping the same fixed widening factor k = 8 or k = 10 and varying depth from 16 to 28 there is a consistent improvement; however, when they further increase depth to 40, accuracy decreases (e.g., WRN-40-8 loses in accuracy to WRN-22-8).
As can be observed, wide WRN-40-4 compares favorably to thin ResNet-1001, as it achieves better accuracy on both CIFAR-10 and CIFAR-100. Yet it is interesting that these networks have a comparable number of parameters, 8.9×10^6 and 10.2×10^6, suggesting that depth does not add regularization effects compared to width at this level. As they show further in benchmarks, WRN-40-4 is 8 times faster to train, so evidently the depth to width ratio in the original thin residual networks is far from optimal.
Also, wide WRN-28-10 outperforms thin ResNet-1001 by 0.92% (with the same mini-batch size during training) on CIFAR-10 and 3.46% on CIFAR-100, having 36 times fewer layers (see Fig. 5).

To summarize:
• widening consistently improves performance across residual networks of different depth;
• increasing both depth and width helps until the number of parameters becomes too high and stronger regularization is needed;
• there doesn’t seem to be a regularization effect from very high depth in residual networks, as wide networks with the same number of parameters as thin ones can learn the same or better representations. Furthermore, wide networks can successfully learn with a 2 or more times larger number of parameters than thin ones, which would require doubling the depth of thin networks, making them infeasibly expensive to train.

Please find the implementation code in the Colab below, and try it with your own dataset:

Disclaimer: The Wide Residual Networks paper was such a good read that I mostly quoted it directly (word for word), simply because the explanation is elegantly simple and beautiful. I believe this is a step in the right direction with regard to future architectures and the future of AI in general.

You can find all the papers in the References section below.

“Going deep is good, but in order to solve intelligence or at least have a shot at it we need to go WIDE AND JUST THE RIGHT AMOUNT OF DEPTH” — Prince Canuma