A comparative study of different Convolutional Neural Network architectures

Source: Deep Learning on Medium

The story starts with the ImageNet Large Scale Visual Recognition Challenge. The ImageNet project is a large visual database designed for use in visual object recognition research. More than 14 million images have been hand-annotated by the project to indicate what objects are pictured. ImageNet contains more than 20,000 categories.


On September 30, 2012, AlexNet was submitted for the challenge and outperformed everything that had come before. It beat the models built on hand-engineered features by a huge margin, with a top-5 error of 15.3% against 26.2% for the runner-up. It showed the world the power of Convolutional Neural Networks. Even though heavy computation was hard to come by and the architects had to compromise on the number of layers, the result was promising and led to a lot of research in the field.

The following image represents the architecture of AlexNet.

It has 5 convolutional layers and 3 fully connected layers. The first thing you notice in the image is that the architecture is split into 2 parts. The design idea is that one half runs on one GPU while the other runs on a second GPU. This parallelism made up for the limited compute and memory of GPUs at the time. The input image size is 224 x 224 x 3 (the crop size commonly used with ImageNet), and the first layer has 96 filters, each of size 11 x 11 x 3, applied with a stride of 4.

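Hence the output size becomes:

(W - F + 2P)/S + 1 = (227 - 11 + 0)/4 + 1 = 55, giving 55 x 55 x 96

Here W is the input width, F the filter size, P the padding and S the stride. Note that a 224-wide input with no padding does not divide evenly. The 55 x 55 output corresponds to an effective input of 227 x 227 (or 224 x 224 with a padding of 2, rounding down).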
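The output-size arithmetic for a convolutional layer can be checked with a few lines of Python. This is a minimal sketch: the helper name `conv_output_size` is ours, and the padding of 2 in the second call is an assumption about how a 224-pixel input can be made to yield 55.

```python
def conv_output_size(width: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a convolution: floor((W - F + 2P) / S) + 1."""
    return (width - kernel + 2 * padding) // stride + 1


# First AlexNet layer: 11 x 11 filters, stride 4.
print(conv_output_size(227, 11, 4))             # 227-pixel input, no padding -> 55
print(conv_output_size(224, 11, 4, padding=2))  # 224-pixel input, padding 2 (assumed) -> 55
```

The depth of the output (96) is simply the number of filters, so it does not appear in the spatial formula.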

The rest of the architecture can be seen from the image.

The activation function used is ReLU, which outputs zero for negative inputs, so no negative values propagate through the network. In the AlexNet paper, a small CNN with ReLU reached 25% training error on CIFAR-10 about 6 times faster than the same network with tanh.


The next architecture, VGG, comes from the Visual Geometry Group at Oxford. It marks an improvement over AlexNet. A major difference is that it uses smaller filters than AlexNet. The 11 x 11 x 3 filter in AlexNet is replaced by 3X3 filters stacked one after another. The pooling is also kept at 2X2. VGG opts for a simple architecture which captures every small detail, and compensates for the small filters by having a much deeper network. The downside is the increase in the number of parameters, which reaches about 138 million for the 16-layer version.
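Why stack small filters instead of using one large one? A stack of stride-1 3X3 convolutions covers the same receptive field as a single larger filter while using fewer weights. The sketch below checks this for one case. The channel count of 256 is a hypothetical value picked for illustration, and the helper names are ours.

```python
def conv_weights(kernel: int, channels: int) -> int:
    """Weights in one kernel x kernel conv layer with `channels` in and out (no bias)."""
    return kernel * kernel * channels * channels


def stacked_receptive_field(kernel: int, layers: int) -> int:
    """Receptive field of `layers` stacked stride-1 convolutions of equal kernel size."""
    return 1 + layers * (kernel - 1)


C = 256  # hypothetical channel count, for illustration only

for layers, big in [(2, 5), (3, 7)]:
    # The stack sees the same input window as the single large filter.
    assert stacked_receptive_field(3, layers) == big
    stacked = layers * conv_weights(3, C)
    single = conv_weights(big, C)
    print(f"{layers} x (3x3): {stacked:,} weights vs one {big}x{big}: {single:,} weights")
```

The stack also applies a non-linearity after every 3X3 layer, which a single large filter cannot do.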

The VGG convolutional layers are followed by 3 fully connected layers. The width of the network (the number of channels) starts at a small value of 64 and doubles after every sub-sampling/pooling layer until it reaches 512.
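To get a feel for where the parameters go, here is a rough count for the 16-layer variant. Treat it as a sketch under assumptions: the layer list is the 16-layer configuration as we recall it from the VGG paper, the input is 224 x 224 x 3, and there are 1000 output classes. The function name is ours.

```python
# Numbers are output channels of 3x3 convolutions, "M" is a 2x2 max-pool.
# Assumed layout of the 16-layer VGG configuration.
VGG16_CONFIG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]


def count_vgg_parameters(config, in_channels=3, image_size=224, num_classes=1000):
    """Count weights and biases of the conv layers and the 3 fully connected layers."""
    total = 0
    channels, size = in_channels, image_size
    for item in config:
        if item == "M":
            size //= 2                               # 2x2 pooling halves width and height
        else:
            total += 3 * 3 * channels * item + item  # 3x3 kernel weights + biases
            channels = item
    features = channels * size * size                # flattened input to the first FC layer
    for out_features in (4096, 4096, num_classes):
        total += features * out_features + out_features
        features = out_features
    return total


print(f"{count_vgg_parameters(VGG16_CONFIG):,}")     # 138,357,544
```

Most of the total sits in the first fully connected layer (7 x 7 x 512 inputs to 4096 outputs), not in the convolutions.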


Well, VGG’s performance in the 2014 ImageNet challenge was phenomenal. Yet it was only the runner-up. The winner was Inception (GoogLeNet). What was the problem with VGG? Why another network? With its width increasing after every pooling layer, VGG became extremely resource intensive, i.e., it had huge computational requirements, both in time and memory.

There has been a long-followed intuition in deep learning: the deeper you go, the better. Many deep learning practitioners have followed this intuition and built deeper networks that yielded no improvement. There are many reasons why a deeper network made of convolutional layers stacked on one another fails, and Inception addresses a few of them.

Well, one of the biggest drawbacks of a deeper network is that it is prone to overfitting, especially when the data is small. Another drawback is the huge amount of computation that comes with a deeper network, making it almost impossible to train, except on really, really powerful systems.

Another problem which Inception addresses, though not one related to deeper networks, is that the features our Convolutional Neural Network is supposed to identify can be local or global. A small filter size is well suited for the former and a large filter size for the latter. Inception tackles this problem by using filters of multiple sizes on the same level.

The 1X1 conv, 3X3 conv, 5X5 conv and 3X3 max pooling are done for the previous input and stacked together. Each of these filters extracts different kinds of features and feature maps at different paths are concatenated together to form the input for the next module.
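A small bookkeeping sketch shows what the concatenation does to the channel count. The branch widths (64, 128, 32) and the 192 input channels are hypothetical values chosen for illustration, and the function name is ours.

```python
def naive_inception_output_channels(in_channels, c1x1, c3x3, c5x5):
    """Channels after concatenating the four branches of the naive module.

    The 3x3 max-pooling branch has no filters of its own, so it passes
    every input channel straight through to the concatenation.
    """
    return c1x1 + c3x3 + c5x5 + in_channels


channels = 192  # hypothetical input depth
for block in range(1, 4):
    channels = naive_inception_output_channels(channels, 64, 128, 32)
    print(f"after block {block}: {channels} channels")
```

Because the pooling branch always carries the full input depth forward, the depth can only grow from block to block, and every 3X3 and 5X5 filter in the next block has to span all of those channels.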

Oops. Too much computation.

Can we somehow reduce the computation, perhaps by introducing dimensionality reduction before passing on the inputs to 3X3 and 5X5? That will not only decrease computation but also decrease overfitting.

Yes, we can!

Now look at this modification:

We add a 1X1 convolution, with fewer filters than the number of input channels, before the 3X3 and 5X5 convolutions. Before reading this article any further, try thinking about why this works.

Well, the explanation goes like this.

Suppose I had a 36X36X120 input (120 is the number of input channels) coming into my 5X5 filter and the desired number of output channels is 48 with no change in spatial shape, i.e., 36X36.

The total number of operations = 36 X 36 X 120 X 5 X 5 X 48 = 186,624,000. Let me remind you that this is one 5X5 conv operation in one of the inception blocks.

186 million operations.

Now try using 1X1 conv with 16 output channels before 5X5 conv.

The total number of operations = (36 X 36 X 120 X 1 X 1 X 16) + (36 X 36 X 16 X 5 X 5 X 48) = 2,488,320 + 24,883,200 = 27,371,520.

A reduction from 186 million to 27 million. That is how powerful a 1X1 conv is. As you can see, the number of channels was 120, got reduced to 16, and was then increased to 48. This architecture, due to its shape, is called the bottleneck architecture.
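The operation counts are easy to reproduce. The sketch below counts one multiplication per filter weight per output position, which is the same convention used in the calculation above. The function name is ours.

```python
def conv_operations(height, width, in_channels, kernel, out_channels):
    """Multiplications for a stride-1, shape-preserving kernel x kernel convolution."""
    return height * width * in_channels * kernel * kernel * out_channels


direct = conv_operations(36, 36, 120, 5, 48)       # 5x5 conv applied directly
reduce_1x1 = conv_operations(36, 36, 120, 1, 16)   # 1x1 conv down to 16 channels
conv_5x5 = conv_operations(36, 36, 16, 5, 48)      # 5x5 conv on the reduced input
bottleneck = reduce_1x1 + conv_5x5

print(f"direct 5x5:       {direct:,}")             # 186,624,000
print(f"1x1 then 5x5:     {bottleneck:,}")         # 27,371,520
print(f"reduction factor: {direct / bottleneck:.1f}x")
```

Changing the 16 in the 1X1 step shows the trade-off: fewer intermediate channels save more computation but squeeze the information passed to the 5X5 filter.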

This combination of features from filters of three different sizes is also effective because each of the three acts like a different model on the same input, making the block behave like an ensemble. Ensembles combine the outputs of different models, so they are often able to learn more and also reduce overfitting through a reduction in variance.

Now that I have explained the Inception block architecture and how Inception deals with the above-mentioned problems, let’s see how the overall network looks.

The network is simply many Inception blocks stacked on one another. It achieved 93.3% top-5 accuracy on ImageNet and is much faster than VGG.

Residual Networks:

This one is perhaps the simplest, yet one of the best networks so far. Have a look at this:

The yellow lines indicate the training and test error on a 20 layered CNN and the red lines indicate the same for a 56 layered CNN on the CIFAR-10 dataset. As can be seen, the deeper network is performing poorly when compared to a relatively shallow network. While discussing Inception Networks, I have covered a few reasons why this happens but Inception too faces this problem after 30 layers.

Consider a network having n layers. This network produces some training error. Now consider a deeper network with m layers (m>n). When we train this network, we expect it to perform at least as well as the shallower network. Why? Replace the first n layers of the deep network with the trained n layers of the shallower network and replace the remaining (m−n) layers in the deeper network with an identity mapping (i.e., these layers pass on what is fed into them without changing it in any way). Thus, our deeper model can easily represent the shallower model. If there exists a more complex representation of the data, we expect the deep model to learn it. But, as shown in the image above, the reverse happens.
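The construction in this argument can be written down directly. In the toy sketch below, `shallow_network` is a hypothetical stand-in for the trained n-layer model (any fixed function would do), and the extra layers are plain identity mappings.

```python
def shallow_network(x):
    """Stand-in for a trained n-layer network (hypothetical, for illustration)."""
    return [max(0.0, 2.0 * v - 1.0) for v in x]


def identity_layer(x):
    """A layer that returns its input unchanged."""
    return list(x)


def deeper_network(x, extra_layers=3):
    """The shallow network followed by (m - n) identity layers."""
    out = shallow_network(x)
    for _ in range(extra_layers):
        out = identity_layer(out)
    return out


sample = [0.2, 1.5, 3.0]
# The deeper network reproduces the shallow one exactly, so its error cannot be higher.
print(deeper_network(sample) == shallow_network(sample))  # True
```

So a solution at least as good as the shallow one exists for the deeper network. The puzzle is that training does not find it.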


The problem must be something else. Something fundamentally wrong with the way our network learns. When the weights of each layer were analysed after each iteration, as the number of iterations progressed, it was observed that the weights of the lower layers were not changing at all. Hmm… what can be the reason? Before reading any further, kindly take a 2-minute pause and think why this problem is occurring.

Suppose the activation we use is sigmoid. During backpropagation, the activation at each layer is differentiated. The derivative of sigmoid is sigmoid(x) * (1 - sigmoid(x)), which ranges from 0 to 0.25. Imagine a 50-layer network with sigmoid activations. The gradient used to update the weights of the first layer contains the sigmoid derivative multiplied about 50 times, which is at most (0.25)⁵⁰, roughly 8 x 10⁻³¹. Thus, the gradient used to update the first layer will be close to zero, and the weights of the initial layers barely change. These layers are as good as dead because they are not learning anything. This problem is called the vanishing gradient problem.
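A quick numerical check of the sigmoid derivative, sigmoid(x) * (1 - sigmoid(x)), and of how fast a product of 50 such factors shrinks. This is only a sketch: it ignores the weight matrices, which also appear in the real chain-rule product.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)


# Largest value of the derivative on a grid over [-10, 10]; it peaks at x = 0.
largest = max(sigmoid_derivative(i / 100.0) for i in range(-1000, 1001))
print(largest)        # 0.25

# Upper bound on the product of 50 sigmoid derivatives.
print(largest ** 50)  # about 7.9e-31
```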

There are many solutions to this problem. ResNet is one of the most effective.

This is what a residual block looks like (usually F(x) follows the bottleneck architecture: first decrease the number of channels, then increase it again):

Why the name Residual Block? Let the input to a neural network block be x and the true mapping to be learned be H(x). Let the residual be R(x). Then, H(x) = x + R(x). A normal neural network block would try to learn H(x) directly. Here, because the identity skip connection re-introduces x after the block, the block only has to learn H(x) - x, which is the residual. Hence the name residual block.
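In code, a residual block is just "input plus whatever the layers compute". The sketch below uses plain Python lists instead of tensors. `zero_residual` and `small_residual` are hypothetical stand-ins for the learned layers F(x), not real convolutions.

```python
def residual_block(x, residual_fn):
    """Return H(x) = x + R(x): the block's layers only model the residual R(x)."""
    return [xi + ri for xi, ri in zip(x, residual_fn(x))]


def zero_residual(x):
    """Layers whose weights have all been driven to zero."""
    return [0.0 for _ in x]


def small_residual(x):
    """A hypothetical learned correction, here just 10% of the input."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_residual))   # identity mapping: [1.0, -2.0, 3.0]
print(residual_block(x, small_residual))  # input plus a small correction
```

With zero weights the block reduces to the identity mapping, which is exactly the behaviour a plain stack of layers struggles to learn.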

Now we have two situations to consider while training, one where we are moving forward in the network and one where we are backpropagating.

Let’s say we have n layers and the error after n layers is 0.01. If, on adding 2 more layers, our error still remains 0.01, then these layers are redundant and will only add to the computation. In general, we do not know the optimal number of layers for a neural network, which may depend on the complexity of the dataset. Instead of treating the number of layers as an important hyperparameter (something which the user provides) to tune, by adding skip connections we allow the network to skip the layers that are not useful and do not add to the overall accuracy. In a way, skip connections make our neural networks dynamic, so that they can tune the effective number of layers during training. Thus, in this situation, the network will opt to move forward via the skip connection.

Now, during backpropagation, the gradient is passed through both F(x) and the skip connection. The gradient may shrink while passing through F(x), but the skip connection re-introduces it unchanged: since the block computes y = F(x) + x, an incoming gradient g leaves the block as g·F’(x) + g. This keeps the gradient from decaying as we move backward, thus avoiding the vanishing gradient problem. Residual Networks have made it possible for deep learning practitioners to train very deep networks. I personally have built a 152-layer network using this architecture. Experimentally, a plain 50-layer network has higher error than a plain 20-layer network. This is where we see the vanishing gradient problem. The same 50-layer network, when converted into a residual network, has much lower training error than the 20-layer residual network.
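The effect of the skip connection on the backward pass can be illustrated with a scalar toy model. Everything here is an assumption made for illustration: each layer is reduced to a single local derivative F'(x), and the value 0.01 is picked to mimic a badly saturated layer.

```python
def gradient_at_first_layer(local_derivatives, use_skip):
    """Multiply per-layer factors back to the first layer (scalar toy model).

    Plain layer    y = F(x):      factor F'(x)
    Residual layer y = F(x) + x:  factor F'(x) + 1
    """
    gradient = 1.0
    for d in local_derivatives:
        gradient *= (d + 1.0) if use_skip else d
    return gradient


derivatives = [0.01] * 50  # hypothetical: 50 layers, each with F'(x) = 0.01

plain = gradient_at_first_layer(derivatives, use_skip=False)
with_skip = gradient_at_first_layer(derivatives, use_skip=True)

print(f"plain network:    {plain:.3e}")      # vanishes
print(f"residual network: {with_skip:.3f}")  # stays of order 1
```

The "+ 1" contributed by the identity path keeps each factor away from zero even when F'(x) is tiny, so the product no longer collapses.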

Now, let’s look at how a ResNet looks when we stack these Residual Blocks on one another.

This network achieved 95.5% top-5 accuracy on ImageNet and is faster than VGG. These 4 architectures (AlexNet, VGG, Inception, ResNet) can be called the 4 main checkpoints in the history of Computer Vision using Convolutional Neural Networks. Once you know what these networks are, it is easier to decide which one to use. Analyse the dataset: if there are only a few features to consider, go for AlexNet or VGG, depending on the size of your features. Avoid VGG if you do not have the resources for very heavy computation. If you have a lot of features of varied sizes, go for Inception. If you want deeper networks, go for ResNets or, even better, go for Inception-ResNet by replacing F(x) in the residual block with an Inception block.

Fun fact: a lot of network architectures were not derived from reasoning, but from experimentation. The authors tried a new architecture, found that it worked, and then came up with a reason for why it works.

Keep on experimenting with your network’s architecture. Who knows, maybe you will stumble upon something new?

-Kiran Muthigi

Birla Institute of Technology, Mesra