Evolution of Convolutional Neural Network Architectures

Original article was published on Deep Learning on Medium

  • The network is 22 layers deep (27 layers if pooling is included); a very deep model when compared to its predecessors!
  • A 1×1 convolution with 128 filters for dimensionality reduction, followed by a rectified linear activation.
  • An average pooling layer with 5×5 filter size and stride 3.
  • A fully connected layer with 1024 units and ReLU.
  • A linear layer with softmax used for classification.
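The shapes produced by the layers listed above can be checked with a little arithmetic. This is only a rough sketch: it assumes the bullets describe GoogLeNet's auxiliary classifier head and that the head receives a 14×14×512 feature map (neither is stated in the list itself), and the function names are made up for illustration. The 1×1 convolution only changes depth and the pooling only changes spatial size, so the resulting shape does not depend on which of the two is applied first.

```python
def pooled_size(size, kernel, stride):
    """Output width/height of a pooling layer without padding."""
    return (size - kernel) // stride + 1


def aux_head_summary(size=14, channels=512):
    """Trace shapes and count parameters of the head (assumed 14x14x512 input)."""
    spatial = pooled_size(size, kernel=5, stride=3)   # 5x5 average pooling, stride 3
    conv_params = channels * 128 + 128                # 1x1 convolution, 128 filters (+ biases)
    flat = spatial * spatial * 128                    # flattened input to the dense layer
    fc_params = flat * 1024 + 1024                    # fully connected layer, 1024 units
    return (spatial, spatial, 128), conv_params, fc_params


shape, conv_params, fc_params = aux_head_summary()
print(shape, conv_params, fc_params)  # (4, 4, 128) 65664 2098176
```

Under these assumptions the 1×1 convolution is cheap, and almost all of the head's parameters sit in the fully connected layer.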


VGG-16 was the next big breakthrough in the deep learning and computer vision domains, as it marked the beginning of very deep CNNs. Earlier models such as AlexNet used large filters (11×11 and 5×5) in their initial layers; VGG replaced these with stacks of small 3×3 filters. This ConvNet, developed by Simonyan and Zisserman (2015), became one of the best-performing models of its time and fueled further research into deep CNNs.

VGG-16 Architecture | Source: Neurohive.io, Simonyan and Zisserman (2015)
  • It was trained on the ImageNet dataset and reached up to 92.7% top-5 test accuracy, placing it among the top entries of ILSVRC 2014 alongside GoogLeNet and well ahead of Clarifai, the 2013 winner.
  • It had roughly 138 million parameters to train, more than twice as many as other models in use at the time. Hence, it took weeks to train.
  • It had a very systematic architecture. Moving to deeper layers, the spatial dimensions of the feature maps halve after each pooling stage, while the number of channels (the number of filters used in each layer) doubles until it reaches 512.
  • A prominent drawback of this model was that it was extremely slow to train and huge in size, making it less practical for real-time deployment.
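The parameter count and the halve-and-double pattern described above can be reproduced with a short calculation. The sketch below assumes the commonly cited VGG-16 layout (five stages of 3×3 convolutions with 64, 128, 256, 512, and 512 filters, each followed by 2×2 max pooling, then three fully connected layers) and a 224×224 RGB input. None of these specifics appear in this article, so treat them as assumptions taken from the usual description of the model.

```python
# (filters, number of 3x3 conv layers) per stage: assumed VGG-16 layout
VGG16_STAGES = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]
FC_UNITS = [4096, 4096, 1000]


def vgg16_summary(input_size=224, in_channels=3):
    """Return (total parameters, [(spatial size, channels) after each stage])."""
    total = 0
    size, channels = input_size, in_channels
    stage_shapes = []
    for filters, num_convs in VGG16_STAGES:
        for _ in range(num_convs):
            total += 3 * 3 * channels * filters + filters  # weights + biases
            channels = filters
        size //= 2  # 2x2 max pooling with stride 2 halves the spatial size
        stage_shapes.append((size, channels))
    features = size * size * channels  # flattened input to the first dense layer
    for units in FC_UNITS:
        total += features * units + units
        features = units
    return total, stage_shapes


total_params, stage_shapes = vgg16_summary()
print(total_params)   # 138357544
print(stage_shapes)   # [(112, 64), (56, 128), (28, 256), (14, 512), (7, 512)]
```

In this layout the first fully connected layer alone accounts for over 100 million of the parameters, which is a large part of why the model is so big.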

ResNet | ResNeXt

ResNet was put forward by He et al. in 2015 as a model that could employ hundreds to thousands of layers while still providing compelling performance. The problem with very deep neural networks was the vanishing gradient: as the gradient is back-propagated through more and more layers, repeated multiplication can make it vanishingly small.

Residual Block | Source: He et al. (2015)

ResNet introduces “shortcut connections” that skip one or more layers. These shortcuts perform identity mappings, and their outputs are added to the outputs of the stacked layers. Using 152 layers (the deepest at the time), ResNet won the ILSVRC 2015 classification competition with a top-5 error of 3.57%. As interest in the research community grew, different interpretations of ResNet were developed. The model described next treats ResNet as an ensemble of many smaller networks.
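To make the shortcut idea concrete, here is a toy sketch rather than the actual ResNet implementation. It assumes each "layer" is just a scalar multiplication by a hypothetical weight w, so the gradient through a plain stack is the product of all w, while the gradient through residual blocks y = F(x) + x is the product of (1 + w). All function names are illustrative.

```python
def residual_block(x, stacked_layers):
    """Return F(x) + x, where F is the function computed by the stacked layers."""
    fx = stacked_layers(x)
    return [f_i + x_i for f_i, x_i in zip(fx, x)]


def plain_gradient(weights):
    """Gradient through a chain of scalar layers y = w * x: the product of all w."""
    grad = 1.0
    for w in weights:
        grad *= w
    return grad


def residual_gradient(weights):
    """Gradient through scalar residual blocks y = w * x + x: the product of (1 + w)."""
    grad = 1.0
    for w in weights:
        grad *= 1.0 + w
    return grad


# If the stacked layers output zeros, the block reduces to the identity mapping.
identity_out = residual_block([1.0, -2.0, 3.0], lambda v: [0.0] * len(v))

weights = [0.1] * 30                      # hypothetical small per-layer weights
g_plain = plain_gradient(weights)         # about 1e-30: the gradient has vanished
g_residual = residual_gradient(weights)   # stays above 1 thanks to the identity term
print(identity_out, g_plain, g_residual)
```

The identity term gives the gradient a direct path back to earlier layers, which is the intuition behind why such networks remain trainable at great depth.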

A block of ResNeXt with cardinality = 32 | Source: Xie et al. (2017)

Xie et al. proposed this variant of ResNet, called ResNeXt. It resembles the Inception module in that both follow a split-transform-merge strategy; however, in ResNeXt the outputs of the different paths are added together, whereas in Inception they are depth-concatenated. Furthermore, every ResNeXt path has the same topology, whereas Inception uses different topologies for different paths (1×1, 3×3, and 5×5 convolutions).

  • The authors introduce cardinality (the number of parallel paths), a hyperparameter that makes the model adaptable to different datasets; higher cardinality values improved accuracy.
  • In its equivalent grouped-convolution form, the block divides the input into groups of feature maps and convolves each group separately; the outputs are then concatenated along the depth dimension and fed into a 1×1 convolution layer.
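A quick parameter count illustrates how cardinality trades path width for path count at roughly constant cost. The sketch below assumes a block with 256 input and output channels, a ResNet bottleneck of width 64, and a ResNeXt block with cardinality 32 and 4 channels per path. Only the cardinality of 32 comes from the figure caption; the other numbers are assumptions chosen to match a commonly described configuration. Biases and normalization layers are ignored.

```python
def bottleneck_params(in_ch, width, out_ch):
    """Weights of one 1x1 -> 3x3 -> 1x1 path (no biases)."""
    return in_ch * width + 3 * 3 * width * width + width * out_ch


def resnext_params(in_ch, cardinality, path_width, out_ch):
    """Weights of `cardinality` identical parallel paths whose outputs are summed."""
    return cardinality * bottleneck_params(in_ch, path_width, out_ch)


def grouped_form_params(in_ch, cardinality, path_width, out_ch):
    """Same block written as 1x1 conv -> grouped 3x3 conv -> 1x1 conv."""
    mid = cardinality * path_width
    grouped_3x3 = cardinality * (3 * 3 * path_width * path_width)
    return in_ch * mid + grouped_3x3 + mid * out_ch


resnet = bottleneck_params(256, 64, 256)
resnext = resnext_params(256, 32, 4, 256)
grouped = grouped_form_params(256, 32, 4, 256)
print(resnet, resnext, grouped)  # 69632 70144 70144
```

Under these assumptions, 32 narrow paths cost about the same as one wide bottleneck, and the multi-path and grouped-convolution forms have identical weight counts.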

DenseNet | CondenseNet

The idea of DenseNet stemmed from the intuition that CNNs can be substantially deeper, more accurate, and more efficient to train if they contain shorter connections between layers close to the input and layers close to the output. In short, every layer is connected to every other layer in a feed-forward fashion.

5-layer dense block with a growth rate of k = 4 | Source: Huang et al. (2016)
  • Has L(L+1)/2 direct connections in a network of L layers, since each layer receives the feature maps of all preceding layers.
  • Substantially reduces the number of parameters, alleviates the vanishing-gradient problem, and encourages feature reuse and feature propagation.
  • On the ImageNet dataset, the model is on par with state-of-the-art ResNets while requiring fewer parameters and less computation.
  • Can be scaled to hundreds of layers with no optimization difficulties.
  • Shows no sign of degradation or overfitting as the number of parameters increases; accuracy keeps improving.
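The connection count and the effect of the growth rate k can be checked by enumeration. In this small sketch, each layer is assumed to take the block input plus the outputs of all earlier layers and to emit k new feature maps. The number of input channels k0 is a hypothetical value picked for illustration, and the function names are not from any library.

```python
def dense_connections(num_layers):
    """List (source, target) pairs; source 0 denotes the block input."""
    return [(src, dst)
            for dst in range(1, num_layers + 1)
            for src in range(dst)]


def layer_input_channels(k0, growth_rate, num_layers):
    """Feature maps seen by each layer: k0 plus k from every earlier layer."""
    return [k0 + growth_rate * (layer - 1)
            for layer in range(1, num_layers + 1)]


L, k, k0 = 5, 4, 6   # 5 layers, growth rate 4, k0 = 6 is an assumed input depth
connections = dense_connections(L)
channels = layer_input_channels(k0, k, L)
print(len(connections), L * (L + 1) // 2)  # 15 15
print(channels)                            # [6, 10, 14, 18, 22]
```

Because every layer adds only k feature maps to the shared state, each layer can stay narrow, which is where the parameter savings come from.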

CondenseNet was proposed in 2018 by Huang et al. as an improved version of DenseNet with better efficiency. It combines dense connectivity with a novel module called learned group convolution, which facilitates feature reuse and removes connections that are unnecessary. It is reported to be easy to implement and to outperform networks such as ShuffleNet and MobileNet in computational efficiency at the same accuracy.