An overview of image classification networks

Learning about the different network architectures for image classification is a daunting task. In this blog post I will discuss the main architectures that are currently available in the keras package. I will go through these architectures in a chronological order and attempt to discuss their advantages and disadvantages from the perspective of a practitioner.

Key concepts

Although different researchers in the computer vision field tend to follow different practices, overall you can see the following trends when setting up experiments. I discuss how the images are pre-processed, what type of data augmentation is used, the optimisation mechanism and the implementation of the final layer.

Pre-processing
Often, the mean pixel value is computed over the training set and subtracted from each image. It is important to take this into account when using these models with keras: keras provides a separate ‘preprocess_input’ function for each of the computer vision models.
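As a minimal sketch, plain mean subtraction looks as follows (using the ImageNet per-channel RGB means, which is what e.g. the VGG pre-processing uses). In practice you should call the matching `preprocess_input` function from `keras.applications`, since some models additionally reorder channels or rescale values:

```python
import numpy as np

# Per-channel mean pixel values over the ImageNet training set (RGB order).
IMAGENET_MEAN_RGB = np.array([123.68, 116.779, 103.939])

def subtract_mean(images):
    # Broadcasts over a batch of shape (N, H, W, 3).
    return images - IMAGENET_MEAN_RGB

batch = np.full((2, 224, 224, 3), 128.0)   # dummy batch of grey images
centred = subtract_mean(batch)
print(centred[0, 0, 0])                    # roughly [4.32, 11.22, 24.06]
```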

Data augmentation
Image classification research datasets are typically very large. Nevertheless, data augmentation is often used in order to improve generalisation properties. Typically, random cropping of rescaled images together with random horizontal flipping and random RGB colour and brightness shifts are used. Different schemes exist for rescaling and cropping the images (i.e. single-scale vs. multi-scale training). Multi-crop evaluation at test time is also often used, although it is computationally more expensive and yields only a limited performance improvement. Note that the goal of the random rescaling and cropping is to learn the important features of each object at different scales and positions. Keras does not implement all of these data augmentation techniques out of the box, but they can easily be implemented through the preprocessing function of the ImageDataGenerator module. Andrew Howard’s paper on data augmentation explains the key methods in more depth.
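A minimal NumPy sketch of random cropping and flipping (the crop size and the input resolution below are arbitrary choices); a function like this can be plugged into ImageDataGenerator via its preprocessing_function argument:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_and_flip(image, crop=224):
    """Take a random crop from a (rescaled) image and flip it
    horizontally with probability 0.5."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]          # horizontal flip
    return patch

image = rng.random((256, 256, 3))       # stand-in for a rescaled image
augmented = random_crop_and_flip(image)
print(augmented.shape)                  # (224, 224, 3)
```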

An example of different crops of the same picture (Image taken from Andrew Howard’s paper).

Training mechanism
Models are typically trained with a batch size of 256, using multi-GPU data parallelism, which is also available in keras. Either SGD with momentum or RMSProp is often used as the optimisation technique. Learning rate schedules are often fairly simple: either the learning rate is lowered when the validation loss or accuracy starts to stabilise, or it is lowered at fixed intervals. With the ‘ReduceLROnPlateau’ callback in keras, you can easily mimic this behaviour.
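The plateau schedule can be sketched in a few lines of plain Python (the learning rate, factor and patience values below are arbitrary); in keras you would instead pass keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3) to model.fit:

```python
def reduce_lr_on_plateau(val_losses, lr=0.1, factor=0.1, patience=3):
    """Multiply the learning rate by `factor` whenever the validation
    loss has not improved for `patience` consecutive epochs.
    Returns the learning rate used at each epoch."""
    best, wait, history = float("inf"), 0, []
    for loss in val_losses:
        history.append(lr)
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
    return history

# The loss improves for three epochs, then plateaus for three epochs,
# at which point the learning rate is cut by 10x.
lrs = reduce_lr_on_plateau([1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.6])
print(lrs)   # six epochs at 0.1, then 0.01
```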

An example of a training procedure where the LR is reduced when a plateau in the loss is noticed.

Final layer
The final layer in an image classification network was traditionally a fully connected layer. These layers are massive parameter hogs, as you need N×M parameters to go from N to M hidden layer nodes. Nowadays, these layers have been replaced by global average or max pooling layers, which require fewer parameters and less computation time. When fine-tuning pre-trained networks in keras, it is important to take this into account to limit the number of parameters that are added.
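The difference is easy to quantify; assuming a final 7×7×512 feature map (as in VGG) feeding a 4096-unit hidden layer:

```python
# Flattening a 7x7x512 feature map into a 4096-unit dense layer:
flatten_then_dense = 7 * 7 * 512 * 4096 + 4096   # weights + biases

# Global average pooling first reduces the map to a 512-vector with zero
# parameters, so the same dense layer becomes far cheaper:
pool_then_dense = 512 * 4096 + 4096

print(flatten_then_dense)   # 102764544, i.e. ~103M parameters
print(pool_then_dense)      # 2101248, i.e. ~2M parameters
```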


VGGNet
Originally published in 2014 by Karen Simonyan and Andrew Zisserman, VGGNet showed that stacking multiple layers is a critical component for good performance in computer vision. Their published networks contain 16 or 19 layers and consist primarily of small 3×3 convolutions and 2×2 pooling operations.

The authors’ main contribution is showing that stacking multiple small filters without pooling can increase the representational depth of the networks while limiting the number of parameters. By stacking e.g. three 3×3 conv. layers instead of using a single 7×7 layer, several limitations are overcome. First, three non-linear functions are combined instead of a single one, which makes the decision function more discriminative and expressive. Second, the number of parameters is reduced: the single 7×7 layer requires 81% more weights than the stack of 3×3 layers (49C² vs. 27C² for C channels), while the receptive field stays the same. Working with smaller filters thus also acts as a regulariser and improves the effectiveness of the different convolutional filters.
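The parameter comparison from the paper is easy to reproduce (the channel count C is arbitrary; biases are ignored):

```python
# Weights needed for C input channels and C output channels:
C = 64
single_7x7 = 7 * 7 * C * C        # 49 * C^2
three_3x3 = 3 * (3 * 3 * C * C)   # 27 * C^2

# Same 7x7 receptive field, but the single layer needs ~81% more weights:
print(single_7x7 / three_3x3)     # ~1.81
```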

A downside of VGGNet is that it is more expensive to evaluate than shallower networks and uses a lot more memory and parameters (~140M). Many of these parameters can be attributed to the first fully connected layer. It was shown that these layers can be removed without degrading performance, while significantly reducing the number of necessary parameters. VGG is available in keras with pre-trained weights, in both the 16- and 19-layer variants.
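Loading it takes one line (pass weights='imagenet' instead of None to download the pre-trained ImageNet weights; VGG19 is available from the same module):

```python
from tensorflow.keras.applications import VGG16

# Build the 16-layer variant with randomly initialised weights.
model = VGG16(weights=None)
print(f"{model.count_params():,}")   # ~138 million parameters
```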


ResNet
The ResNet architecture was developed by Kaiming He et al. in an attempt to train networks of even greater depth. The authors noted that increasing the network depth resulted in a higher training loss, indicating potential training convergence issues due to gradient problems (exploding/vanishing gradients).

Although the space of potential functions of a 20-layer network is encapsulated within the space of a 56-layer network, with conventional gradient descent it is not possible to achieve the same results. (Image taken from the ResNet paper)

Their main contribution is the addition of skip connections to neural network architectures, using batch normalisation and removing the fully connected layers at the end of the network.

With a skip connection, the input x of a convolutional layer is added to the output. As a result, the network must only learn ‘residual’ features and existing learned features are easily retained (Image taken from the ResNet paper).

Skip connections are based on the idea that as long as a neural network model is able to ‘properly’ propagate the information from the previous layers to the next layer, it should be able to become ‘indefinitely’ deep. In case no additional information is aggregated by going deeper, a convolutional layer with a skip connection can act as the identity function.
By adding skip connections to the network, the default function of a convolutional layer becomes the identity function. Any new information the filters learn can be added to or subtracted from the base representation, so it is easier to optimise the residual mapping. Skip connections do not increase the number of parameters, but they do result in more stable training and a significant performance boost, because deeper networks can be trained (e.g. networks of depth 34, 50, 101 and 152). Note that 1×1 convolutions are used to map a layer’s input to its output when their dimensions differ!
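The core idea can be sketched in a few lines, where f stands in for the convolutional layers inside the block:

```python
import numpy as np

def residual_block(x, f):
    """Output = f(x) + x: the layers only have to learn the residual f,
    and when f outputs zeros the block reduces to the identity."""
    return f(x) + x

x = np.array([1.0, 2.0, 3.0])

# Filters that learned nothing leave the representation untouched...
identity_out = residual_block(x, lambda t: np.zeros_like(t))
assert np.allclose(identity_out, x)

# ...while learned features are added on top of the base representation.
print(residual_block(x, lambda t: 0.1 * t))   # ~[1.1, 2.2, 3.3]
```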

In addition to the skip-connections, batch normalisation was used after each convolution and before each activation. Finally, fully-connected layers were removed and instead an average pooling layer was used to reduce the number of parameters. The increased abstraction power of the convolutional layers due to a deeper network reduces the need for fully connected layers.


GoogLeNet
The GoogLeNet paper was published around the same time as the VGGNet paper but introduces different improvements. The previous two papers focused on increasing the representational depth of classification networks.
With GoogLeNet however, the authors still attempt to scale up networks (up to 22 layers), but at the same time they aim to reduce the number of parameters and the required computational power. The original Inception architecture was published by Google and focused on applying CNNs in big-data scenarios as well as mobile settings. The architecture is fully convolutional and consists of Inception modules. The goal of these modules is to increase the convolutional filters’ learning abilities and abstraction power by constructing a complex filter out of multiple building blocks (the ‘network in network’ idea behind Inception).
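A sketch of one Inception module in the keras functional API, with filter counts taken from the ‘3a’ module of the GoogLeNet paper (the 28×28×192 input shape matches that stage of the network):

```python
from tensorflow.keras import Input, Model, layers

inputs = Input(shape=(28, 28, 192))

# Four parallel branches; 1x1 convolutions reduce dimensionality first.
b1 = layers.Conv2D(64, 1, padding="same", activation="relu")(inputs)
b2 = layers.Conv2D(96, 1, padding="same", activation="relu")(inputs)
b2 = layers.Conv2D(128, 3, padding="same", activation="relu")(b2)
b3 = layers.Conv2D(16, 1, padding="same", activation="relu")(inputs)
b3 = layers.Conv2D(32, 5, padding="same", activation="relu")(b3)
b4 = layers.MaxPooling2D(3, strides=1, padding="same")(inputs)
b4 = layers.Conv2D(32, 1, padding="same", activation="relu")(b4)

# Branch outputs are concatenated along the channel axis: 64+128+32+32.
outputs = layers.Concatenate()([b1, b2, b3, b4])
model = Model(inputs, outputs)
print(model.output_shape)   # (None, 28, 28, 256)
```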

An example of an Inception module. The 1×1 convolutions are performed to reduce the dimensions of input/output (Image taken from the GoogLeNet paper).

In addition to Inception modules, the authors also used auxiliary classifiers to promote more stable and better convergence. The idea of auxiliary classifiers is that several different image representations are used to perform classification (yellow boxes). As a result, gradients are calculated at different layers in the model, which can then be used to optimise training.

A visual representation of the GoogLeNet architecture. The yellow boxes indicate the presence of auxiliary classifiers.


Inceptionv3
With the Inceptionv3 architecture, a couple of innovations are combined.
In Inceptionv3, the primary focus is on reusing some of the original ideas of GoogLeNet and VGGNet, i.e. using the Inception module and expressing large filters more efficiently with a series of smaller convolutions. In addition to small convolutions, the authors also experiment with asymmetric convolutions (e.g. replacing an n×n filter by an n×1 filter followed by a 1×n filter).
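The saving from this factorisation is easy to compute (per input/output channel pair, biases ignored):

```python
# An n x n filter needs n*n weights, while an n x 1 filter followed by
# a 1 x n filter needs only 2*n weights for the same receptive field.
n = 7
full = n * n          # 49
factorised = 2 * n    # 14
print(1 - factorised / full)   # ~0.71, i.e. ~71% fewer weights for n = 7
```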

An example of a 5×5 filter being replaced by two stacked 3×3 filters (Image taken from the Inceptionv3 paper).

The authors improved the regularisation of the network through batch normalisation and label smoothing. Label smoothing is the practice of assigning some weight to every class, instead of assigning the full weight to the ground-truth label. As the network will overfit less on the training labels, it should be able to generalise better; this is similar in effect to using L2 regularisation.
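A minimal sketch of label smoothing (ε = 0.1 is the value the paper uses); in tf.keras the same effect is available through the label_smoothing argument of the CategoricalCrossentropy loss:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Move eps of the probability mass from the ground-truth label
    to a uniform distribution over all k classes."""
    k = one_hot.shape[-1]
    return one_hot * (1 - eps) + eps / k

y = np.array([0.0, 0.0, 1.0, 0.0])
print(smooth_labels(y))   # ~[0.025, 0.025, 0.925, 0.025]
```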

A lot of care was put into ensuring that the model would perform well on both high- and low-resolution images, which is enabled by the Inception modules analysing the image representations at different scales. As a result, when Inception networks are used in an object detection framework, they perform well at classifying small and low-resolution objects.


NASNet
The last image classification architecture I will discuss is NASNet, which was constructed using the Neural Architecture Search (NAS) framework. The goal of NAS is to use a data-driven, intelligent approach to constructing the network architecture instead of intuition and experiments. Although I won’t go into the details of the framework, the general idea is as follows.
In the Inception paper, it was shown that a complex combination of filters in a ‘cell’ can significantly improve results. The NAS framework defines the construction of such a cell as an optimisation problem, and then stacks multiple copies of the best cell to construct a large network.

Ultimately, two different cells (a ‘normal cell’ and a ‘reduction cell’) are constructed and stacked to train the full model.


If you have any questions, I’ll be happy to read them in the comments. Follow me on Medium or Twitter if you want to receive updates on my blog posts!

Source: Deep Learning on Medium