CNN cheat sheet — the essential summary for a quick start

Original article was published by Aliaksei Mikhailiuk on Deep Learning on Medium

Every year, thousands of new papers are published at the top computer vision and computer graphics conferences such as ECCV, CVPR, ICCV and SIGGRAPH. It can be genuinely overwhelming to follow every new technique and method. However, new techniques are much easier to digest when they are put into context. This article focuses on Convolutional Neural Networks (CNNs), which form the backbone of deep models for image and video processing.

Very often, my students ask me: “Where to start?” Similarly, with such a variety of methods available, many slip from the minds of those with more experience. For that reason, this article is a quick start-up guide to CNNs for students and, hopefully, a reference point for those with more expertise. I am also making it a CNN cheatsheet for my future self to go back to and revise whenever the need arises.

Importantly, I have surely missed some gems, so please let me know and we will make this post better!


The success and wide adoption of CNNs followed the stunning win of AlexNet in the 2012 ImageNet competition, where the CNN reached a top-5 accuracy of 84.7%, compared with 73.8% for the second-best method. The performance was subsequently improved by deeper models (like Inception).

The superior performance of CNNs on images can be attributed to their structure, which is tailored to image data. A recent paper, Deep Image Prior, shows that the CNN architecture itself already incorporates a prior for natural scene statistics. Unlike fully connected networks, CNNs operate on the assumption that relationships among nearby pixels matter more than those among pixels far apart.

In contrast to fully connected layers, where every input is connected via an edge to every neuron in the subsequent layer, CNNs use convolutions as the base operation. Only pixels within a convolutional kernel are passed through to a given neuron in the subsequent layer, so CNNs capture local spatial relationships within the image. Because the same kernel weights are shared across all image positions, using CNNs for image data requires far fewer weights and therefore gives lightweight, easier-to-train models with the same or better predictive power than fully connected networks on image-based tasks.
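To make the weight saving concrete, here is a rough sketch that counts the parameters of one fully connected layer and one convolutional layer. The 224×224 RGB input, the 64 output units or filters and the 3×3 kernel are assumptions chosen purely for illustration, and the two layers do not produce the same kind of output, so the comparison is only indicative.

```python
def dense_params(in_features, out_features):
    # Every input is connected to every output neuron, plus one bias per neuron.
    return in_features * out_features + out_features


def conv_params(in_channels, out_channels, kernel_size):
    # One kernel_size x kernel_size filter per (input, output) channel pair,
    # shared across all spatial positions, plus one bias per output channel.
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


height, width, channels = 224, 224, 3  # assumed input size
fc = dense_params(height * width * channels, 64)
conv = conv_params(channels, 64, 3)

print(fc)    # 9633856
print(conv)  # 1792
```

The convolutional count does not depend on the image size at all, which is where most of the saving comes from.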

However, although architectural choices aimed at image data significantly improve the results, CNNs still run into problems typical of deep models. These include vanishing gradients, robustness, convergence, internal covariate shift and many more. More specific problems also arise: what if the local relationship assumption does not hold? When generating images with CNNs, how can we ensure that the results are perceptually pleasing? When using CNNs for classification or regression, how do we deal with images of varied sizes?

Vanishing gradient

Training difficulty increases with the size of the network. Each weight update is proportional to the partial derivative of the loss with respect to that weight. With saturating nonlinearities in the hidden layers (e.g., sigmoid), the gradient can become vanishingly small as it propagates towards the earlier layers of the network, because back-propagation multiplies it by the local derivative of every layer it passes through. In extreme cases, the network stops training completely. This problem is known as the vanishing gradient.
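The effect can be illustrated with a toy calculation. The sketch below multiplies the local sigmoid derivatives along a chain of layers, with all weights assumed to be 1 so that only the nonlinearity matters; the function names and the chain depths are made up for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25


def backprop_factor(pre_activations):
    # Product of the local derivatives along a chain of sigmoid layers.
    # Weights are assumed to be 1 to isolate the effect of the nonlinearity.
    factor = 1.0
    for z in pre_activations:
        factor *= sigmoid_grad(z)
    return factor


shallow = backprop_factor([0.0] * 2)   # 0.25 ** 2
deep = backprop_factor([0.0] * 20)     # 0.25 ** 20

print(shallow)
print(deep)
```

Even in the best case for sigmoid (pre-activations at zero), twenty layers shrink the gradient by a factor of roughly a trillion.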

This can be remedied with skip connections. The general idea is to combine the signal passing through the nonlinearity with the one bypassing it. An excellent article explaining these in more depth is here. Here I give examples of the two most commonly used.

Residual block

Residual layers are relatively simple, yet they improve training and performance significantly, as shown by the success of ResNet. In each residual layer, an input x is passed through the layer (as it would be in a standard layer), resulting in the output F(x), which is then summed with the original input x. During back-propagation, the gradient thus propagates back through the network via both the nonlinear branch F(x) and the identity path carrying the original x, resulting in much more efficient updates.
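A scalar sketch shows why the identity path helps. Each residual block computes x + F(x), so its local derivative is 1 + F'(x) rather than F'(x) alone. The branch F below is a hypothetical single tanh unit, chosen only because it saturates; real residual blocks contain convolutions and learned weights.

```python
import math


def f(x):
    # Hypothetical residual branch F(x): a single saturating unit.
    return math.tanh(x)


def f_grad(x):
    return 1.0 - math.tanh(x) ** 2


def plain_chain_grad(x, depth):
    # d(output)/d(input) for the plain stack x -> F(F(...F(x))).
    grad = 1.0
    for _ in range(depth):
        grad *= f_grad(x)
        x = f(x)
    return grad


def residual_chain_grad(x, depth):
    # Each block computes x + F(x), so its local derivative is 1 + F'(x).
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + f_grad(x)
        x = x + f(x)
    return grad


print(plain_chain_grad(2.0, 10))
print(residual_chain_grad(2.0, 10))
```

In the plain stack every factor is at most 1, so the product can only shrink; in the residual stack every factor is at least 1, so the gradient reaching the input never falls below the identity contribution.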

U-Net is a fully convolutional network originally designed for image segmentation. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. The idea behind U-Net is similar to that of residual layers; however, instead of bypassing a single layer, the signal from an earlier layer in the contracting path is linked (by concatenating feature maps) to the symmetric layer in the expanding path.
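The data flow of a U-Net style skip connection can be sketched with NumPy. This is a deliberately stripped-down illustration: average pooling and nearest-neighbour upsampling stand in for the learned convolutions of the real architecture, and the array shapes and helper names are assumptions made for this example.

```python
import numpy as np


def downsample(x):
    # 2x2 average pooling on a (channels, height, width) array.
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def upsample(x):
    # Nearest-neighbour upsampling by a factor of 2 in height and width.
    return x.repeat(2, axis=1).repeat(2, axis=2)


rng = np.random.default_rng(0)
encoder_features = rng.normal(size=(4, 8, 8))  # kept for the skip connection
bottleneck = downsample(encoder_features)      # contracting path: (4, 4, 4)
decoder_features = upsample(bottleneck)        # expanding path:   (4, 8, 8)

# U-Net style skip: concatenate encoder and decoder maps along the channel axis.
merged = np.concatenate([encoder_features, decoder_features], axis=0)
print(merged.shape)  # (8, 8, 8)
```

The merged tensor carries both the fine detail of the early layer and the coarser context from the bottleneck, which is what the subsequent decoder convolutions would consume.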

Activation functions

Another way to improve back-propagation is to change the activation function. In CNNs, saturating activations such as sigmoid have largely been replaced by ReLU; however, an arguably even better option is available: Swish. Swish, defined as x · sigmoid(βx), removes the kink of ReLU at zero, where the ReLU derivative jumps from 0 to 1, giving smoother gradient updates.
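The difference in smoothness is easy to check numerically. The sketch below implements ReLU and Swish with their derivatives in plain Python; the default β = 1 is an assumption (it can also be a learned parameter).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # The derivative jumps from 0 to 1 at x = 0.
    return 1.0 if x > 0 else 0.0


def swish(x, beta=1.0):
    return x * sigmoid(beta * x)


def swish_grad(x, beta=1.0):
    # d/dx [x * sigmoid(beta * x)], by the product rule.
    s = sigmoid(beta * x)
    return s + beta * x * s * (1.0 - s)


for x in (-1e-6, 1e-6):
    print(x, relu_grad(x), swish_grad(x))
```

Just either side of zero, the ReLU derivative switches between 0 and 1, while the Swish derivative stays close to 0.5 and varies continuously.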


The convergence of a network (the number of iterations required to reach the desired accuracy) can be affected by many things. One of them is internal covariate shift. Since training is a dynamic process, the distribution of a given hidden layer’s outputs changes as training progresses. In turn, the subsequent layer has to adjust to the changing distribution while learning its own weights. Normalizing each layer’s output can significantly improve training, as it reduces the problem of internal covariate shift. Here I give examples of two commonly used normalization methods. An excellent article covering remedies for internal covariate shift can be found here.

Algorithm 1: Batch Normalizing transform applied to activation x over a mini-batch

In batch normalization, along with the weights in the network, we train for the scale and shift parameters (γ and β).

Training is stabilized by standardizing each activation of the layer output with the mean and standard deviation computed over the batch. So that the model can still represent the scale of the input, we train the per-layer parameters γ and β, which are shared across batches and bring the activations to a stable magnitude regardless of the batch. Since we do not have access to per-batch statistics at test time, we instead use the mean and the standard deviation of the layer’s inputs estimated from the training dataset to standardize the layer outputs.
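A minimal NumPy sketch of the two modes follows, assuming inputs of shape (batch, features). The function names and the eps value are illustrative choices, not part of the original algorithm listing; in practice the training statistics are usually tracked as running averages, whereas here a single batch stands in for them.

```python
import numpy as np


def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: (batch, features). Statistics are computed per feature over the batch.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta, mean, var


def batch_norm_test(x, gamma, beta, mean, var, eps=1e-5):
    # At test time, reuse statistics estimated from the training data.
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
x_train = rng.normal(loc=3.0, scale=2.0, size=(64, 5))
gamma, beta = np.ones(5), np.zeros(5)

y_train, mean, var = batch_norm_train(x_train, gamma, beta)
y_test = batch_norm_test(x_train[:1], gamma, beta, mean, var)

print(y_train.mean(axis=0).round(6))  # close to 0 for every feature
print(y_train.std(axis=0).round(3))   # close to 1 for every feature
```

With γ = 1 and β = 0 the output is simply standardized; training γ and β lets the layer recover any other scale and offset it needs.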

Following the original paper’s procedure, for convolutional layers we also want the normalization to obey the convolutional property, so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, all the activations in a mini-batch are jointly normalized over all locations. In Algorithm 1, the pair of parameters γ and β is then learned per feature map, rather than per activation.
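For convolutional feature maps this amounts to taking the statistics over the batch and both spatial axes at once. The sketch below assumes a (batch, channels, height, width) layout, which is a convention chosen for this example.

```python
import numpy as np


def batch_norm_conv(x, gamma, beta, eps=1e-5):
    # x: (batch, channels, height, width).
    # One mean and variance per feature map, pooled over batch and locations.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # gamma and beta hold one value per feature map (channel).
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)


rng = np.random.default_rng(0)
x = rng.normal(loc=-1.0, scale=3.0, size=(8, 4, 6, 6))
gamma, beta = np.ones(4), np.zeros(4)

y = batch_norm_conv(x, gamma, beta)
print(y.mean(axis=(0, 2, 3)).round(6))  # close to 0 for every channel
```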

Smooth gradient updates also enable the use of much larger learning rates.

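Layer normalization computes the mean and standard deviation over all the hidden units of a layer for a single training example, rather than over the batch. As in batch normalization, each neuron is also given its own adaptive bias and gain, applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs precisely the same computation at training and test time.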
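Layer normalization takes its statistics over the features of each individual example, which is why it needs no separate test-time path. A minimal NumPy sketch, assuming inputs of shape (batch, features) and an illustrative eps value:

```python
import numpy as np


def layer_norm(x, gain, bias, eps=1e-5):
    # x: (batch, features). Statistics are computed per example over features,
    # so the result for one example never depends on the rest of the batch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias


rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(4, 16))
gain, bias = np.ones(16), np.zeros(16)

full_batch = layer_norm(x, gain, bias)
single_example = layer_norm(x[:1], gain, bias)

# Same output whether the example is processed alone or within a batch.
print(np.allclose(full_batch[:1], single_example))  # True
```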

Invariance and equivariance