CNN cheatsheet — the essential summary (Part 2)

Original article was published by Aliaksei Mikhailiuk on Deep Learning on Medium

When optimizing for the L2 loss (i.e., the mean squared difference between the generated image and the ground truth), the resulting images are often blurry and hence not very visually pleasing. Instead, we want a perceptual loss function: optimizing for it might not attain the highest PSNR, but the produced images would be sharp and perceptually pleasing.

For cases where a ground truth image is available, image quality metrics that have a high correlation with subjective human scores, such as FSIM and SSIM, can be used as a loss function. By optimizing to maximize their scores, we are learning to generate images with high quality. Another approach relies on the assumption that pre-trained deep architectures learn image statistics in the deep representations.

As such, a VGG network pre-trained on classification or image quality (LPIPS) can be used as a loss. Here, the ground truth image and the generated image are passed through the pre-trained network, and the resulting network features are compared using some distance metric, such as L2. When an exact match between the images is not required, contextual loss can be used.

Training that relies only on the features extracted from the deep network as a loss is unstable. Since the function implemented by the network is often not bijective, due to the pooling operations in the hidden layers, different inputs may result in identical latent representations. Thus the loss is usually a combination of the VGG loss and MSE or L1.
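To make the argument above concrete, here is a minimal sketch in plain Python. A single 2×2 max-pooling step stands in for the pre-trained network (an assumption made purely for illustration, a real VGG is far deeper), and the function names and weights are hypothetical. Two different images pool to the same features, so the feature loss alone cannot tell them apart, while the added pixel-wise term can.

```python
# Toy stand-in for a deep feature extractor: one 2x2 max-pool, stride 2.
# All names and the default weights are illustrative assumptions.

def max_pool_2x2(img):
    """2x2 max pooling with stride 2 over a list-of-lists image."""
    h, w = len(img), len(img[0])
    return [
        [max(img[r][c], img[r][c + 1], img[r + 1][c], img[r + 1][c + 1])
         for c in range(0, w, 2)]
        for r in range(0, h, 2)
    ]

def mse(a, b):
    """Mean squared error between two equally sized 2D lists."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def combined_loss(generated, target, feature_weight=1.0, pixel_weight=1.0):
    """Weighted sum of a feature-space loss and a pixel-space MSE."""
    feature_term = mse(max_pool_2x2(generated), max_pool_2x2(target))
    pixel_term = mse(generated, target)
    return feature_weight * feature_term + pixel_weight * pixel_term

# Two different images whose pooled features coincide.
target = [[9, 0], [0, 0]]
generated = [[0, 0], [0, 9]]

feature_only = mse(max_pool_2x2(generated), max_pool_2x2(target))
pixel_only = mse(generated, target)
total = combined_loss(generated, target)
```

Here `feature_only` is zero even though the images differ, whereas `pixel_only` and `total` are not, which is the motivation for mixing the two terms.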

For image generation, or where the reference image is not available, a common approach is to use generative models, of which the state-of-the-art method is Generative Adversarial Networks (GANs). The goal is to train a generator, a CNN for image tasks, that generates images replicating those in a pre-specified distribution. A second network, the discriminator, acts as a judge that tries to tell generated images from real ones.

Allowing for varied size images

Using the same CNN on images of varied size is desirable. However, when CNNs are used as classifiers, convolutional layers are usually followed by fully connected layers. And although convolutional layers are size invariant, fully connected layers operate on a pre-specified number of inputs.

Spatial pyramid pooling

Here, before the fully connected layer, we use a pooling layer that produces a pre-defined number of elements from the convolutional layer's output, regardless of its size. The key idea is to extract features from the images at different resolutions and use adaptive pooling, where we adjust the size of the pooling regions to achieve a fixed number of total output neurons.
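The adaptive pooling idea can be sketched in plain Python as follows. The pyramid levels (1, 2 and 4 bins per side) and the function names are assumptions chosen for illustration; the bin boundaries use a floor/ceil split so that every bin contains at least one element. Whatever the size of the input feature map, the output always has 1 + 4 + 16 = 21 values per channel.

```python
def adaptive_max_pool(fmap, bins):
    """Max-pool a 2D feature map (list of lists) into a bins x bins grid."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(bins):
        r0 = (i * h) // bins                   # floor of the bin start
        r1 = ((i + 1) * h + bins - 1) // bins  # ceil of the bin end
        for j in range(bins):
            c0 = (j * w) // bins
            c1 = ((j + 1) * w + bins - 1) // bins
            out.append(max(fmap[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
    return out

def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Concatenate adaptive max-pooling results over several grid sizes."""
    features = []
    for bins in levels:
        features.extend(adaptive_max_pool(fmap, bins))
    return features

# Two feature maps of different sizes give vectors of identical length.
small = [[r * 8 + c for c in range(8)] for r in range(6)]      # 6 x 8
large = [[(r * c) % 7 for c in range(5)] for r in range(13)]   # 13 x 5
vec_small = spatial_pyramid_pool(small)
vec_large = spatial_pyramid_pool(large)
```

Both `vec_small` and `vec_large` have the same length, so either could be fed to the same fully connected layer.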

The idea of building “pyramids” is not novel to imaging. The well-known SIFT algorithm is arguably one of the most successful examples of its application. Another approach to achieving scale invariance is to change the representation of the image to a point cloud and to use PointNet or its variants, which are translation and scale-invariant. More details on point clouds can be found here.

Taking spatial relations into account

A typical CNN is capable of extracting relationships between parts of the image that are far apart. By virtue of convolutions and pooling layers, the input signal propagating through the network is compressed in its width and height and is usually expanded in depth. This way, the high-level context of the signal is learned. However, since the network compresses the information in deeper layers, weak relationships might be lost.

Thus, when the relationships within the image are more complex, the predictive power of CNNs based on the assumption of local pixel relations can be substantially reduced. For example, suppose the ball is removed from an image of a football game, and the network is trained to predict the position of the ball based on where the players look. This is an example of a problem where the locality assumption no longer holds. To tackle this kind of problem, the convolution operation can be extended to have kernels with a larger receptive field.

In a simple convolution we would have a 3×3 kernel; for now, let’s imagine that all nine elements of this kernel are non-zero. In 2-dilated convolutions (b in the figure below), we would be operating on a sparse kernel of size 5×5 with nine non-zero elements.
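The arithmetic above can be checked with a small sketch in plain Python. The helper name `dilate_kernel` is hypothetical; it simply spreads the weights of a dense kernel onto a sparse grid, giving an effective size of dilation × (k − 1) + 1 per side.

```python
def dilate_kernel(kernel, dilation):
    """Spread a dense k x k kernel onto a sparse grid with the given dilation."""
    k = len(kernel)
    size = dilation * (k - 1) + 1  # effective kernel size per side
    sparse = [[0] * size for _ in range(size)]
    for r in range(k):
        for c in range(k):
            sparse[r * dilation][c * dilation] = kernel[r][c]
    return sparse

dense = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
sparse = dilate_kernel(dense, dilation=2)
non_zero = sum(1 for row in sparse for v in row if v != 0)
```

With `dilation=2` the 3×3 kernel covers a 5×5 window while `non_zero` stays at nine, so the receptive field grows without adding parameters.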

The idea behind deformable convolutions is the same as behind dilated convolutions; however, here we let the network learn the offset direction and magnitude during training.

Non-local Neural Networks

Computations in a non-local block. i is the index of an output position (in space, time, or spacetime) whose response is to be computed and j is the index that enumerates all possible positions. x is the input signal (image, sequence, video; often their features) and y is the output signal of the same size as x. A pairwise function f computes a scalar (representing a relationship such as affinity) between i and all j. The unary function g computes a representation of the input signal at the position j. The response is normalized by a factor C(x).

Yet another mechanism for taking spatial relations into account is to learn the significance of relationships between locations in the image. In non-local neural networks, a non-local block is injected in the deep layers of the network. In 2018 the model attained state-of-the-art results in static image recognition. Non-local models also improve object detection/segmentation and pose estimation.

The drawback of such an approach is its high computational complexity. This prevents the non-local block from being used in the earlier (shallower) layers of the network, which decreases the precision of the spatial relationships it can capture.
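The non-local computation described in this section, y_i = (1 / C(x)) Σ_j f(x_i, x_j) g(x_j), can be sketched in plain Python. Several choices here are assumptions made for illustration: f is taken to be exp(x_i · x_j), g is the identity rather than a learned embedding, and C(x) is the sum of f over all j. The nested loop over all pairs of positions also shows where the quadratic cost comes from.

```python
import math

def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(p * q for p, q in zip(a, b))

def non_local(x):
    """Apply a toy non-local operation to a list of N feature vectors.

    Assumptions: f(x_i, x_j) = exp(x_i . x_j), g is the identity,
    and C(x) is the sum of f over all positions j.
    """
    y = []
    for xi in x:                                        # every output position i
        weights = [math.exp(dot(xi, xj)) for xj in x]   # f(x_i, x_j) for all j
        norm = sum(weights)                             # C(x)
        yi = [sum(w * xj[d] for w, xj in zip(weights, x)) / norm
              for d in range(len(xi))]
        y.append(yi)
    return y

# Four positions (e.g. a flattened 2 x 2 feature map) with 2 channels each.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
response = non_local(features)
```

Each row of `response` is a weighted average of all input positions, so the output has the same shape as the input, and the work grows with the square of the number of positions.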

Learning complex relations

The convolutions are linear operators on the underlying data patch, and allow for learning only low levels of abstraction. Replacing simple convolutions with a more expressive nonlinear function can enhance the abstraction ability of the local model. Linear operators can achieve a good extent of abstraction when the samples of the latent concepts are linearly separable, i.e. the variants of the concepts all live on one side of the separation plane defined by the convolutional filter.

Conventional CNNs implicitly make the assumption that the latent concepts are linearly separable. However, the data for the same concept often live on a nonlinear manifold, therefore the representations that capture these concepts are generally highly nonlinear functions of the input. In Network In Network, the convolution is replaced with a ‘micro network’ — a nonlinear multilayer perceptron.

Both the linear convolutional layer and the mlpconv layer map the local receptive field to an output feature vector. The mlpconv maps the input local patch to the output feature vector with a multilayer perceptron consisting of multiple fully connected layers with nonlinear activation functions. The multilayer perceptron is shared among all local receptive fields. The feature maps are obtained by sliding the perceptron over the input. Training such an architecture is much slower; however, the results are better.
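A minimal sketch of the sliding perceptron in plain Python, for a single-channel image. The function names, the ReLU activation, the patch size and the hand-picked weights are all assumptions for illustration and are not taken from the Network In Network paper's code. The same list of layers is applied to every patch, which is what "shared among all local receptive fields" means in practice.

```python
def relu(v):
    """Element-wise ReLU on a vector."""
    return [max(0.0, a) for a in v]

def dense(v, weights, bias):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def mlpconv(image, layers, k=3):
    """Slide a small MLP over every k x k patch of a single-channel image.

    layers is a list of (weights, bias) pairs shared by all patches.
    Returns a grid of output feature vectors, one per patch position.
    """
    h, w = len(image), len(image[0])
    out = []
    for r in range(h - k + 1):
        row = []
        for c in range(w - k + 1):
            v = [image[r + dr][c + dc] for dr in range(k) for dc in range(k)]
            for weights, bias in layers:   # the same weights for every patch
                v = relu(dense(v, weights, bias))
            row.append(v)
        out.append(row)
    return out

# Hypothetical weights: a 9 -> 2 layer followed by a 2 -> 1 layer.
layers = [
    ([[1.0] * 9, [-1.0] * 9], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.0]),
]
image = [[1.0] * 4 for _ in range(4)]
feature_map = mlpconv(image, layers)
```

With a single linear layer and no activation this reduces to an ordinary convolution; the extra layers are what add the nonlinearity.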


In this short article I have covered common strategies for boosting CNN performance. I have talked about the vanishing gradient problem, ways of improving convergence, the invariance and equivariance properties, taking spatial relations into account, ways of feeding images of varied size to a CNN, and losses for image translation algorithms that are used to achieve perceptually pleasing results.

This article was intended to be short and many techniques have been omitted. If you think that I have missed something important, please do not hesitate to drop me a message or leave a comment.

Further reading

For further reading I would recommend this review article, and I would also suggest taking a closer look at the original papers proposing the summarized ideas.

P.s. Many thanks to Param Hanji and Aliaksandra Shysheya for their feedback 🙂