Convolution layers

Original article was published by Jehill Parikh on Artificial Intelligence on Medium

Convolution layers are fundamental building blocks of computer vision architectures. Neural networks employing convolution layers find wide-ranging applications in segmentation, reconstruction, scene understanding, synthesis and object detection.

The goal of this post is to provide a summary and overview of advanced convolution layers and techniques as they have emerged in recent literature. We start with the basics of convolution for completeness; more rigorous explanations are available in the references.

Convolution: Mathematically speaking, convolution is an “operation” performed to combine two signals into one. Below is an illustration from Wikipedia which highlights the convolution of two functions/signals f and g: g is reflected and shifted to give g(t−z), multiplied by f(z), and the product is integrated over z to obtain (f*g)(t).

Convolution from wikipedia
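
For sampled signals the integral becomes a sum. The snippet below is a minimal sketch of that discrete case, (f*g)[n] = sum over k of f[k]·g[n−k]; the function name `convolve_1d` and the toy signals are illustrative assumptions, not part of the original post.

```python
def convolve_1d(f, g):
    """Full discrete convolution: (f*g)[n] = sum_k f[k] * g[n - k]."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, f_i in enumerate(f):
        for j, g_j in enumerate(g):
            # f[i] * g[j] contributes to output index n = i + j
            out[i + j] += f_i * g_j
    return out


signal = [1.0, 2.0, 3.0]   # toy signal f
box = [1.0, 1.0]           # toy signal g (a two-sample box filter)
result = convolve_1d(signal, box)
print(result)
```

The output has len(f) + len(g) − 1 samples, and swapping the two arguments gives the same result because convolution is commutative.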

The main convolution operations in deep learning are

  1. “2D” Convolution:

Pictorially, we convolve (“slide”) a kernel (green) over an image (blue) and learn the weights of this kernel. The kernel’s spatial extent (F) is 3 and the number of filters, i.e. the depth of the kernel, is 1, so the number of weights is 3*3*1=9. We can skip pixels by using a “stride” and pad the borders of the original image; here the stride is 1, i.e. the kernel moves one pixel at a time.

Image source
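
The sliding operation can be sketched in a few lines of plain Python. This is a simplified, single-channel illustration under assumed names (`conv2d`, `image`, `kernel`); like most deep learning frameworks it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide a square F x F kernel over a single-channel 2D image."""
    f = len(kernel)
    width = len(image[0]) + 2 * padding
    # Zero-pad the image on all four sides.
    padded = [[0.0] * width for _ in range(padding)]
    padded += [[0.0] * padding + list(row) + [0.0] * padding for row in image]
    padded += [[0.0] * width for _ in range(padding)]

    out_h = (len(padded) - f) // stride + 1
    out_w = (width - f) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the F x F patch whose top-left corner is (i*S, j*S).
            acc = 0.0
            for a in range(f):
                for b in range(f):
                    acc += padded[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(acc)
        out.append(row)
    return out


image = [[float(r * 5 + c) for c in range(5)] for r in range(5)]  # 5x5 toy image
kernel = [[1.0] * 3 for _ in range(3)]                            # F=3, 9 weights
feature_map = conv2d(image, kernel)                               # 3x3 feature map
```

With a stride of 1 and no padding the 5*5 image shrinks to a 3*3 feature map; calling `conv2d(image, kernel, padding=1)` keeps the 5*5 size.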

A convolution block accepts an image of size W1×H1×D1 and a kernel of size (F*F*K).

It requires four hyper-parameters:

  • Number of filters: K
  • Their spatial extent: F
  • The stride: S
  • The amount of zero padding: P (if we zero-pad the image)

Two quantities follow from these:

  • The number of weights is (channels*F*F)*K, plus one bias per filter if biases are used
  • Typically the shape of the weight matrix is (F, F, Channels, K)

Combined, these operations produce a final output feature map of size W2*H2*D2 (with D2 = K); for details of the working see the post.

  • W2 = (W1−F+2P)/S+1
  • H2 = (H1−F+2P)/S+1
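
As a quick check of these formulas, here is a small helper that computes the output shape and parameter counts. The function name `conv_layer_summary` and the example layer sizes are assumptions chosen for illustration; the bias count assumes one bias per filter.

```python
def conv_layer_summary(w1, h1, d1, k, f, s, p):
    """Return ((W2, H2, D2), number of weights, number of biases)."""
    if (w1 - f + 2 * p) % s or (h1 - f + 2 * p) % s:
        raise ValueError("kernel does not fit the padded input evenly")
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    weights = (d1 * f * f) * k   # (channels*F*F)*K
    biases = k                   # assumed: one bias per filter
    return (w2, h2, k), weights, biases


# 32x32 RGB image, ten 3x3 filters, stride 1, padding 1
small = conv_layer_summary(32, 32, 3, k=10, f=3, s=1, p=1)
# 227x227 RGB image, 96 filters of size 11x11, stride 4, no padding
large = conv_layer_summary(227, 227, 3, k=96, f=11, s=4, p=0)
print(small, large)
```

Padding of 1 with F=3 and S=1 preserves the spatial size, which is why this combination is so common.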

In addition, two further operations are employed with a convolution operation:

  • Max-pool operation: this reduces the spatial dimensions of the feature maps. Generally a 2*2 filter is employed in the “max” pool operation. The filter replaces each 2*2 block in the image/feature map with the maximum of that block, which reduces the size of the feature maps for the following layer.
  • Non-linearity (relu, tanh etc.): applied to the output of the convolution, usually just before the max-pool operation (CONV + RELU + POOL). [TODO, future work: add notes on performance of the various non-linearities]
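
A minimal sketch of these two operations on a single feature map, using assumed helper names (`relu`, `max_pool_2x2`) and a made-up 4*4 input. For ReLU specifically the order relative to max-pooling does not change the result, since both operations are monotonic.

```python
def relu(feature_map):
    """Element-wise max(0, x)."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Replace each non-overlapping 2x2 block with its maximum."""
    h, w = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1])
            for j in range(0, w - 1, 2)
        ]
        for i in range(0, h - 1, 2)
    ]


fmap = [
    [1.0, -2.0, 3.0, 0.0],
    [4.0, 5.0, -6.0, 7.0],
    [-1.0, -2.0, -3.0, -4.0],
    [0.5, 0.0, 2.0, 1.0],
]
pooled = max_pool_2x2(relu(fmap))   # 4x4 feature map -> 2x2 feature map
```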

The spatial extent of the filter (F) is the main contributor to the number of weights in each kernel. As discussed in the CS231n lecture notes, larger filters are costly: going from F=3 to F=7 increases the number of weights per kernel from 3*3=9 to 7*7=49 per input channel, a more than fivefold increase. Typically, a full deep net consists of multiple layers of CONV + RELU + POOL operations to extract features. Thus F=3 is generally employed as a trade-off between the ability to learn features at each layer and computation time. This leads to the typical ConvNet architecture consisting of stacked CONV + RELU + POOL operations.

Additional recommended reading: the computational considerations section of the CS231n lecture notes.

Key architectures for object classification tasks are well summarised in the CS231n notes: LeNet, AlexNet, ZFNet, GoogLeNet, VGGNet and ResNet. Their development was mainly driven by the ImageNet challenge over the years. The trend has been towards larger and deeper networks to improve performance.

Image from Resnet article

Residual/skipped connection

These were introduced in 2015 by a Microsoft team to maintain accuracy in deeper networks as part of the ImageNet challenge. A single skipped connection is shown below; the block aims to learn the residual R(x), whereas a standard network block tries to learn H(x) directly. In deeper networks we keep learning “residual information” at each layer. This has been shown experimentally to increase accuracy in deeper networks via incremental learning (loosely speaking), hence the name residual connection. Implementing a skipped connection is very straightforward, as shown below. Skipping the connection also helps to overcome the issue of vanishing gradients in deep layers and speeds up training. This experimental result has been widely adopted across a range of computer vision algorithms since its introduction. Variants of ResNets are highlighted in this post; for additional detail and illustrations please see the blog.

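R(x) = Output − Input = H(x) − x

from tensorflow.keras import layers

def res_net_block(input_data, filters, conv_size):
    # padding='same' (an assumption) keeps the spatial size equal to input_data's, as Add requires
    x = layers.Conv2D(filters, conv_size, activation='relu', padding='same')(input_data)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(filters, conv_size, activation=None, padding='same')(x)
    x = layers.BatchNormalization()(x)
    # Skip connection: add the block input to the learned residual
    x = layers.Add()([x, input_data])
    x = layers.Activation('relu')(x)
    return x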
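
To see why learning the residual is convenient, the framework-free sketch below replaces the convolution layers with an arbitrary function. All names here (`residual_block`, `zero_residual`) are hypothetical stand-ins, not part of Keras. If the inner layers output zero, the block simply passes its (non-negative) input through, so stacking more blocks cannot make the representation worse at initialisation.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def residual_block(x, residual_fn):
    """Output = relu(R(x) + x): the inner layers only model the residual R(x)."""
    r = residual_fn(x)
    return relu([r_i + x_i for r_i, x_i in zip(r, x)])


def zero_residual(x):
    # Hypothetical stand-in for conv layers whose weights are all zero.
    return [0.0 for _ in x]


x = [0.5, 1.0, 2.0]
y = x
for _ in range(10):          # ten stacked blocks
    y = residual_block(y, zero_residual)
```

Here `y` equals `x` after all ten blocks, whereas a plain stack of zero-weight layers would output zeros.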

The layers above are critical building blocks in computer vision, and find application in a wide range of domains and architectures. Now we turn to more specialised layers.

Convolution transpose

A convolution operation with stride ≥ 2 and/or without padding reduces the dimensions of the resulting feature map. Convolution transpose is the reverse process, employed to learn kernels that upsample feature maps to larger dimensions. A stride of 2 is typically used to upsample the image; this is well illustrated in a post by Thom Lane and below, where a 2*2 input, with padding and stride, convolved with a 3*3 kernel leads to a 5*5 feature map.

Image credit: Thom Lane’s blog post
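
The 2*2 to 5*5 example can be reproduced with a short sketch. It uses the equivalent "stamping" view of a transposed convolution, in which every input value adds a scaled copy of the kernel to the output; the function name `conv_transpose2d` and the all-ones kernel are illustrative assumptions, and no output padding or cropping is applied.

```python
def conv_transpose2d(x, kernel, stride=2):
    """Transposed convolution of a single-channel input with an F x F kernel."""
    n_h, n_w, f = len(x), len(x[0]), len(kernel)
    out_h = (n_h - 1) * stride + f
    out_w = (n_w - 1) * stride + f
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(n_h):
        for j in range(n_w):
            # Stamp a copy of the kernel, scaled by x[i][j], at offset (i*S, j*S).
            for a in range(f):
                for b in range(f):
                    out[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    return out


x = [[1.0, 2.0], [3.0, 4.0]]              # 2x2 input
kernel = [[1.0] * 3 for _ in range(3)]    # 3x3 kernel (would be learned in practice)
up = conv_transpose2d(x, kernel, stride=2)
print(len(up), len(up[0]))
```

The output size follows (N−1)*S + F, so a 2*2 input with S=2 and F=3 gives 5*5. Overlapping stamps are summed, which is why the centre value receives contributions from all four inputs.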

Transposed (strided) convolutions find wide-ranging application areas:

  1. U-nets (medical image segmentation)
  2. Generative models: GAN: Generators
  3. VAE: Decoder, up-sampling layers

Implementation: All major DL and ML frameworks provide convolution transpose layers with proper initialisation, e.g. random or Xavier initialisation.

Masked convolution

Masked and gated convolutions started gaining popularity around 2016. In their seminal work, Aaron van den Oord et al. introduced Pixel RNN and Pixel-CNN. These are autoregressive approaches that sample each pixel from a probability distribution conditioned on the previous pixels.

Reference Pixel RNN

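Since each pixel is generated conditioned on the previous pixels, masks are employed while applying convolution operations to ensure conditioning only on pixels to the left and above. There are two types of mask. Mask A is used in the first layer only and restricts connections to the prior pixels (and to the colour channels of the current pixel that have already been predicted). Mask B is used in all following layers and additionally allows the connection from a channel to itself. Both are available here.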

Image from Pixel RNN 2016
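
A minimal sketch of how the two masks can be built for a single-channel kernel; it ignores the ordering of the R, G and B channels, and the function name `make_mask` is an assumption for illustration. The mask is multiplied element-wise with the kernel weights before each convolution.

```python
def make_mask(kernel_size, mask_type):
    """Build a PixelCNN-style mask (1 = weight kept, 0 = weight zeroed)."""
    if mask_type not in ("A", "B"):
        raise ValueError("mask_type must be 'A' or 'B'")
    centre = kernel_size // 2
    mask = [[0] * kernel_size for _ in range(kernel_size)]
    for r in range(kernel_size):
        for c in range(kernel_size):
            above = r < centre                         # rows above the current pixel
            left = r == centre and c < centre          # same row, to the left
            is_centre = r == centre and c == centre    # the current pixel itself
            if above or left or (is_centre and mask_type == "B"):
                mask[r][c] = 1
    return mask


mask_a = make_mask(3, "A")   # first layer: current pixel is hidden
mask_b = make_mask(3, "B")   # later layers: current position is visible
```

The only difference between the two is the centre entry: mask A must hide it because the first layer sees the raw image, while later layers only see features that already exclude the current pixel.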

Notes: Inputs must be discretised: e.g. 256 levels (8 bits) per colour channel, i.e. each sample takes an integer value in 0–255 in an RGB image.

Masked gated convolutions were introduced to avoid blind spots in masked convolutions. van den Oord et al. proposed to isolate horizontal and vertical stacks, i.e. gates, along with 1*1 convolutions, with residual connections in the horizontal stack, as shown below. Residual connections in the vertical stack didn’t offer additional performance improvement and were therefore not implemented.

Gated convolution block introduced by van den Oord et al. 2016

Implementation of masked gated convolutions is available here.
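
The gate itself is a simple element-wise operation. The sketch below shows only the gated activation unit, tanh(a) ⊙ sigmoid(b), applied to the two halves of a feature vector; the masked convolutions that would produce that vector are omitted, and the name `gated_activation` is an assumption.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_activation(features):
    """Split 2p features into halves (a, b) and return tanh(a) * sigmoid(b)."""
    p = len(features) // 2
    a, b = features[:p], features[p:]
    return [math.tanh(a_i) * sigmoid(b_i) for a_i, b_i in zip(a, b)]


# First half: candidate values. Second half: gate pre-activations
# (open, half-open and closed gate respectively).
out = gated_activation([0.5, -1.0, 2.0, 50.0, 0.0, -50.0])
```

A strongly positive gate passes tanh(a) almost unchanged, a gate at zero halves it, and a strongly negative gate suppresses it.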

Application areas

  1. Pixel CNN decoder: each pixel is conditioned on previous values, leading to better performance. Inference/sampling takes longer as it needs to be done on a pixel-by-pixel basis. These were optimised in PixelCNN++, and an implementation is available in TensorFlow Probability.
  2. VAE: Pixel-VAE: in a standard VAE each output dimension is a diagonal element of the covariance matrix, and this Gaussian assumption leads to poor sample images. Pixel-VAE therefore combines a traditional decoder with PixelCNN to model the “small scale/similar” aspects of the distribution, which puts less demand on the latent variables and lets them learn more global information. This was demonstrated via improvements to the KL term, leading to improved performance. For an implementation see here.
  3. VQ-VAE and VQ-VAE2: use Pixel-CNN on the latent code map of the VAE to avoid the Gaussian assumption on the latent variables altogether, leading to better images in higher dimensions. The VQ-VAE implementation was open sourced by the authors, and other implementations are also widely available.


Masked convolutions are mainly employed as decoders, e.g. in VAE frameworks for prior sampling, to avoid issues with training GANs such as mode collapse and to generate high-resolution images.

Invertible convolutions

Invertible convolutions, based on normalising flows, are currently applied in generative models to learn the underlying probability distribution p(x). The main motivation is to provide a better loss function, i.e. the negative log-likelihood.

The two most common generative modelling frameworks suffer from approximate inference issues. The loss function for a VAE (the evidence lower bound, i.e. ELBO) is only a lower bound on the log-likelihood, so inference/reconstruction is an approximation. The adversarial loss employed in GANs is a “search”-based approach; GANs suffer from issues with sampling diversity and are hard to train, e.g. mode collapse.

The mathematical preliminaries of normalising flows are best outlined in the Stanford class notes here; we use their summary.

In a normalising flow, the mapping between random variables Z and X is given by a function f, parameterised by θ, which is deterministic and invertible such that

X = f_θ(Z) and Z = f_θ^{-1}(X)

The probability distributions of X and Z can then be related using the change of variables formula:

p_X(x; θ) = p_Z(f_θ^{-1}(x)) · |det(∂f_θ^{-1}(x)/∂x)|


  1. “Normalising” means that the change of variables gives a normalised density after applying an invertible transformation
  2. “Flow” means that the invertible transformations can be composed with each other to create more complex invertible transformations.

Note: f can be a single function or a series of sequential functions, generally transforming from a simple distribution, e.g. latent vectors (Z), to a complex distribution, e.g. images (X). This formulation allows us to transform “exactly” between the two distributions, and thus to derive the negative log-likelihood loss function; see the lecture for the derivation.
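
The change of variables can be checked numerically on the simplest possible flow. The sketch below assumes a one-dimensional affine map x = scale·z + shift with z drawn from a standard normal (all names are illustrative); the exact log-likelihood from the flow should match the known density of a Gaussian with mean `shift` and standard deviation `|scale|`.

```python
import math


def standard_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def flow_log_likelihood(x, scale, shift):
    """log p_X(x) for the flow x = scale * z + shift, z ~ N(0, 1)."""
    z = (x - shift) / scale                  # inverse transform f^{-1}(x)
    log_abs_det = -math.log(abs(scale))      # log |d f^{-1}(x) / dx|
    return standard_normal_logpdf(z) + log_abs_det


def gaussian_logpdf(x, mean, std):
    """Reference density of N(mean, std^2), used only to check the flow."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))


via_flow = flow_log_likelihood(1.3, scale=2.0, shift=0.5)
reference = gaussian_logpdf(1.3, mean=0.5, std=2.0)
nll = -via_flow   # the negative log-likelihood used as a training loss
```

When several invertible maps are composed, their log-determinant terms simply add, which is what makes deep flows trainable with an exact likelihood.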

Normalising flows and neural networks: the works of Dinh et al., 2014 (NICE) and Dinh et al., 2017 (Real-NVP) started providing neural network architectures that employ normalising flows for density estimation. Glow, from Kingma et al., is the current (2018) state of the art; it builds on these works and introduced the 1*1 invertible convolution to synthesise high-resolution images.

Glow architecture (ref)
Glow implementation of the invertible convolution

The key novelty was to reduce the computational cost of the determinant term for the weight matrix of the learned 1*1 invertible convolution. This was achieved with an LU decomposition with permutation, i.e. a PLU decomposition of the weight matrix. Random permutations were employed to maintain “flow” at each iteration. The mathematical details are covered in section 3.2 of the paper; the authors also provide an implementation using numpy and tensorflow for easier interrogation.
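
The saving can be illustrated with NumPy. This is a sketch under assumptions rather than the authors' code: the weight matrix is built directly as W = P·L·(U + diag(s)) with a fixed permutation P, unit lower-triangular L, strictly upper-triangular U and a diagonal vector s, and the tensor sizes are made up. The log-determinant of the 1*1 convolution over an h×w×c tensor is then h·w·sum(log|s|), with no determinant computation required.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 4          # number of channels (illustrative)
h, w = 8, 8    # spatial size of the feature map (illustrative)

# PLU parameterisation: P stays fixed, while L, U and s would be learned.
P = np.eye(c)[rng.permutation(c)]
L = np.tril(rng.normal(size=(c, c)), k=-1) + np.eye(c)   # unit lower-triangular
U = np.triu(rng.normal(size=(c, c)), k=1)                # strictly upper-triangular
s = rng.uniform(0.5, 2.0, size=c) * rng.choice([-1.0, 1.0], size=c)

W = P @ L @ (U + np.diag(s))

# Cheap log-determinant from s alone versus the full O(c^3) computation.
cheap_logdet = h * w * np.sum(np.log(np.abs(s)))
_, logabsdet = np.linalg.slogdet(W)
full_logdet = h * w * logabsdet

# A 1x1 convolution is the same c x c matrix applied at every pixel.
x = rng.normal(size=(h, w, c))
y = x @ W.T
x_rec = y @ np.linalg.inv(W).T   # inverting the layer recovers the input
```

The two log-determinants agree because det(P) is ±1 and det(L) is 1, so only the diagonal s contributes.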

These were further generalised to N*N convolutions by Hoogeboom et al.; please see the blog post for additional details and implementations. Our aim was just to highlight these models; for more comprehensive details please see the references: CS236 lectures 7 and 8, the Glow paper, and the blog post by Lilian Weng.

Application area: Image synthesis and generative modelling