Talented Mr. 1X1: Comprehensive look at 1X1 Convolution in Deep Learning

Source: Deep Learning on Medium

Image adopted from this Link

With the startling success of AlexNet in 2012, the Convolutional Neural Network (CNN) revolution began! CNN-based architectures such as GoogLeNet and ResNet, along with several variations of these, have shown spectacular results in object detection and semantic segmentation in computer vision.

When you look at most of the successful modern CNN architectures, like GoogLeNet, ResNet and SqueezeNet, you will come across the 1X1 Convolution layer playing a major role. At first glance, it seems pointless to convolve the input image with a single number (after all, wider filters like 3X3 and 5X5 work on a patch of the image as opposed to a single pixel). However, 1X1 convolution has proven to be an extremely useful tool and, employed correctly, is instrumental in creating wonderfully deep architectures.

In this article we will take a detailed look at 1X1 Convolutions.

First, a quick recap of Convolutions in Deep Learning. There are many good blogs and articles that intuitively explain what convolutions are and the different types of convolutions (a few of them are listed in the references). While we will not delve deep into convolutions in this article, understanding a couple of key points will make it easier to see what 1X1 convolution is doing and, most importantly, how and why it is doing it.

Quick Recap: Convolution in Deep Learning

As mentioned, this article will not provide a complete treatment of the theory and practice of Convolution. However, we will recap the key principles of Convolution in deep learning. This will come in handy when we examine 1X1 Convolution in depth.

Simply put, a convolution is an element-wise multiplication of the input and the kernel/filter elements, followed by a summation of the products. Here are the points to remember:

1. The input matrix can have, and in most cases will have, more than one channel. The number of channels is sometimes referred to as depth.

a. Example: a 64X64 pixel RGB image has 3 channels, so the input is 64X64X3

2. The filter has the same depth as the input, except in some special cases (for example, 3D Convolutions used to reconstruct medical images). This specific point, for some unknown reason, is not explicitly mentioned in much of the literature, which causes some misunderstanding (especially for someone new to convolutions, deep learning, etc.)

a. Example: a 3X3 filter applied to this input will have 3 channels as well, hence the filter should be written as 3X3X3

3. Third, and critically, the output of the Convolution step has a depth equal to the number of filters we choose.

a. Example: convolving the 3D input (64X64X3) with the filter we chose (3X3X3) gives an output with a depth of 1 (because we have only one filter)

The Convolution step on the 3D input 64X64X3 with filter size of 3X3X3 will have the filter ‘sliding’ along the width and height of the input.

Image is adopted from this link

So, when we convolve the 3D filter with the 3D image, the operation moves the filter on the input in 2 directions (Along the width and height) and we do the element wise multiplication and addition at each position to end up with an output with a depth of 1.
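The three points above can be checked with a few lines of NumPy. This is a minimal, deliberately slow sketch rather than how any framework implements convolution: the function name conv2d_valid, the random inputs and the choice of 8 filters in the second call are assumptions made for illustration. It uses no padding, so the 64X64 input shrinks to 62X62; with suitable padding the height and width would stay at 64X64.

```python
import numpy as np

def conv2d_valid(image, filters):
    """Naive stride-1 convolution without padding (hypothetical helper).

    image:   (H, W, C) input
    filters: (N, k, k, C) bank of N filters, each as deep as the input
    returns: (H - k + 1, W - k + 1, N), one output channel per filter
    """
    H, W, C = image.shape
    N, k, _, filter_depth = filters.shape
    assert C == filter_depth, "filter depth must match input depth"
    out = np.zeros((H - k + 1, W - k + 1, N))
    for i in range(H - k + 1):          # slide along the height
        for j in range(W - k + 1):      # slide along the width
            patch = image[i:i + k, j:j + k, :]   # (k, k, C)
            # Element-wise multiply the patch with every filter,
            # then sum over the k, k and C axes of each filter.
            out[i, j, :] = np.sum(patch * filters, axis=(1, 2, 3))
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((64, 64, 3))         # 64X64 RGB input
one_filter = rng.standard_normal((1, 3, 3, 3))   # a single 3X3X3 filter
eight_filters = rng.standard_normal((8, 3, 3, 3))

out_one = conv2d_valid(image, one_filter)
out_eight = conv2d_valid(image, eight_filters)
print(out_one.shape)    # (62, 62, 1)
print(out_eight.shape)  # (62, 62, 8)
```

Note that the filter only moves in two directions; the depth axis is consumed entirely by the summation, which is why one filter always yields one output channel.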

Image is adopted from this Link

Armed with this, we are ready to dive into the 1X1 convolution.

1X1 Convolution — What is it?

First introduced by Min Lin et al. in their paper Network In Network, the 1X1 Convolution layer was used for ‘Cross Channel Down-sampling’, or Cross Channel Pooling. In other words, 1X1 Conv was used to reduce the number of channels while introducing non-linearity.

1X1 Convolution simply means the filter is of size 1X1 (yes, that means a single number per input channel, as opposed to a matrix like, say, a 3X3 filter). This 1X1 filter will convolve over the ENTIRE input image pixel by pixel.

Staying with our example input of 64X64X3, if we choose a 1X1 filter (which would be 1X1X3), then the output will have the same Height and Width as the input but only one channel: 64X64X1.

Now consider inputs with a large number of channels, 192 for example. If we want to reduce the depth but keep the Height X Width of the feature maps the same, then we can choose fewer than 192 1X1 filters (remember, Number of filters = Output Channels) to achieve this effect. This cross-channel down-sampling is called ‘Dimensionality reduction’.
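One way to see the cross-channel effect is to write a 1X1 convolution as a matrix multiplication applied at every pixel. The sketch below is a minimal NumPy illustration, not code from any of the papers: the function name conv1x1, the 28X28 spatial size, the choice of 32 filters and the ReLU activation are all assumptions made for the example.

```python
import numpy as np

def conv1x1(feature_maps, weights):
    """1X1 convolution followed by ReLU (the ReLU is an assumption of this sketch).

    feature_maps: (H, W, C_in)
    weights:      (C_in, C_out), one column per 1X1 filter
    returns:      (H, W, C_out)
    """
    # Every pixel's C_in-long vector is multiplied by the same weight matrix,
    # so channels are mixed but neighbouring pixels never interact.
    mixed = feature_maps @ weights
    return np.maximum(mixed, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28, 192))   # 192 input channels
w = rng.standard_normal((192, 32))       # 32 filters, each 1X1X192

y = conv1x1(x, w)
print(x.shape, "->", y.shape)  # (28, 28, 192) -> (28, 28, 32)
```

Height and width are untouched and only the depth changes, from 192 to the number of filters. Picking more filters than input channels would increase the depth instead.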

Image is adopted from this Link

Now why would we want to do something like that? For that, we delve into the uses of 1X1 Convolution.

Usage 1: Dimensionality Reduction/Augmentation

The winner of ILSVRC (ImageNet Large Scale Visual Recognition Competition) 2014, GoogLeNet, used 1X1 convolution layers for dimension reduction “to compute reductions before the expensive 3×3 and 5×5 convolutions”.

Let us look at an example to understand how reducing dimension reduces the computational load. Suppose we need to convolve 28 X 28 X 192 input feature maps with 32 filters of size 5 X 5 (each one 5 X 5 X 192), padded so that the output stays 28 X 28 X 32. This takes 28 X 28 X 32 X 5 X 5 X 192 = 120.422 Million multiplications.

Let us do the same math with the same input feature maps, but with a 1X1 Conv layer before the 5 X 5 Conv layer.

By adding a 1X1 Conv layer before the 5X5 Conv, while keeping the height and width of the feature map, we reduce the number of operations by roughly a factor of 10. This lowers the computational cost and makes the network more efficient.
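The arithmetic behind that factor can be reproduced in a few lines. The article does not state how many 1X1 filters the reduction layer uses, so the 16 below is an assumption (it is the value used in the Andrew Ng lecture listed in the references); the helper name conv_multiplications is also hypothetical. The sketch counts multiplications only and assumes padding keeps the output at 28 X 28.

```python
def conv_multiplications(height, width, in_channels, kernel_size, num_filters):
    """Multiplications for a stride-1 convolution whose output keeps height x width."""
    per_output_value = kernel_size * kernel_size * in_channels
    return height * width * num_filters * per_output_value

# Direct 5X5 convolution: 28X28X192 -> 28X28X32
direct = conv_multiplications(28, 28, 192, kernel_size=5, num_filters=32)

# Assumed reduction: 16 filters of 1X1 first, then the 5X5 convolution
reduce_1x1 = conv_multiplications(28, 28, 192, kernel_size=1, num_filters=16)
conv_5x5 = conv_multiplications(28, 28, 16, kernel_size=5, num_filters=32)
with_reduction = reduce_1x1 + conv_5x5

print(f"{direct:,}")                       # 120,422,400
print(f"{with_reduction:,}")               # 12,443,648
print(round(direct / with_reduction, 1))   # 9.7
```

Most of the saving comes from the 5X5 filters becoming 16 channels deep instead of 192; the 1X1 layer itself costs only about 2.4 Million multiplications.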

The GoogLeNet paper describes the module as the “Inception Module” (get it? DiCaprio’s “We need to go DEEPER” in the movie Inception).

Usage 2: Building DEEPER Network (“Bottle-Neck” Layer)

The 2015 ILSVRC Classification winner, ResNet, had the lowest error rate and swept aside the competition with a very deep network built using ‘Residual connections’ and ‘Bottle-neck Layers’.

In their paper, He et al. explain (page 6) how a bottleneck layer is designed using a sequence of 3 convolutional layers with filters of size 1X1, 3X3 and 1X1 respectively, to reduce and then restore dimensions. The down-sampling of the input channels happens in the first 1X1 layer, funneling smaller feature vectors (and a reduced number of parameters) to the 3X3 conv. Immediately after that, the second 1X1 layer restores the dimensions to match the input dimension so identity shortcuts can be used directly. For details on identity shortcuts and skip connections, please see some of the reviews of ResNet (or you can wait for my future work!).
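To get a feel for how much the bottleneck saves, here is a rough weight count. The channel widths (256 in and out, 64 inside the bottleneck) are chosen for illustration, and the comparison against two full-width 3X3 layers is this sketch's own baseline, not a claim from the paper. Biases and normalization parameters are ignored, and the helper names are hypothetical.

```python
def conv_weights(in_channels, out_channels, kernel_size):
    """Number of weights in one convolution layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels

def bottleneck_weights(channels, reduced):
    """1X1 reduce -> 3X3 at the reduced width -> 1X1 restore."""
    return (conv_weights(channels, reduced, 1)
            + conv_weights(reduced, reduced, 3)
            + conv_weights(reduced, channels, 1))

def full_width_weights(channels):
    """Baseline for comparison: two 3X3 layers at the full channel width."""
    return 2 * conv_weights(channels, channels, 3)

bottleneck = bottleneck_weights(256, 64)
baseline = full_width_weights(256)
print(bottleneck)  # 69632
print(baseline)    # 1179648
```

Because the output again has 256 channels, it can be added directly to the block's input through the identity shortcut.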

Image adopted from this Link

Usage 3: Smaller yet Accurate Model (“FIRE-MODULE” Layer)

While Deep CNN Models have great accuracy, they have a staggering number of parameters, which increases the training time and, most importantly, demands enterprise-level computing power. Iandola et al. proposed a CNN Model called SqueezeNet that retains AlexNet-level accuracy with 50X fewer parameters.

Smaller models have a number of advantages, especially in use-cases that require edge computing capabilities, like autonomous driving. Iandola et al. achieved this by stacking a bunch of “Fire Modules”, which are built as follows:

1. A Squeeze Layer, which has only 1X1 Conv filters

2. This feeds an Expand Layer, which has a mix of 1X1 and 3X3 filters

3. The number of filters in the Squeeze Layer is set to be less than the number of 1X1 filters + the number of 3X3 filters in the Expand Layer

By now it is obvious what the 1X1 Conv filters in the Squeeze Layer do: they reduce the number of parameters by ‘down-sampling’ the input channels before they are fed into the Expand Layer.

The Expand Layer has a mix of 1X1 and 3X3 filters. The 1X1 filters, as you know, perform cross channel pooling: they combine channels, but cannot detect spatial structures (by virtue of working on individual pixels as opposed to a patch of input like larger filters). The 3X3 Convolution detects spatial structures. By combining these 2 different sized filters, the model becomes more expressive while operating on fewer parameters. Appropriate use of padding makes the outputs of the 1X1 and 3X3 convolutions the same size so they can be concatenated along the channel dimension.
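The data flow through one Fire Module can be sketched in NumPy as follows. This is a minimal forward pass under stated assumptions, not the SqueezeNet implementation: the helper names, the tiny 8X8 input, the filter counts (16 squeeze, 64 + 64 expand), the random weights and the ReLU after each convolution are all choices made for illustration.

```python
import numpy as np

def conv2d_same(x, filters):
    """Stride-1 convolution with zero padding so the output keeps H x W.

    x:       (H, W, C_in)
    filters: (k, k, C_in, C_out) with odd k
    """
    k = filters.shape[0]
    pad = k // 2                      # 0 for 1X1, 1 for 3X3
    H, W, _ = x.shape
    padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((H, W, filters.shape[-1]))
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + k, j:j + k, :]            # (k, k, C_in)
            # Contract the k, k and C_in axes against every filter at once.
            out[i, j] = np.tensordot(patch, filters, axes=3)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def fire_module(x, squeeze_w, expand1_w, expand3_w):
    squeezed = relu(conv2d_same(x, squeeze_w))         # 1X1: fewer channels
    e1 = relu(conv2d_same(squeezed, expand1_w))        # 1X1: mixes channels
    e3 = relu(conv2d_same(squeezed, expand3_w))        # 3X3: sees neighbours
    return np.concatenate([e1, e3], axis=-1)           # join along channels

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 96))
squeeze_w = rng.standard_normal((1, 1, 96, 16))
expand1_w = rng.standard_normal((1, 1, 16, 64))
expand3_w = rng.standard_normal((3, 3, 16, 64))

out = fire_module(x, squeeze_w, expand1_w, expand3_w)
print(out.shape)  # (8, 8, 128)
```

The 3X3 filters here are only 16 channels deep because of the squeeze step, which is where the parameter saving comes from. The padding of 1 on the 3X3 branch is what lets its output line up with the 1X1 branch for concatenation.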

Conclusion

In this article we reviewed the Convolution mechanism at a high level and threw ourselves into the deep end with 1X1 Convolution to understand its underpinnings, where it is effectively used and to what end.

To recap, 1X1 Convolution is effectively used for

1. Dimensionality Reduction/Augmentation

2. Reducing computational load by reducing the number of parameters

3. Adding non-linearity to the network

4. Creating deeper networks through the “Bottle-Neck” layer

5. Creating smaller CNNs which retain a high degree of accuracy

References

1. Andrew Ng’s Video on 1X1 Convolution

https://www.coursera.org/lecture/convolutional-neural-networks/networks-in-networks-and-1×1-convolutions-ZTb8x

2. Comprehensive Introduction to Different Types of Convolution in Deep Learning

3. Neural Network Architectures

4. Network in Network, Min Lin et al.

5. Going Deeper with Convolutions, Christian Szegedy et al.

6. Deep Residual Learning for Image Recognition, Kaiming He et al.

7. SqueezeNet, Forrest Iandola et al.

8. CNN Architecture, Lecture 9 (Stanford), Fei-Fei Li et al.

9. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, Song Han et al.