Original article was published by Divyanshu Mishra on Deep Learning on Medium
Deformable Convolutions Demystified
Deformable Convolutions are gaining popularity and are being applied in sophisticated computer vision tasks such as object detection. In this post, I will explain them in detail and shed light on their importance in future computer vision applications.
This post assumes a basic understanding of Convolutional Neural Networks. If you are unfamiliar with the topic, you can refer to this link, and if you want to know more about the convolution operation, which is derived from basic image processing, you can read this blogpost as well.
Convolutional Neural Networks, or CNNs for short, are one of the main causes of the revival of artificial intelligence research after a very long AI winter. Applications based on them were the first to showcase the power of artificial intelligence, or deep learning to be precise, and they revived faith in the field, which had been lost after Marvin Minsky pointed out that the perceptron only worked on linearly separable data and failed on even the simplest non-linear functions such as XOR.
Convolutional Neural Networks are very popular in the domain of computer vision, and almost all state-of-the-art applications, such as Google Images and self-driving cars, are based on them. At a very high level, they are a kind of neural network that focuses on local spatial information and uses weight sharing to extract features in a hierarchical manner; these features are finally aggregated in some task-specific manner to give the task-specific output.
Though CNNs are excellent for visual recognition tasks, they are very limited when it comes to modelling geometric variations or geometric transformations in object scale, pose, viewpoint and part deformation.
Geometric transformations are basic transformations that map the positions and orientation of an image to another position and orientation. Some basic geometric transformations are scaling, rotation and translation.
Convolutional Neural Networks lack an internal mechanism to model geometric variations; they can only approximate them through data augmentation, which is fixed and limited by the user's knowledge. As a result, a CNN cannot learn geometric transformations unknown to the user.
To overcome this problem and increase the capabilities of CNNs, Deformable Convolutions were introduced by Microsoft Research Asia. In their work, they introduced a simple, efficient, end-to-end mechanism that makes a CNN capable of learning various geometric transformations from the given data.
Why are Convolutional Neural Networks unable to model geometric transformations?
The inability of CNNs to model geometric transformations arises from the fixed structure of the kernel used to sample the feature map. A CNN kernel uses a fixed rectangular window (Figure 1) to sample the input feature map at fixed locations, and the pooling layer uses the same rectangular-shaped kernel (Figure 2) to reduce the spatial resolution at a fixed ratio. This introduces various problems: for example, all the activation units in a given CNN layer have the same receptive field, even though objects of different scales may be present at different spatial positions. Adapting to the scale of the object and having different receptive field sizes for different objects is desirable for visual recognition tasks requiring fine localization, such as object detection and segmentation.
In deformable convolutions, in order to account for the scale of different objects and give each object a receptive field matching its scale, 2D offsets are added to the regular grid sampling locations of the standard convolution operation, thereby deforming the fixed receptive field of the preceding activation units. The offsets are learned from the preceding feature maps using additional convolutional layers, so the deformation applied depends on the input features in a local, dense and adaptive manner.
The added deformable convolutional layers introduce very few extra parameters and little extra computation to the existing model, and the whole network can still be trained end-to-end using normal back-propagation.
To explain deformable convolutions in detail, I will first discuss the normal convolution operation and then explain the simple idea that converts it into a deformable convolution.
The normal convolution operation consists of two basic steps:
- Sampling a small region of the input image or feature map using a rectangular kernel.
- Multiplying sampled values by the weights of the rectangular kernel and then summing them across the kernel to give a single scalar value.
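The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than an efficient implementation, and `conv2d` is just a name I am using for the sketch (no padding, stride 1):

```python
import numpy as np

def conv2d(x, w):
    """Naive 2D convolution: slide the rectangular kernel over the input,
    multiply the sampled values by the kernel weights, and sum."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Step 1: sample a small region with the rectangular kernel window
            region = x[i:i + kh, j:j + kw]
            # Step 2: weight the samples and sum them to a single scalar
            y[i, j] = np.sum(region * w)
    return y

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((3, 3)) / 9.0   # simple 3x3 averaging kernel
print(conv2d(x, w))         # 2x2 output feature map
```

With the averaging kernel, each output value is simply the mean of the 3×3 patch it was sampled from.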
I will explain the above two concepts both in the form of equations and visually.
Let us first try to understand using mathematical equations.
Let R be a 3×3 kernel used to sample a small region of the input feature map.
Then the normal 2D convolution operation is given by the equation shown in the figure below, y(p₀) = Σ_{pₙ∈R} w(pₙ)·x(p₀+pₙ), where w is the weights of the kernel, x is the input feature map, y is the output of the convolution operation, p₀ is the starting position of each kernel window and pₙ enumerates the positions in R.
The equation denotes the convolution operation: each sampled value is first multiplied by the corresponding weight, the products are summed to give a scalar output, and repeating the same operation over the entire image gives the new feature map.
The operation explained above is visually depicted below, where the green kernel is slid over the image depicted by the blue matrix; the corresponding weight values are multiplied with the sampled values from the image and then summed to give the final output for a given position in the output feature map.
Instead of using a simple fixed sampling grid, the deformable convolution introduces 2D offsets into the normal convolution operation depicted above.
If R is the regular grid, then the deformable convolution operation augments the grid with learned offsets, thereby deforming the sampling positions of the grid.
The deformable convolution operation is depicted by the equation below, y(p₀) = Σ_{pₙ∈R} w(pₙ)·x(p₀+pₙ+Δpₙ), where Δpₙ denotes the learned offset added to each sampling position of the normal convolution operation.
Now, since the sampling is done at irregular, offset locations and Δpₙ is generally fractional, bilinear interpolation is used to implement the above equation. Bilinear interpolation is needed because adding offsets to the existing sampling positions yields fractional points that are not defined locations on the grid; to estimate the pixel value at such a deformed position, we interpolate from the 2×2 grid of neighbouring pixel values.
The equation used to perform bilinear interpolation and estimate the pixel value at the fractional position is given below, x(p) = Σ_q G(q, p)·x(q), where p = p₀+pₙ+Δpₙ is the deformed position, q enumerates all the valid positions on the input feature map and G(·,·) is the bilinear interpolation kernel.
Note: G(·,·) is two-dimensional but separable, and can be broken down axis by axis into two one-dimensional kernels as shown below: G(q, p) = g(qₓ, pₓ)·g(q_y, p_y), where g(a, b) = max(0, 1 − |a − b|), so G is non-zero only for the four integer neighbours of p.
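The interpolation can be sketched directly from the separable form of G. This is a minimal NumPy illustration (the names `g` and `bilinear` are my own), looping over the 2×2 grid of integer neighbours:

```python
import numpy as np

def g(a, b):
    # One-dimensional interpolation kernel: non-zero only when
    # a is one of the two integer neighbours of the coordinate b.
    return max(0.0, 1.0 - abs(a - b))

def bilinear(x, px, py):
    """Estimate the value of feature map x at fractional position (px, py)
    using the 2x2 grid of neighbouring integer positions."""
    x0, y0 = int(np.floor(px)), int(np.floor(py))
    val = 0.0
    # q enumerates the (up to four) valid integer neighbours of (px, py);
    # G(q, p) = g(qx, px) * g(qy, py)
    for qx in (x0, x0 + 1):
        for qy in (y0, y0 + 1):
            if 0 <= qx < x.shape[0] and 0 <= qy < x.shape[1]:
                val += g(qx, px) * g(qy, py) * x[qx, qy]
    return val

x = np.array([[0.0, 1.0],
              [2.0, 3.0]])
print(bilinear(x, 0.5, 0.5))  # centre of the 2x2 grid -> 1.5
```

At integer positions the kernel weights collapse to 1 for the exact pixel and 0 elsewhere, so the interpolation reduces to ordinary sampling.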
Visually, the deformable convolution is implemented as shown in the figure below.
As shown in Figure 5, the offsets are obtained by applying a convolutional layer over the input feature map. The convolution kernel used has the same spatial resolution and dilation as those of the current convolutional layer. The output offset field has the same resolution as the input feature map and has 2N channels, where the 2N channels encode N 2D offsets, one (x, y) pair for each of the N kernel sampling positions (e.g. N = 9 for a 3×3 kernel).
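Putting the pieces together, the sampling performed at one output position can be sketched as follows. This is a hypothetical NumPy illustration (the names `bilinear` and `deform_conv_at` are my own); the offsets are passed in directly, whereas in a real network they would come from the 2N-channel offset field predicted by the extra convolutional layer:

```python
import numpy as np

def bilinear(x, px, py):
    # Bilinear interpolation of feature map x at fractional position (px, py).
    x0, y0 = int(np.floor(px)), int(np.floor(py))
    val = 0.0
    for qx in (x0, x0 + 1):
        for qy in (y0, y0 + 1):
            if 0 <= qx < x.shape[0] and 0 <= qy < x.shape[1]:
                val += max(0, 1 - abs(qx - px)) * max(0, 1 - abs(qy - py)) * x[qx, qy]
    return val

def deform_conv_at(x, w, p0, offsets):
    """Deformable 3x3 convolution output at position p0.

    w       : (3, 3) kernel weights
    offsets : (9, 2) fractional offsets, one per sampling position
              (in a real network, learned by an extra conv layer)
    """
    # R: the regular 3x3 sampling grid of a standard convolution
    R = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]
    y = 0.0
    for n, (di, dj) in enumerate(R):
        # Deformed sampling position: p0 + pn + delta_pn
        px = p0[0] + di + offsets[n, 0]
        py = p0[1] + dj + offsets[n, 1]
        y += w[di + 1, dj + 1] * bilinear(x, px, py)
    return y

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3)) / 9.0
# With all-zero offsets this reduces exactly to a standard convolution:
print(deform_conv_at(x, w, (2, 2), np.zeros((9, 2))))  # mean of 3x3 patch = 12.0
```

In practice you would not write this loop yourself; optimized implementations exist, for example `torchvision.ops.DeformConv2d` in PyTorch, which takes the offset field as an extra input tensor.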
Network Modification Details
Deformable convolution layers are mostly applied in the last few layers of the convolutional network, as these are more likely to contain object-level semantic information than earlier layers, which extract more basic features such as shapes and edges. Experimental results have shown that applying deformable convolutions to the last three convolutional layers provides the best performance in tasks such as object detection and segmentation.
Advantages of using Deformable Convolutions
The advantages of the deformable convolution operation are clearly depicted in Figure 7. As you can see, there are four image triplets, where each image in a particular triplet depicts the receptive field with respect to a particular object. Had this been a normal convolution operation, the receptive field for all the objects in a given image would have been the same. But as you can notice, the receptive field in the case of deformable convolutions adapts to the scale of the object. Small-scale objects, such as the cars in the first triplet, have a smaller receptive field than large-scale objects. You can also notice that the receptive field for background regions is the largest, showing that a large receptive field is required to recognise the background as compared to foreground objects.
In this post, I tried to explain deformable convolutions, which are being widely applied in current object detection and segmentation models. The main reason they are gaining momentum is that they offer an internal mechanism that enables a convolutional neural network to model various spatial transformations. They provide an adaptive receptive field that is learned from the data and varies with the scale of the object.
- J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, Y. Wei. Deformable Convolutional Networks. arXiv preprint arXiv:1703.06211, 2017.