Review: DCN — Deformable Convolutional Networks, 2nd Runner Up in 2017 COCO Detection (Object…

Source: Deep Learning on Medium

By SH Tsang

After reviewing STN, this time I review DCN (Deformable Convolutional Networks), by Microsoft Research Asia (MSRA).

(a) Conventional Convolution, (b) Deformable Convolution, (c) Special Case of Deformable Convolution with Scaling, (d) Special Case of Deformable Convolution with Rotation

Conventional (regular) convolution operates on a pre-defined rectangular grid over an input image or a set of input feature maps, determined by the filter size, e.g. 3×3 or 5×5. However, the objects that we want to detect and classify can be deformed or occluded within the image.

In DCN, the grid is deformable in the sense that each grid point is moved by a learnable offset. The convolution then operates on these moved grid points, hence the name deformable convolution. Deformable RoI pooling works similarly. By using these two new modules, DCN improves the accuracy of DeepLab, Faster R-CNN, R-FCN, FPN, etc.

Finally, by using DCN+FPN+Aligned Xception, MSRA was the 2nd Runner Up in the COCO Detection Challenge and the 3rd Runner Up in the Segmentation Challenge. The paper was published in 2017 ICCV and has more than 200 citations. (SH Tsang @ Medium)


  1. Deformable Convolution
  2. Deformable RoI Pooling
  3. Deformable Position-Sensitive (PS) RoI Pooling
  4. Deformable ConvNets Using ResNet-101 & Aligned-Inception-ResNet
  5. Ablation Study & Results
  6. More Results on COCO Detection Challenge Using Aligned Xception

1. Deformable Convolution

Deformable Convolution
  • Regular convolution is operated on a regular grid R.
  • Deformable convolution is operated on R but with each point augmented by a learnable offset ∆pn.
  • A convolution layer over the same input generates 2N feature maps corresponding to the N 2D offsets ∆pn (an x-direction and a y-direction component for each offset), where N is the number of points in R.
Standard Convolution (Left), Deformable Convolution (Right)
  • As shown above, deformable convolution picks the values at different locations, and these locations are conditioned on the input image or feature maps.
  • Compared with atrous convolution: atrous convolution uses a larger but fixed dilation value, whereas in deformable convolution a different dilation value is applied at each point in the grid. (Atrous convolution is also called dilated convolution or the hole algorithm.)
  • Compared with Spatial Transformer Network (STN): STN performs a transform on the input image or feature maps, while deformable convolution can be treated as an extremely light-weight STN.
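
The sampling rule in the bullets above can be written as y(p0) = Σ w(pn) · x(p0 + pn + ∆pn), where fractional locations are read with bilinear interpolation. Below is a minimal pure-Python sketch for one output location on a single channel. The function names and the list-of-lists layout are my own choices for illustration, not the authors' implementation, and in a real layer the offsets would be predicted by the extra convolution rather than passed in by hand.

```python
import math

def bilinear_sample(x, py, px):
    """Read the 2D list x at a fractional location (py, px); zero outside the map."""
    h, w = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < h and 0 <= qx < w:
                # Bilinear kernel: product of two 1D "tent" weights.
                weight = max(0.0, 1 - abs(py - qy)) * max(0.0, 1 - abs(px - qx))
                value += weight * x[qy][qx]
    return value

# Regular 3x3 grid R with dilation 1, as (dy, dx) pairs in row-major order.
GRID_3X3 = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def deformable_conv_at(x, weights, offsets, p0):
    """Sum of w(pn) * x(p0 + pn + delta_pn) over the 9 grid points, for one p0."""
    y0, x0 = p0
    total = 0.0
    for (dy, dx), w_n, (oy, ox) in zip(GRID_3X3, weights, offsets):
        total += w_n * bilinear_sample(x, y0 + dy + oy, x0 + dx + ox)
    return total

# Toy 5x5 feature map (illustrative values only).
feature = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
ones = [1.0] * 9
no_shift = [(0.0, 0.0)] * 9      # zero offsets: behaves like a regular 3x3 conv
half_right = [(0.0, 0.5)] * 9    # every tap reads half a pixel to the right
print(deformable_conv_at(feature, ones, no_shift, (2, 2)))
print(deformable_conv_at(feature, ones, half_right, (2, 2)))
```

With all offsets set to zero the function reduces to a regular 3×3 convolution at that location, which is a handy sanity check.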

2. Deformable RoI Pooling

Deformable RoI Pooling
  • Regular RoI pooling converts an input rectangular region of arbitrary size into fixed size features.
  • In deformable RoI pooling, first, at the top path, we still need regular RoI pooling to generate the pooled feature map.
  • Then, a fully connected (fc) layer generates the normalized offsets p̂ij, which are transformed to the offsets ∆pij by element-wise multiplication with the RoI's width w and height h: ∆pij = γ · p̂ij ∘ (w, h), where γ=0.1 (equation at bottom right).
  • The offset normalization is necessary to make the offset learning invariant to RoI size.
  • Finally, at the bottom path, we perform deformable RoI pooling. The output feature map is pooled from bins shifted by these offsets.
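
To see why the normalization makes the offsets independent of RoI size, here is a small sketch of the scaling step. It assumes the transformation is the element-wise product with the RoI width and height, ∆pij = γ · p̂ij ∘ (w, h), with γ=0.1 as quoted above. The function name and the nested-list layout of the offsets are hypothetical.

```python
GAMMA = 0.1  # value of gamma quoted in the text

def scale_roi_offsets(normalized_offsets, roi_w, roi_h, gamma=GAMMA):
    """Turn normalized per-bin offsets (nx, ny) into pixel offsets for one RoI."""
    return [[(gamma * nx * roi_w, gamma * ny * roi_h) for (nx, ny) in row]
            for row in normalized_offsets]

# The same normalized offset for every bin of a 2x2 pooling grid (made-up numbers).
normalized = [[(0.5, -1.0), (0.5, -1.0)],
              [(0.5, -1.0), (0.5, -1.0)]]

small_roi = scale_roi_offsets(normalized, roi_w=40, roi_h=20)
large_roi = scale_roi_offsets(normalized, roi_w=80, roi_h=40)
print(small_roi[0][0])  # pixel shift for the small RoI
print(large_roi[0][0])  # twice as large for an RoI twice the size
```

The fc layer only has to learn a relative shift. The same prediction moves a bin by the same fraction of the RoI whether the RoI is small or large.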

3. Deformable Position-Sensitive (PS) RoI Pooling

Deformable Position-Sensitive (PS) RoI Pooling (Colors are important here)
  • In the original Position-Sensitive (PS) RoI pooling in R-FCN, all the input feature maps are first converted to k² score maps for each object class (C+1 classes in total: C object classes + 1 background). (It is better to read R-FCN to understand the original PS RoI pooling first. If interested, please read my review about it.)
  • In deformable PS RoI pooling, first, at the top path, similar to the original one, conv is used to generate 2k²(C+1) offset maps.
  • That means for each class there are 2k² maps: k² bins, each with an x-direction and a y-direction offset. The k² bins represent the {top-left (TL), top-center (TC), .. , bottom-right (BR)} parts of the object for which we want to learn the offsets.
  • PS RoI pooling is then applied to these offset maps (top path) in the original way: each bin is pooled from the map with the same area and the same color in the figure. This gives the offsets.
  • Finally, at the bottom path, we perform deformable PS RoI pooling to pool the feature maps augmented by the offsets.
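
The channel counts above are easy to check with a few lines of arithmetic. In this sketch k=3 (3×3 bins) and C=20 (the PASCAL VOC categories) are only illustrative values, and the function name is hypothetical.

```python
def ps_roi_channel_counts(k, num_classes):
    """Channels for PS RoI pooling: k*k*(C+1) score maps, twice that for (x, y) offsets."""
    per_group = k * k * (num_classes + 1)  # one map per bin per class (incl. background)
    return {"score_maps": per_group, "offset_maps": 2 * per_group}

counts = ps_roi_channel_counts(k=3, num_classes=20)
print(counts)
```

The factor of 2 is there because every bin of every class needs both an x-direction and a y-direction offset.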

4. Deformable ConvNets Using ResNet-101 & Aligned-Inception-ResNet

4.1. Aligned-Inception-ResNet

Aligned-Inception-ResNet Architecture (Left), Inception Residual Block (IRB) (Right)
  • In the original Inception-ResNet, proposed in Inception-v4, there is an alignment problem: for a cell on the feature maps close to the output, its projected spatial location on the image is not aligned with the location of its receptive field center.
  • In Aligned-Inception-ResNet, within the Inception Residual Block (IRB), all asymmetric convolutions used for factorization (e.g. 1×7, 7×1, 1×3, 3×1 conv) are removed. Only one type of IRB is used, as shown above. Also, the number of IRBs is different from that in either Inception-ResNet-v1 or Inception-ResNet-v2.
Error Rates on ImageNet-1K validation.

4.2. Modified ResNet-101 & Aligned-Inception-ResNet

  • Now we have two backbones for feature extraction, ResNet-101 and Aligned-Inception-ResNet, both originally designed for the image classification task.
  • However, their output feature map is too small, which is not good for object detection and segmentation tasks.
  • To fix this, at the beginning of the last block (conv5), the stride is changed from 2 to 1, and atrous convolution (or dilated convolution) is used in this block to compensate for the reduced stride.
  • Thus, the effective stride in the last convolutional block is reduced from 32 pixels to 16 pixels to increase the feature map resolution.
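
The 32 to 16 change follows from multiplying the per-stage strides. Here is a rough sketch, assuming five stride-2 downsampling steps in the classification backbone and a hypothetical 512×512 input. The stage list and helper names are made up for illustration.

```python
def effective_stride(stage_strides):
    """Overall stride is the product of the strides of all downsampling steps."""
    total = 1
    for s in stage_strides:
        total *= s
    return total

def feature_map_size(image_size, stride):
    """Spatial size of the output feature map (ceil division, padding ignored)."""
    return -(-image_size // stride)

classification_strides = [2, 2, 2, 2, 2]  # assumed: conv1, pooling, conv3, conv4, conv5
detection_strides = [2, 2, 2, 2, 1]       # conv5 stride changed from 2 to 1

for strides in (classification_strides, detection_strides):
    s = effective_stride(strides)
    print(s, feature_map_size(512, s))
```

Halving the effective stride doubles the feature map resolution in each dimension, so the last block sees four times as many spatial positions.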

4.3. Different Object Detectors

  • After feature extraction, different object detectors or segmentation schemes are used such as DeepLab, class-aware RPN (or treated as simplified SSD), Faster R-CNN and R-FCN.

5. Ablation Study & Results

Semantic Segmentation

  • PASCAL VOC, 20 categories, VOC 2012 dataset with additional mask annotations, 10,582 images for training, 1,449 images for validation. mIoU@V is used for evaluation.
  • Cityscapes, 19 categories + 1 background category, 2,975 images for training, 500 images for validation. mIoU@C is used for evaluation.

Object Detection

  • PASCAL VOC, union of VOC 2007 trainval and VOC 2012 trainval for training, VOC 2007 test for evaluation. mAP@0.5 and mAP@0.7 are used.
  • COCO, 120k images in the trainval, 20k images in the test-dev. mAP@[0.5:0.95] and mAP@0.5 are used for evaluation.
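
For readers new to these metrics, the numbers after "@" are IoU thresholds: a detection only counts as correct if its box overlaps the ground truth by at least that much, and mAP@[0.5:0.95] averages over the thresholds 0.5, 0.55, ..., 0.95. The sketch below shows only this matching criterion, not the full AP computation. The box format (x1, y1, x2, y2) and the helper names are my own assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# The ten thresholds averaged in mAP@[0.5:0.95].
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def thresholds_passed(iou_value):
    """Thresholds at which a detection with this IoU still counts as a match."""
    return [t for t in COCO_IOU_THRESHOLDS if iou_value >= t]

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(thresholds_passed(0.72))
```

This is why mAP@[0.5:0.95] is a stricter metric than mAP@0.5: a loosely localized box passes the 0.5 threshold but fails the higher ones.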

5.1. Applying Deformable Convolution on Different Number of Last Few Layers

Results of using deformable convolution in the last 1, 2, 3, and 6 convolutional layers (of 3×3 filter) in ResNet-101
  • Using either 3 or 6 deformable convolutions works well. In the end, 3 is chosen by the authors as a good trade-off across the different tasks.
  • We can also see that DCN improves DeepLab, class-aware RPN (or treated as simplified SSD), Faster R-CNN and R-FCN.

5.2. Analysis of Deformable Convolution Offset Distance

Analysis of deformable convolution in the last 3 convolutional layers
Examples: three levels of 3×3 deformable filters for three activation units (green points) on the background (left), a small object (middle), and a large object (right)
  • An analysis is also performed as above to illustrate the effectiveness of DCN. First, the deformable convolution filters are categorized into four classes: small, medium, large, and background, according to the ground truth bounding box annotation and where the filter center is.
  • Then, the mean and standard deviation of the dilation value (offset distance) are measured for each class.
  • It is found that the receptive field sizes of deformable filters are correlated with object sizes, indicating that the deformation is effectively learned from image content.
  • And the filter sizes on the background region are between those on medium and large objects, indicating that a relatively large receptive field is necessary for recognizing the background regions.
Offset parts in deformable (position-sensitive) RoI pooling in R-FCN and 3×3 bins (red) for an input RoI (yellow)
  • Similarly for deformable RoI pooling, now the parts are offset to cover the non-rigid objects.
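
A rough sketch of how a "dilation value" could be computed for one deformable 3×3 filter, assuming it is taken as the mean distance between horizontally and vertically adjacent sampling points (this is my reading of the analysis in Section 5.2, and the function name is hypothetical). A regular 3×3 filter then has dilation 1, and a filter whose points are spread twice as far apart has dilation 2.

```python
import math

def effective_dilation(offsets, k=3):
    """Mean distance between adjacent sampling points of a k x k deformable filter.

    offsets: k*k (dy, dx) pairs in row-major order, added to the regular grid.
    """
    half = k // 2
    # Absolute sampling positions: regular grid point plus its learned offset.
    pts = [[(r - half + offsets[r * k + c][0], c - half + offsets[r * k + c][1])
            for c in range(k)] for r in range(k)]
    dists = []
    for r in range(k):
        for c in range(k):
            if c + 1 < k:  # right-hand neighbour
                dists.append(math.hypot(pts[r][c + 1][0] - pts[r][c][0],
                                        pts[r][c + 1][1] - pts[r][c][1]))
            if r + 1 < k:  # neighbour below
                dists.append(math.hypot(pts[r + 1][c][0] - pts[r][c][0],
                                        pts[r + 1][c][1] - pts[r][c][1]))
    return sum(dists) / len(dists)

zero_offsets = [(0.0, 0.0)] * 9
# Offsets equal to the grid coordinates push every point twice as far from the center.
spread_offsets = [(float(dy), float(dx)) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
print(effective_dilation(zero_offsets))
print(effective_dilation(spread_offsets))
```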

5.3. Comparison with Atrous Convolution on PASCAL VOC

Comparison of Atrous Convolution & Deformable Convolution

5.4. Model Complexity and Runtime on PASCAL VOC

Model Complexity and Runtime
  • Deformable ConvNets add only a small overhead in model parameters and computation.
  • The significant performance improvement comes from the capability of modeling geometric transformations, rather than from the increase in model parameters.
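
A back-of-the-envelope count shows why the overhead is small. The sketch assumes the offsets for a 3×3 deformable layer come from one extra 3×3 convolution with 2N = 18 output channels and ignores biases. The 512-channel width is a hypothetical example, not a figure taken from the results table.

```python
def offset_branch_overhead(k, c_in, c_out):
    """Weights of a k x k conv layer vs. the extra conv that predicts its offsets."""
    main_weights = k * k * c_in * c_out
    offset_channels = 2 * k * k                  # (x, y) offset for each of the k*k points
    extra_weights = k * k * c_in * offset_channels
    return main_weights, extra_weights, extra_weights / main_weights

main, extra, ratio = offset_branch_overhead(k=3, c_in=512, c_out=512)
print(main, extra, f"{ratio:.1%}")
```

The offset branch has only 18 output channels regardless of how wide the main layer is, so its relative cost shrinks as the layer gets wider.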

5.5. Object Detection on COCO

Object Detection on COCO test-dev (M: Multi-Scale Testing with Shorter Side {480, 576, 688, 864, 1200, 1400}, B: Iterative Bounding Box Average)
  • The Deformable ConvNet version consistently outperforms the plain one.
  • With Aligned-Inception-ResNet, using R-FCN with Deformable ConvNet, plus multi-scale testing and iterative bounding box average, 37.5% mAP@[0.5:0.95] is obtained.

6. More Results on COCO Detection Challenge Using Aligned Xception

  • The above results are from the paper. The authors also presented new results at the ICCV 2017 conference.

6.1. Aligned Xception

Aligned Xception
  • The changes in Aligned Xception relative to the original Xception are shown in blue.
  • In brief, some of the max pooling operations are replaced by separable conv in the entry flow. The number of repeated blocks is increased from 8 to 16 in the middle flow. One more conv is added in the exit flow.

6.2. COCO Detection Challenge

Object Detection on COCO test-dev
  • With ResNet-101 as feature extractor and FPN+OHEM as object detector, 40.5% mAP is obtained, which is already higher than the results mentioned in the previous section.
  • Replacing ResNet-101 with Aligned Xception: 43.3% mAP.
  • With an ensemble of 6 models + other small enhancements: 50.7% mAP.
  • On the COCO 2017 detection challenge leaderboard, it obtains 50.4% mAP, which makes it the 2nd Runner Up in the challenge.
  • On the COCO 2017 segmentation challenge leaderboard, it obtains 42.6% mAP, which makes it the 3rd Runner Up in the challenge.


[2017 ICCV] [DCN]
Deformable Convolutional Networks

My Previous Reviews

Image Classification
[LeNet] [AlexNet] [ZFNet] [VGGNet] [SPPNet] [PReLU-Net] [STN] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet]

Object Detection
[OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [DeepID-Net] [R-FCN] [ION] [MultiPathNet] [NoC] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [FPN] [RetinaNet]

Semantic Segmentation
[FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [ParseNet] [DilatedNet] [PSPNet] [DeepLabv3]

Biomedical Image Segmentation
[CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet]

Instance Segmentation
[DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution