Source: Deep Learning on Medium
Encoder-Decoder Architecture Using TDM with Faster R-CNN
In this story, TDM (Top-Down Modulation) is briefly reviewed. By combining high-level and low-level features using TDM, many hard objects can be detected, which gives a significant boost on the COCO benchmark. I chose to review this paper because some later state-of-the-art approaches, such as YOLOv3 and RetinaNet, selected TDM for comparison, which shows that TDM has some importance in object detection. It is a 2017 arXiv tech report with over 70 citations. (SH Tsang @ Medium)
- TDM Network
- Details of TDM
- Ablation Study
1. TDM Network
- The bottom-up path is a standard conv path for feature extraction. However, the feature maps get smaller and smaller, and location information is lost.
- In the top-down path, TDM is used to enlarge the feature map gradually with the help of the feature maps from the bottom-up path, as in the basic TDM module.
- Finally, the ROI proposal and ROI classifier stages follow.
- Actually, there were many concurrent works on encoder-decoder architectures at that time, for example DSSD for object detection, SharpMask for instance segmentation, U-Net for biomedical image segmentation, and RED-Net for image restoration. TDM is the one based on Faster R-CNN for object detection.
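To make the shrinking and re-growing of the feature maps concrete, here is a minimal shape trace. The input size, the per-stage strides, and the function names are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical shape trace: input size and strides are illustrative assumptions.

def bottom_up_sizes(input_size, strides):
    """Return the spatial size of the feature map after each bottom-up stage."""
    sizes = []
    size = input_size
    for stride in strides:
        size = size // stride  # each strided stage shrinks the map
        sizes.append(size)
    return sizes


def top_down_sizes(bottom_up, num_tdm):
    """Return the output size of each top-down module, coarsest first.

    Each module is paired with one bottom-up (lateral) feature and
    produces an output at that feature's resolution.
    """
    return [bottom_up[-1 - k] for k in range(1, num_tdm + 1)]


bu = bottom_up_sizes(512, [2, 2, 2, 2, 2])
td = top_down_sizes(bu, num_tdm=3)
print("bottom-up:", bu)  # bottom-up: [256, 128, 64, 32, 16]
print("top-down: ", td)  # top-down:  [32, 64, 128]
```

If two neighbouring bottom-up stages happen to share the same resolution, the corresponding top-down step would not need any upsampling.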
2. Details of TDM
2.1. TDM Structure
- The bottom-up feature goes through a 3×3 conv (L2); this is called the lateral module.
- The top-down feature goes through a 3×3 conv (T3,2) and is then up-sampled to match the higher resolution if necessary. (No upsampling by T4.)
- They are then concatenated and go through a 1×1 conv (T2out) to become the output feature of the TDM.
- The output feature then goes to the next TDM as its top-down feature.
- A pre-trained bottom-up network is used.
- The top-down modules are added progressively, one at a time. For example, (L4, T5,4) is added first and trained for object detection; after that, (L3, T4,3) is added and trained, and so on.
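The data flow through one TDM module described above can be sketched in NumPy as follows. This is only a rough illustration under stated assumptions: the naive `conv2d` helper, nearest-neighbour upsampling, the channel counts, and the omission of biases and nonlinearities are my own simplifications, not details confirmed by the paper.

```python
import numpy as np


def conv2d(x, w):
    """Naive 'same' convolution. x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            # Contract over input channels for this kernel offset.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j],
                             xp[:, i:i + h, j:j + wd])
    return out


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling (an assumption; other schemes are possible)."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def tdm_module(bottom_up, top_down, w_lateral, w_topdown, w_out):
    """One top-down modulation step."""
    lateral = conv2d(bottom_up, w_lateral)  # lateral module: 3x3 conv
    top = conv2d(top_down, w_topdown)       # top-down module: 3x3 conv
    factor = bottom_up.shape[1] // top.shape[1]
    if factor > 1:                          # upsample only if resolutions differ
        top = upsample_nearest(top, factor)
    merged = np.concatenate([lateral, top], axis=0)  # channel-wise concat
    return conv2d(merged, w_out)            # 1x1 conv gives the module output


rng = np.random.default_rng(0)
c3 = rng.standard_normal((8, 16, 16))   # finer bottom-up feature (hypothetical)
c4 = rng.standard_normal((16, 8, 8))    # coarser top-down input (hypothetical)
w_l = rng.standard_normal((4, 8, 3, 3)) * 0.1
w_t = rng.standard_normal((4, 16, 3, 3)) * 0.1
w_o = rng.standard_normal((6, 8, 1, 1)) * 0.1
out = tdm_module(c3, c4, w_l, w_t, w_o)
print(out.shape)  # (6, 16, 16)
```

The output `out` would then serve as the top-down input of the next module, paired with the next finer bottom-up feature. In the progressive training scheme, each such module would be appended and trained before the next one is added.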
3. Ablation Study
3.1. How low should the Top-Down Modulation go?
- Skip-Pool: Similar to ION, instead of using top-down modules, features are obtained at different layers, then L2-normalized, concatenated and scaled back.
- For VGG-16+TDM, mAP degrades slightly from 29.9% to 29.8% when one more TDM is added. I guess convergence becomes difficult due to the absence of skip connections.
- For ResNet-101+TDM, 35.7% mAP is obtained.
- For Inception-ResNet-v2+TDM, an even higher 38.1% mAP is achieved.
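For contrast with TDM, the Skip-Pool baseline mentioned above (L2-normalize, concatenate, rescale) can be sketched roughly as below. It assumes the features from the different layers have already been pooled to a common spatial size (for example by ROI pooling); the function names and the `scale` value are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Give each spatial location's channel vector unit L2 norm. x: (C, H, W)."""
    norm = np.sqrt((x ** 2).sum(axis=0, keepdims=True))
    return x / (norm + eps)


def skip_pool(features, scale):
    """Normalize each layer's feature, concatenate along channels, then rescale.

    Assumes every feature in `features` has the same spatial size.
    """
    normalized = [l2_normalize(f) for f in features]
    return scale * np.concatenate(normalized, axis=0)


rng = np.random.default_rng(0)
f_low = np.ones((2, 7, 7))                # hypothetical pooled low-level feature
f_high = rng.standard_normal((3, 7, 7))   # hypothetical pooled high-level feature
pooled = skip_pool([f_low, f_high], scale=1.0)
print(pooled.shape)  # (5, 7, 7)
```

Normalizing before concatenation keeps features with large activations from dominating those from other layers; the scale factor then restores a magnitude suitable for the following layers.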
3.2. No Lateral Modules
- A large margin is obtained when the lateral module is used, which shows that the lateral module is important.
- Pre-training on COCO is a bit better.
4.1. Overall AP
- 128 ROIs for RPN and RCN are used.
- VGG-16+TDM (28.6%) is better than SharpMask (25.2%), which has a similar architecture.
- Using ResNet-101+TDM, 35.2% mAP is obtained.
- Using Inception-ResNet-v2+TDM, 37.3% mAP is obtained.
4.2. Improved Localization
- Looking at AP⁷⁵ and comparing TDM (bottom) with the baseline Faster R-CNN variants (middle), AP⁷⁵ is improved by a large margin.
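AP⁷⁵ only counts a detection as correct when its IoU with the ground-truth box is at least 0.75, so it rewards tight localization. The small sketch below illustrates this with a hypothetical pair of boxes that would match at the looser 0.5 threshold but not at 0.75; the box coordinates and helper names are made up for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def is_match(pred, gt, threshold):
    """A detection counts as correct if its IoU reaches the threshold."""
    return iou(pred, gt) >= threshold


gt = (0, 0, 100, 100)
pred = (20, 0, 120, 100)  # shifted 20 px to the right (hypothetical)
print(round(iou(pred, gt), 3))    # 0.667
print(is_match(pred, gt, 0.50))   # True
print(is_match(pred, gt, 0.75))   # False
```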
4.3. Improvement on Small Objects
- Looking at AP^S and comparing TDM (bottom) with the baseline Faster R-CNN variants (middle), AP^S is improved by a large margin as well.
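For reference, the COCO benchmark splits objects into small, medium, and large by area, with thresholds at 32² and 96² pixels; AP^S is the AP restricted to the small bucket. A minimal sketch of that bucketing, with a hypothetical helper name and simplified boundary handling:

```python
def coco_size_bucket(area):
    """Assign an object to a COCO size bucket by its area in pixels.

    Boundary handling here is a simplification for illustration.
    """
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


print(coco_size_bucket(20 * 20))    # small
print(coco_size_bucket(50 * 60))    # medium
print(coco_size_bucket(200 * 150))  # large
```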
4.4. Qualitative Results
[2017 arXiv] [TDM]
Beyond Skip Connections: Top-Down Modulation for Object Detection
My Related Reviews
[LeNet] [AlexNet] [ZFNet] [VGGNet] [SPPNet] [PReLU-Net] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet]