Review: GBD-Net / GBD-v1 & GBD-v2 — Winner of ILSVRC 2016 (Object Detection)

Source: Deep Learning on Medium

Gated Bi-Directional Network, Winner of the ILSVRC 2016 Object Detection Challenge


This time, GBD-Net (Gated Bi-Directional Network), by the Chinese University of Hong Kong (CUHK) and SenseTime, is reviewed. GBD-Net won the ILSVRC 2016 Object Detection Challenge. It was first proposed at 2016 ECCV, with over 30 citations, and was then extended and published in 2018 TPAMI, with more than 50 citations. (Sik-Ho Tsang @ Medium)

This story mainly presents the 2018 TPAMI extension, since it describes the method in much more detail.


  1. Problem
  2. GBD-v1
  3. GBD-v2
  4. Other Techniques
  5. Ablation Study
  6. Comparison with State-of-the-art Approaches

1. Problem

Potential Problems When We Classify the Object in a Candidate Box (Red) with Ground-Truth (Blue)
  • (a): The candidate box could contain a rabbit or a hamster.
  • (b): b2 may be treated as a false positive due to its small IoU.
  • (c) and (d): A rabbit head does not necessarily belong to a rabbit; it can belong to a human.
  • Thus, without information from the larger regions surrounding a candidate box, it is hard to distinguish the class labels.
  • First, contextual regions surrounding candidate boxes are a natural help.
  • Besides, surrounding regions also provide contextual information about background and other nearby objects to help detection.
  • Information from surrounding regions is therefore used to improve the classification of a candidate box.

2. GBD-v1

2.1. Overall Framework

GBD-v1 Overall Framework
  • The above is the framework of GBD-v1.
  • The Fast R-CNN pipeline is used.
  • First, a region proposal approach, such as Selective Search (SS), is used to generate a set of region proposals/candidate boxes.
  • After ROI pooling, each candidate box goes through the proposed GBD-v1.
  • The final feature maps are used for classification and bounding box regression, as used in Fast R-CNN.

2.2. Backbone

Inception-v2 as Backbone
ResNet-269 as Backbone
  • Later on, ResNet-269 is also used as the backbone. A better backbone gives better accuracy.

2.3. ROI Pooling with Different Resolutions and Support Regions

ROI Pooling with Different Resolutions and Support Regions
  • Starting from the candidate box (red), features are pooled at different resolutions and from different support regions.
  • Four values, p = {-0.2, 0.2, 0.8, 1.7}, are used to generate the different regions.
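
As a rough illustration of how the values of p could define the support regions, the sketch below assumes that each p scales the width and height of the candidate box by (1 + p) about its center. That convention, the (cx, cy, w, h) box format, and the helper names `padded_box` and `support_regions` are assumptions made for this example, not details stated above.

```python
# Values of p listed in the review.
PADDINGS = (-0.2, 0.2, 0.8, 1.7)

def padded_box(box, p):
    """Scale a (cx, cy, w, h) box about its center by a factor of (1 + p).

    Assumed convention: p < 0 shrinks the box, p > 0 enlarges it.
    """
    cx, cy, w, h = box
    return (cx, cy, (1.0 + p) * w, (1.0 + p) * h)

def support_regions(box, paddings=PADDINGS):
    """Return one support region per value of p, smallest region first."""
    return [padded_box(box, p) for p in sorted(paddings)]

# Hypothetical candidate box: center (100, 80), width 50, height 40.
regions = support_regions((100.0, 80.0, 50.0, 40.0))
for p, region in zip(sorted(PADDINGS), regions):
    print(p, region)
```

Under this assumption, the negative value of p gives a region tighter than the candidate box, while the larger values bring in progressively more surrounding context.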

2.4. Message Passing Using Gated Bi-Directional Structure

Naive Network Without Message Passing
  • The simplest way is to pass each support region through the network separately for classification.
  • But the features of these regions should be related to each other, since they all observe the same object.
Network With Message Passing
  • Thus, a bi-directional network is proposed here.
  • One direction connects from the small-size region to the large-size region.
  • The other connects from the large-size region to the small-size region.
  • Therefore, contexts from different regions can help each other through the bi-directional structure.
  • ⨂ is convolution, σ is ReLU (NOT Sigmoid) and cat() is concatenation.
  • However, sometimes a context region may not help for another context region, just like the human with rabbit head as in the first figure.
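
As a hedged sketch, the bi-directional structure without gates could be written with the symbols above roughly as follows. Here h⁰i is the pooled input feature of the i-th region (ordered from smallest to largest), h¹i collects messages passed from smaller regions, h²i collects messages passed from larger regions, and h³i is the merged output. The weight names w and bias names b, and the exact arrangement of terms, are assumptions for illustration.

```latex
\begin{align}
h_i^1 &= \sigma\!\left(h_i^0 \otimes w_i^1 + b_i^{0,1}\right)
       + \sigma\!\left(h_{i-1}^1 \otimes w_{i-1,i}^1 + b_i^1\right) \\
h_i^2 &= \sigma\!\left(h_i^0 \otimes w_i^2 + b_i^{0,2}\right)
       + \sigma\!\left(h_{i+1}^2 \otimes w_{i,i+1}^2 + b_i^2\right) \\
h_i^3 &= \sigma\!\left(\mathrm{cat}\!\left(h_i^1, h_i^2\right) \otimes w_i^3 + b_i^3\right)
\end{align}
```

In each of the first two lines, the second term is the message arriving from the neighboring region.
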
Network With Message Passing Using Gate Function
  • Therefore, a gate function is introduced before message passing.
  • The gate is context-dependent: the switch is opened or closed depending on the context.
  • The size of the gate filters is 3×3, not 1×1.
  • Sigm is the sigmoid function, • is the element-wise product, and G is the gate function based on sigmoid.
  • When G = 0, the message is not passed.
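
To make the gating concrete, here is a minimal NumPy sketch of a single gated message. It assumes the gate is computed from the input feature of the region that sends the message, using a 3×3 convolution followed by a sigmoid, and that the message itself is a 3×3 convolution followed by ReLU. The function names, argument names, tensor shapes and random weights are all hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv3x3(x, w, b):
    """'Same' 3x3 convolution (cross-correlation, as in most CNN libraries).

    x: (C_in, H, W), w: (C_out, C_in, 3, 3), b: (C_out,)
    """
    _, height, width = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))  # zero-pad the spatial dims only
    out = np.zeros((w.shape[0], height, width))
    for dy in range(3):
        for dx in range(3):
            patch = xp[:, dy:dy + height, dx:dx + width]
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out + b[:, None, None]

def gated_message(h0_src, h_src, gate_w, gate_b, msg_w, msg_b):
    """Message from a neighboring region, scaled element-wise by a gate.

    The gate G = sigm(conv(h0_src)) lies in (0, 1); where G is close to 0
    the message is suppressed. Feeding h0_src to the gate is an assumption.
    """
    gate = sigmoid(conv3x3(h0_src, gate_w, gate_b))
    message = relu(conv3x3(h_src, msg_w, msg_b))
    return gate * message

# Toy tensors: 4 channels on a 7x7 grid, random message weights.
rng = np.random.default_rng(0)
channels, size = 4, 7
h0_src = rng.standard_normal((channels, size, size))
h_src = rng.standard_normal((channels, size, size))
msg_w = rng.standard_normal((channels, channels, 3, 3))
msg_b = np.zeros(channels)
gate_w = np.zeros((channels, channels, 3, 3))  # gate driven by its bias only

open_gate = gated_message(h0_src, h_src, gate_w, np.full(channels, 50.0), msg_w, msg_b)
closed_gate = gated_message(h0_src, h_src, gate_w, np.full(channels, -50.0), msg_w, msg_b)
print(open_gate.shape, float(np.abs(closed_gate).max()))
```

With a strongly negative gate bias the message is effectively switched off, and with a strongly positive one it passes through almost unchanged. In a trained network the gate would of course depend on the content of the sending region rather than on a fixed bias.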

3. GBD-v2

3.1. Enhanced Version of GBD

  • The GBD network is enhanced.
  • Max pooling is used to merge the information from h¹i and h²i. This saves memory and computation compared with GBD-v1.
  • In addition, an identity mapping is added from h⁰i to h³i. A constant β is multiplied before adding.
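
The two changes in GBD-v2 can be sketched in a few lines of NumPy. This is a simplified illustration that assumes the merge is an element-wise maximum over h¹i and h²i and that β scales the merged message before it is added to h⁰i. The function name `gbd_v2_merge` and the optional `transform` argument, which stands in for any convolution and ReLU applied after the merge, are hypothetical.

```python
import numpy as np

def gbd_v2_merge(h0, h1, h2, beta=0.1, transform=None):
    """Merge two directional features and add them to the identity path.

    h0, h1, h2: arrays of identical shape (e.g. channels x height x width).
    transform: optional callable applied to the merged features; it stands
        in for the convolution + ReLU that would follow (an assumption).
    """
    # Element-wise max keeps the channel count of a single input, whereas
    # concatenating h1 and h2 would double it.
    merged = np.maximum(h1, h2)
    if transform is not None:
        merged = transform(merged)
    # Identity mapping from h0 plus the message scaled by beta.
    return h0 + beta * merged

h0 = np.ones((2, 3, 3))
h1 = np.full((2, 3, 3), 2.0)
h2 = np.full((2, 3, 3), -1.0)
out = gbd_v2_merge(h0, h1, h2, beta=0.1)
print(out[0, 0, 0])
```

Setting `beta=0` returns h⁰i unchanged, which shows that β directly controls how much the message can alter the original feature.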

4. Other Techniques

4.1. Candidate Box Generation

  • An improved version of CRAFT is used for generating candidate boxes.
  • There are 3 versions.
  • Craft-v1: CRAFT pre-trained from 1000-class ImageNet.
  • Craft-v2: CRAFT used in GBD-v1, the 2016 ECCV paper, but pre-trained from the Region Proposal Network (RPN) used in Faster R-CNN.
  • Craft-v3: A modified CRAFT used in GBD-v2, the 2018 TPAMI paper, where random cropping is used during training and a multi-scale pyramid is used during testing. Also, the ratio of positive to negative samples in RPN training is 1:1. Another set of proposals is added using LocNet.

4.2. Others

  • Multi-Scale Testing: With a trained model, feature maps are computed on an image pyramid, with the shorter side of the image being {400, 500, 600, 700, 800} and the longer side being no greater than 1,000.
  • Left-Right Flip: Adopted in both training and testing.
  • Bounding Box Voting: The bounding box voting in MR-CNN & S-CNN is used.
  • Non-Maximum Suppression (NMS) Threshold: For ImageNet, the NMS threshold was set as 0.3 by default. It is empirically found that 0.4 is a better threshold.
  • Global Context: Starting from the pretrained network, the ImageNet detection data is also treated as an image classification problem, which means the ROI region is the whole image. The resulting 200-class image classification scores are then combined with the 200-class object detection scores by weighted averaging.
  • Model Ensemble: 6 models are used for ensembling.
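
Three of the items above are easy to sketch in plain Python: choosing the resize factor for each pyramid level, greedy NMS with an IoU threshold, and the weighted averaging of detection and whole-image classification scores. This is only an illustration. The function names, the (x1, y1, x2, y2) box format and the fusion weight of 0.5 are assumptions; the review does not state the weight actually used.

```python
SHORT_SIDES = (400, 500, 600, 700, 800)  # shorter-side targets from the review
MAX_LONG_SIDE = 1000                     # longer-side cap from the review

def pyramid_scale(height, width, short_side, max_long_side=MAX_LONG_SIDE):
    """Resize factor that brings the shorter side to short_side, reduced if
    the longer side would otherwise exceed max_long_side."""
    scale = short_side / min(height, width)
    if scale * max(height, width) > max_long_side:
        scale = max_long_side / max(height, width)
    return scale

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.4):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep

def fuse_scores(det_scores, cls_scores, weight=0.5):
    """Weighted average of per-class detection and whole-image scores.
    The weight is a placeholder; the review does not give a value."""
    return [(1.0 - weight) * d + weight * c for d, c in zip(det_scores, cls_scores)]

# Hypothetical 480x640 image and three hypothetical boxes.
scales = [pyramid_scale(480, 640, s) for s in SHORT_SIDES]
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, [0.9, 0.8, 0.7], threshold=0.4)
fused = fuse_scores([0.2, 0.8], [0.6, 0.4])
print(scales, kept, fused)
```

A higher NMS threshold suppresses fewer overlapping boxes, so moving from 0.3 to 0.4 keeps slightly more detections per image.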

5. Ablation Study

5.1. The Effect of Multiple Resolutions

The Effect of Multiple Resolutions Using Inception-v2 as Backbone
  • Using four resolutions obtains the highest mAP of 48.9%.

5.2. CRAFT Versions

Recall Rate on ImageNet val2
  • The modifications to Craft-v2, i.e. Craft-v3, improve the recall rate, as shown above.

5.3. Different Scaling Factor β

Different Scaling Factor β in Controlling the Magnitude of Message on ImageNet val2 Using Inception-v2 as Backbone
  • Different values of the scaling factor β, which controls the magnitude of the message, are also tested. β = 0.1 gives the best mAP of 53.6%.

5.4. Different Deep Models as Backbone

Different Deep Models as Backbone (“+I” = Pre-Activation ResNet with Identity Mapping, “+S” = Stochastic Depth (SD))

5.5. 6 Deep Models for Ensembling

6 Deep Models for Ensembling
  • Different backbones have different accuracy on different object classes, so they can help each other when ensembled.
  • Finally, the above 6 models are chosen, and 66.9% mAP is obtained.

5.6. Including Other Techniques

Including Other Techniques
  • The details are as above. With all of the above techniques, mAP improves from 56.6% to 68%.
  • The GBD technique alone improves mAP only from 56.6% to 58.8%, i.e. 2.2 of the 11.4 points, so it contributes just a part of the overall improvement.

6. Comparison with State-of-the-art Approaches

6.1. Object Detection on ImageNet val2

Object Detection on ImageNet val2, sgl: Single Model, avg: Averaged Model (Ensembling)

6.2. Object Detection on ImageNet test set Without Using External Data for Training

Object Detection on ImageNet Test Set
  • GBD-v2 outperforms many state-of-the-art approaches including GoogLeNet, ResNet, Trimps-Soushen and Hikvision (1st Runner-Up 2016). (Maybe I will review Hikvision later when I have time.)

6.3. Object Detection on MS COCO

Object Detection on MS COCO
  • Again, GBD-v2 outperforms state-of-the-art approaches such as Faster R-CNN, ION and SSD.


[2016 ECCV] [GBD-Net / GBD-v1]
Gated Bi-directional CNN for Object Detection

[2018 TPAMI] [GBD-Net / GBD-v2]
Crafting GBD-Net for Object Detection

My Previous Reviews

Image Classification
[LeNet] [AlexNet] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [MSDNet]

Object Detection
[OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

Semantic Segmentation
[FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [PSPNet] [DeepLabv3] [DRN]

Biomedical Image Segmentation
[CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net]

Instance Segmentation
[SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution

Human Pose Estimation
[DeepPose] [Tompson NIPS’14]