Review: G-RMI — Winner in 2016 COCO Detection (Object Detection)

Source: Deep Learning on Medium


A Guide to Select a Detection Architecture: Faster R-CNN, R-FCN and SSD


This time, G-RMI, the Google Research and Machine Intelligence team that won 1st place in the 2016 MS COCO detection challenge, is reviewed. G-RMI is the team name used in the challenge, not the name of a proposed approach, because they did not win by introducing an innovative new deep learning architecture. The paper title, “Speed/accuracy trade-offs for modern convolutional object detectors”, also gives a hint: they systematically investigated different kinds of object detectors and feature extractors.

They also analysed the effects of other parameters, such as input image size and the number of region proposals. Finally, an ensemble of several models achieved state-of-the-art results and won the challenge. The paper was published at 2017 CVPR and has more than 400 citations. (SH Tsang @ Medium)


Outline

  1. Meta-architectures
  2. Feature Extractors
  3. Accuracy vs Time
  4. Effect of Feature Extractor
  5. Effect of Object Size
  6. Effect of Image Size
  7. Effect of the Number of Proposals
  8. FLOPs Analysis
  9. Memory Analysis
  10. Good localization at .75 IOU means good localization at all IOU thresholds
  11. State-of-the-art Detection Results on COCO

1. Meta-architectures

The object detectors themselves are referred to as meta-architectures here. Three meta-architectures are investigated: Faster R-CNN, R-FCN, and SSD.

Abstract Architecture

SSD

  • It uses a single feed-forward convolutional network to directly predict classes and anchor offsets without requiring a second stage per-proposal classification operation.
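The anchor-decoding step behind this single-shot prediction can be sketched as follows (an illustrative NumPy sketch; the variance/scaling constants that real SSD implementations apply to the offsets are omitted):

```python
import numpy as np

def decode_ssd_offsets(anchors, offsets):
    """Decode predicted (dx, dy, dw, dh) offsets against anchor boxes.

    anchors: (N, 4) array of (cx, cy, w, h) anchor boxes.
    offsets: (N, 4) regressed offsets, one row per anchor.
    In SSD, every anchor gets class scores and one such offset vector in
    a single forward pass -- there is no second per-proposal stage.
    """
    cx = anchors[:, 0] + offsets[:, 0] * anchors[:, 2]   # shift centre by anchor width
    cy = anchors[:, 1] + offsets[:, 1] * anchors[:, 3]   # shift centre by anchor height
    w = anchors[:, 2] * np.exp(offsets[:, 2])            # scale width exponentially
    h = anchors[:, 3] * np.exp(offsets[:, 3])            # scale height exponentially
    return np.stack([cx, cy, w, h], axis=1)

anchors = np.array([[0.5, 0.5, 0.2, 0.2]])
decoded = decode_ssd_offsets(anchors, np.zeros((1, 4)))  # zero offsets reproduce the anchor
```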

Faster R-CNN

  • In the first stage, called the region proposal network (RPN), images are processed by a feature extractor (e.g., VGG-16), and features at some selected intermediate level (e.g., “conv5”) are used to predict class-agnostic box proposals.
  • In the second stage, these (typically 300) box proposals are used to crop features from the same intermediate feature map (ROI pooling) which are subsequently fed to the remainder of the feature extractor (e.g., “fc6” followed by “fc7”) in order to predict a class and class-specific box refinement for each proposal.
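The ROI pooling step in the second stage can be sketched as follows (a simplified single-channel sketch; real implementations pool every channel and use finer output grids such as 7×7):

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=2):
    """Standard ROI max-pooling (Faster R-CNN second stage, sketch).

    feature_map: (H, W) single-channel feature map.
    roi: (x0, y0, x1, y1) proposal box in feature-map coordinates.
    The ROI is split into an out_size x out_size grid and each cell is
    max-pooled, giving a fixed-size feature for the per-proposal head.
    """
    x0, y0, x1, y1 = roi
    bw = (x1 - x0) / out_size
    bh = (y1 - y0) / out_size
    out = np.empty((out_size, out_size))
    for i in range(out_size):        # grid row
        for j in range(out_size):    # grid column
            ys = slice(int(y0 + i * bh), max(int(y0 + (i + 1) * bh), int(y0 + i * bh) + 1))
            xs = slice(int(x0 + j * bw), max(int(x0 + (j + 1) * bw), int(x0 + j * bw) + 1))
            out[i, j] = feature_map[ys, xs].max()
    return out

pooled = roi_pool(np.arange(16.0).reshape(4, 4), (0, 0, 4, 4))
```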

R-FCN

  • Similar to Faster R-CNN, there is an RPN in the first stage.
  • In the second stage, position-sensitive score maps are used such that crops (ROI pooling) are taken from the last layer of features prior to prediction. This makes the per-ROI computation very cheap, as nearly all operations are shared before ROI pooling.
  • Thus, it achieves accuracy comparable to Faster R-CNN, often at a faster running time.
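Position-sensitive ROI pooling can be sketched as follows (a simplified single-class sketch; R-FCN uses k×k maps per class and typically k = 7):

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k=3):
    """Simplified position-sensitive ROI pooling (R-FCN, single class).

    score_maps: (k*k, H, W) -- one shared map per spatial bin of the ROI grid.
    roi: (x0, y0, x1, y1) in feature-map coordinates.
    Each of the k*k bins average-pools only from its own dedicated map, so
    almost all computation is shared across ROIs before this step.
    """
    x0, y0, x1, y1 = roi
    bin_w = (x1 - x0) / k
    bin_h = (y1 - y0) / k
    votes = np.empty((k, k))
    for i in range(k):        # bin row
        for j in range(k):    # bin column
            m = score_maps[i * k + j]
            ys = slice(int(y0 + i * bin_h), max(int(y0 + (i + 1) * bin_h), int(y0 + i * bin_h) + 1))
            xs = slice(int(x0 + j * bin_w), max(int(x0 + (j + 1) * bin_w), int(x0 + j * bin_w) + 1))
            votes[i, j] = m[ys, xs].mean()
    return votes.mean()       # averaging the bin votes gives the ROI score
```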

2. Feature Extractors

Six feature extractors are tried: VGG-16, ResNet-101, Inception-v2, Inception-v3, Inception-ResNet-v2 and MobileNetV1.

Top-1 classification accuracy on ImageNet
  • For different feature extractors, different intermediate layers are used to extract features for object detection.
  • Some modifications are made for some feature extractors, such as using dilated (atrous) convolutions or reducing the max-pooling stride, so that the output stride is not too large after feature extraction.
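The dilation trick works because halving a pooling stride can be compensated by dilating the following convolutions by 2, which preserves their receptive field at a denser output resolution. A minimal 1-D sketch:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=2):
    """1-D dilated ('atrous') convolution with valid padding.

    A kernel of length k with dilation d covers a span of (k-1)*d + 1
    input samples, so a 3-tap kernel dilated by 2 sees as far as a
    dense 5-tap kernel while using only 3 weights.
    """
    k = len(kernel)
    span = (k - 1) * dilation + 1
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        # sample the input every `dilation` steps instead of densely
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out

out = dilated_conv1d(np.arange(6.0), [1.0, 1.0, 1.0], dilation=2)
```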

3. Accuracy vs Time

Accuracy vs Time, The dotted Line is Optimality Frontier
Test-dev performance of the “critical” points along our optimality frontier
  • Colors: Feature Extractors
  • Marker shapes: Meta-architectures

3.1. General Observations

  • R-FCN and SSD are faster on average.
  • Faster R-CNN is slower but more accurate, requiring at least 100 ms per image.

3.2. Critical Points on Optimality Frontier

Fastest: SSD w/MobileNet

Sweet Spot: R-FCN w/ResNet or Faster R-CNN w/ResNet and only 50 proposals

  • There is an “elbow” in the middle of the optimality frontier occupied by R-FCN models using ResNet feature extractors.
  • This is the best balance between speed and accuracy among the model configurations.

Most Accurate: Faster R-CNN w/Inception-ResNet at stride 8

  • Faster R-CNN with dense output Inception-ResNet-v2 models attain the best possible accuracy on our optimality frontier.
  • Yet, these models are slow, requiring nearly a second of processing time.

4. Effect of Feature Extractor

Accuracy of detector (mAP on COCO) vs accuracy of feature extractor
  • Intuitively, stronger performance on classification should be positively correlated with stronger performance on COCO detection.
  • This correlation appears to only be significant for Faster R-CNN and R-FCN while the performance of SSD appears to be less reliant on its feature extractor’s classification accuracy.

5. Effect of Object Size

Accuracy stratified by object size, meta-architecture and feature extractor, image resolution is fixed to 300
  • All methods do much better on large objects.
  • SSDs typically have (very) poor performance on small objects, but still SSDs are competitive with Faster R-CNN and R-FCN on large objects.
  • Later on, DSSD was proposed to address the small-object detection issue.

6. Effect of Image Size

Effect of image resolution
  • Decreasing resolution by a factor of two in both dimensions consistently lowers accuracy (by 15.88% on average) but also reduces inference time by a relative factor of 27.4% on average.
  • High resolution inputs allow for small objects to be resolved.
  • High resolution models lead to significantly better mAP results on small objects (by a factor of 2 in many cases) and somewhat better mAP results on large objects as well.

7. Effect of the Number of Proposals

Faster R-CNN (Left), R-FCN (Right)

We can output a different number of proposals from the RPN (the first stage): fewer proposals give a faster running time, and vice versa.

Faster R-CNN

  • Inception-ResNet, which has 35.4% mAP with 300 proposals can still have surprisingly high accuracy (29% mAP) with only 10 proposals.
  • The sweet spot is probably at 50 proposals, where we are able to obtain 96% of the accuracy of using 300 proposals while reducing running time by a factor of 3.

R-FCN

  • The computational savings from using fewer proposals in the R-FCN setting are minimal.
  • This is not surprising because, as mentioned, per-ROI computation is cheap for R-FCN thanks to the shared computation of the position-sensitive score maps.

Comparison between Faster R-CNN and R-FCN

  • At 100 proposals, the speed and accuracy of Faster R-CNN models with ResNet become roughly comparable to those of equivalent R-FCN models using 300 proposals, in both mAP and GPU speed.

8. FLOPs Analysis

FLOPs vs Time
  • For denser block models such as ResNet-101, FLOPs/GPU time is typically greater than 1.
  • For Inception and MobileNet models, this ratio is typically less than 1.
  • Perhaps factorization reduces FLOPs but adds overhead in memory I/O, or current GPU libraries (cuDNN) are simply more optimized for dense convolution.
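A back-of-the-envelope count shows how much factorization reduces FLOPs in the first place. The sketch below compares a dense 3×3 convolution with a MobileNet-style depthwise-separable factorization of the same layer (the 56×56×64 layer size is illustrative, not taken from the paper):

```python
def conv_flops(h, w, c_in, c_out, k, stride=1):
    """Multiply-accumulate count for one dense 2-D convolution layer
    (same padding; biases ignored)."""
    out_h, out_w = h // stride, w // stride
    return out_h * out_w * c_out * k * k * c_in

def separable_flops(h, w, c_in, c_out, k, stride=1):
    """The same layer factorized MobileNet-style: a k x k depthwise
    convolution followed by a 1 x 1 pointwise convolution."""
    out_h, out_w = h // stride, w // stride
    depthwise = out_h * out_w * c_in * k * k
    pointwise = out_h * out_w * c_out * c_in
    return depthwise + pointwise

dense = conv_flops(56, 56, 64, 64, 3)       # dense 3x3 convolution
factored = separable_flops(56, 56, 64, 64, 3)  # ~8x fewer multiply-adds
```

The factorized layer does far less arithmetic, but each of its two smaller operations re-reads the feature map from memory, which is one plausible reason the FLOPs/GPU-time ratio drops below 1 for these models.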

9. Memory Analysis

Memory (Mb) vs Time
  • Memory usage is highly correlated with running time: larger and more powerful feature extractors require much more memory.
  • As with speed, MobileNet is the cheapest, requiring less than 1 GB (total) of memory in almost all settings.

10. Good localization at .75 IOU means good localization at all IOU thresholds

Overall COCO mAP (@[.5:.95]) for all experiments plotted against corresponding mAP@.50IOU and mAP@.75IOU
  • Both mAP@.5 and mAP@.75 performances are almost perfectly linearly correlated with mAP@[.5:.95].
  • mAP@.75 is slightly more tightly correlated with mAP@[.5:.95] (with R² > 0.99), so if we were to replace the standard COCO metric with mAP at a single IOU threshold, IOU=.75 is likely to be chosen.
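An R² figure like this comes from an ordinary least-squares fit. A minimal sketch (the mAP values below are made up for illustration, not taken from the paper):

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination for a least-squares line y ~ a*x + b."""
    a, b = np.polyfit(x, y, 1)            # fit a straight line
    ss_res = np.sum((y - (a * x + b)) ** 2)   # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
    return 1.0 - ss_res / ss_tot

map75 = np.array([24.0, 28.5, 31.0, 34.2])   # hypothetical mAP@.75 values
overall = 0.9 * map75 - 1.5                  # perfectly linear, so R^2 = 1
r2 = r_squared(map75, overall)
```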

11. State-of-the-art Detection Results on COCO

11.1. Ensembling and Multicrop

Summary of 5 Faster R-CNN single models
  • Since mAP is the main objective in the COCO detection challenge, the most accurate, though time-consuming, Faster R-CNN models are considered.
  • The diversity of their results encourages ensembling.
Performance on the 2016 COCO test-challenge dataset.
  • G-RMI: ensembling the above 5 models with multicrop inference yielded the final model. It outperforms the 2015 winner and the 2nd-place entry in 2016.
  • The 2015 winner used ResNet + Faster R-CNN + NoCs. Trimps-Soushen (2nd place in 2016) used Faster R-CNN with an ensemble of multiple models and improvements from other papers. (There is a paper for NoCs, but there are no details about Trimps-Soushen.)
  • Note: no multiscale training, horizontal flipping, box refinement, box voting, or global context is used.
Effects of ensembling and multicrop inference.
  • 2nd row: a hand-selected ensemble of 6 Faster R-CNN models (3 ResNet-101 and 3 Inception-ResNet-v2).
  • 3rd row: the diverse ensemble from the first table in this section.
  • It is encouraging that diversity did help against a hand-selected ensemble.
  • Ensembling and multicrop were responsible for almost 7 points of improvement over a single model.
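This review does not spell out the merging procedure, but a minimal way to combine detections from several models is to pool all of their boxes and apply greedy non-maximum suppression (an illustrative sketch, not the authors' exact method):

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def ensemble_nms(all_boxes, all_scores, thresh=0.5):
    """Pool detections from several models and keep the highest-scoring
    mutually non-overlapping boxes (greedy NMS). Returns kept indices."""
    order = sorted(range(len(all_scores)), key=lambda i: -all_scores[i])
    keep = []
    for i in order:
        if all(iou(all_boxes[i], all_boxes[j]) < thresh for j in keep):
            keep.append(i)
    return keep

# Two models detect the same object (overlapping boxes); one box survives.
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = ensemble_nms(boxes, scores)
```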

11.2. Detections from 5 Different Models

Example detections from the final model are shown for three scenes: a beach, a baseball game, and elephants.