Review: MR-CNN & S-CNN — Multi-Region & Semantic-aware CNNs (Object Detection)

Source: Deep Learning on Medium


Using Multi-Region Features and Semantic Segmentation Features for Object Detection

PASCAL VOC 2012 Dataset

In this story, an object detection approach using MR-CNN & S-CNN, by Université Paris-Est, is reviewed. Two Convolutional Neural Network (CNN) pathways are proposed:

  • Multi-Region CNN (MR-CNN): Object representation using multiple regions to capture several different aspects of an object.
  • Segmentation-aware CNN (S-CNN): Semantic segmentation information is also utilized to improve the object detection accuracy.

In addition, a localization mechanism for refining the bounding boxes is proposed. This is a 2015 ICCV paper with more than 200 citations. (Sik-Ho Tsang @ Medium)


Outline

  1. Multi-Region CNN (MR-CNN)
  2. Segmentation-aware CNN (S-CNN)
  3. Object Localization
  4. Results

1. Multi-Region CNN (MR-CNN)

Multi-Region CNN (MR-CNN)

1.1. Network Architecture

  • First, the input image goes through the Activation Maps Module, as shown above, which outputs the activation maps.
  • Region proposals, or bounding box candidates, are generated using Selective Search.
  • For each bounding box candidate B, a set of regions {Ri}, with i=1 to k, is generated. That is why the model is called multi-region. More details about the choice of regions are given in the next sub-section.
  • ROI pooling is performed for each region Ri. The pooled (cropped) region then goes through the fully connected (FC) layers of its own Region Adaptation Module.
  • Finally, the outputs from all FC layers are concatenated to form a 1D feature vector, which is the object representation of the bounding box B.
  • Here, a VGG-16 model pretrained on ImageNet is used. The max pooling layer after the last conv layer is removed.
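
The pooling-and-concatenation steps above can be sketched in a few lines of plain Python. This is only a toy illustration, not the authors' implementation: it assumes a single-channel activation map stored as a list of lists, boxes given in feature-map cells as (x1, y1, x2, y2) with exclusive upper bounds, and stand-in callables in place of the FC layers. The names roi_max_pool and multi_region_feature are hypothetical.

```python
def roi_max_pool(fmap, box, out_size=2):
    """Max-pool the cells of `fmap` inside `box` into out_size x out_size bins.

    `fmap` is a 2D list (single channel, a simplifying assumption) and
    `box` is (x1, y1, x2, y2) in feature-map cells, upper bounds exclusive.
    """
    x1, y1, x2, y2 = box
    pooled = []
    for gy in range(out_size):
        # Row range covered by this output bin (always at least one row).
        ys = y1 + (y2 - y1) * gy // out_size
        ye = max(ys + 1, y1 + (y2 - y1) * (gy + 1) // out_size)
        for gx in range(out_size):
            xs = x1 + (x2 - x1) * gx // out_size
            xe = max(xs + 1, x1 + (x2 - x1) * (gx + 1) // out_size)
            pooled.append(
                max(fmap[y][x] for y in range(ys, ye) for x in range(xs, xe))
            )
    return pooled


def multi_region_feature(fmap, regions, adapt_fns):
    """Pool every region, pass it through its own adaptation function
    (a stand-in for the FC layers), and concatenate the results."""
    feature = []
    for region, adapt in zip(regions, adapt_fns):
        feature.extend(adapt(roi_max_pool(fmap, region)))
    return feature


# Toy 4x4 activation map with values 0..15.
fmap = [[4 * y + x for x in range(4)] for y in range(4)]
regions = [(0, 0, 4, 4), (0, 0, 2, 4)]  # full box and its left half
identity = lambda v: v                   # placeholder for the FC layers
feature = multi_region_feature(fmap, regions, [identity, identity])
```

With k regions and an adaptation output of length d per region, the final representation has length k·d, which is why adding regions directly enlarges the object descriptor.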

1.2. Region Components

Regions Used in Multi-Region CNN
  • There are two types of regions: Rectangles ((a)-(f)) and rectangular rings ((g)-(j)), as shown above.
  • Original box (a): The one used in R-CNN.
  • Half boxes, (b)-(e): These regions aim to make the object representation more robust with respect to occlusion.
  • Central Regions, (f)-(g): These regions aim to make the object representation less subject to interference from neighbouring objects or from the background.
  • Border Regions, (h)-(i): These regions aim to make the object representation more sensitive to inaccurate localization.
  • Context Region (j): This region focuses on the contextual appearance that surrounds the object.
  • For the rectangular rings, the masked-out inner part is set to zero.
  • There are two reasons why using these regions helps, as described below.

Discriminative feature diversification

  • Using multiple regions helps diversify the discriminative factors captured by the overall recognition model. An ablation study is performed with Model A, which uses (a) and (i), and Model B, which uses (a) and a modified (i) that has the same size as (a). On the PASCAL VOC 2007 test set, Model A obtains 64.1% mAP and Model B obtains 62.9% mAP, which is 1.2 percentage points lower than Model A.

Localization-aware representation

  • The use of multi-region imposes soft constraints regarding the visual content allowed on each type of region for a given candidate detection box.
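
To make the region types of Section 1.2 concrete, here is a minimal sketch that derives all ten regions from one candidate box. Each region is an (outer, inner) pair, where inner is the masked-out hole of a rectangular ring, or None for a plain rectangle. The scale factors (0.5, 0.3/0.8, 0.5/1.0, 0.8/1.5, 1.0/1.8) are assumptions chosen for illustration and are not stated in this review; the helper names are hypothetical as well.

```python
def scale_box(box, factor):
    """Scale (x1, y1, x2, y2) about its center by `factor`."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * factor / 2.0
    half_h = (y2 - y1) * factor / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def multi_regions(box):
    """Return {name: (outer_box, inner_hole_or_None)} for one candidate box.

    All scale factors below are illustrative assumptions.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return {
        # (a) original box, as used in R-CNN
        "a_original": (box, None),
        # (b)-(e) half boxes, for robustness to occlusion
        "half_left": ((x1, y1, cx, y2), None),
        "half_right": ((cx, y1, x2, y2), None),
        "half_top": ((x1, y1, x2, cy), None),
        "half_bottom": ((x1, cy, x2, y2), None),
        # (f)-(g) central regions
        "f_central": (scale_box(box, 0.5), None),
        "g_central_ring": (scale_box(box, 0.8), scale_box(box, 0.3)),
        # (h)-(i) border regions
        "h_border_ring": (box, scale_box(box, 0.5)),
        "i_border_ring": (scale_box(box, 1.5), scale_box(box, 0.8)),
        # (j) context region: a ring around the original box
        "j_context_ring": (scale_box(box, 1.8), box),
    }


regions = multi_regions((0, 0, 100, 100))
```

For a ring, the features are pooled from the outer box while the activations falling inside the inner hole are zeroed, which matches the note that the masked-out part is set to zero.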

2. Segmentation-aware CNN (S-CNN)

Multi-Region CNN (MR-CNN) Extended with Segmentation-aware CNN (S-CNN)
  • There is a close connection between segmentation and detection, and segmentation-related cues are empirically known to often help object detection.
  • Two modules are added: Activation maps module for semantic segmentation-aware features, and region adaptation module for semantic segmentation-aware features.
  • There is no additional annotation used for training here.
  • FCN is used for the activation maps module.
  • The number of channels in the last FC7 layer is changed from 4096 to 512.
Bounding Box (Left), Segmentation Mask Based on Bounding Box (Middle), Foreground Probabilities (Right)
  • A weakly supervised training strategy is used. Artificial class-specific foreground segmentation masks are created using the bounding box annotations.
  • More specifically, the ground truth bounding boxes of an image are projected onto the spatial domain of the last hidden layer of the FCN, and the "pixels" that lie inside the projected boxes are labelled as foreground while the rest are labelled as background.
  • After training the FCN using these masks, the last classification layer is dropped. Only the rest of the FCN is used.
  • Though the training is only weakly supervised, the foreground probabilities shown above still carry some useful information.
  • The bounding box used is the original bounding box enlarged by a factor of 1.5.
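
The weak supervision described above can be sketched as follows. This is a simplified illustration under stated assumptions: the feature map is taken to have a fixed stride relative to the image (16 is assumed here), and a cell counts as foreground when its center falls inside a ground truth box. The review does not spell out the exact projection rule, and the function name weak_foreground_masks is hypothetical.

```python
def weak_foreground_masks(annotations, image_w, image_h, stride=16):
    """Build one binary mask per class from bounding box annotations.

    `annotations` is a list of (class_name, (x1, y1, x2, y2)) in image pixels.
    A feature-map cell is foreground (1) if its center lies inside any box
    of that class; this center rule and the stride are assumptions.
    """
    cols, rows = image_w // stride, image_h // stride
    masks = {}
    for label, (x1, y1, x2, y2) in annotations:
        mask = masks.setdefault(label, [[0] * cols for _ in range(rows)])
        for r in range(rows):
            for c in range(cols):
                # Center of this cell in image coordinates.
                px, py = (c + 0.5) * stride, (r + 0.5) * stride
                if x1 <= px < x2 and y1 <= py < y2:
                    mask[r][c] = 1
    return masks


masks = weak_foreground_masks(
    [("dog", (0, 0, 32, 32)), ("cat", (32, 32, 64, 64))],
    image_w=64,
    image_h=64,
)
```

No pixel-level labels are needed for these targets, which is consistent with the note that no additional annotation is used for training.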

3. Object Localization

3.1. CNN Region Adaptation Module for Bounding Box Regression

  • An extra region adaptation module is trained to predict the object bounding box.
  • It consists of two hidden FC layers and one prediction layer that outputs 4 values (i.e., a bounding box) per category.
  • Enlarging the candidate box by a factor of 1.3 offers a significant boost.

3.2. Iterative Localization

  • B^t_c: the set of N_{c,t} bounding boxes generated at iteration t for class c and image X.
  • At the first iteration (t=1), the input proposals B^0_c are generated by Selective Search.
  • At each iteration t=1,…,T, the boxes B^t_c are updated from B^{t-1}_c. T=2 is normally enough.
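
The loop above can be sketched as below. This is one plausible reading, not the authors' code: score_fn and regress_fn are hypothetical stand-ins for the recognition model and the bounding box regressor, and pairing the score of the input box with its refined box is an assumption of this sketch.

```python
def iterative_localization(proposals, score_fn, regress_fn, num_iters=2):
    """Alternate scoring and box regression for `num_iters` iterations.

    Returns a list of (score, box) pairs collected over all iterations,
    which is the input expected by a later voting step.
    """
    boxes = list(proposals)  # B^0: e.g. Selective Search proposals
    detections = []
    for _ in range(num_iters):
        scores = [score_fn(b) for b in boxes]   # score the current boxes
        boxes = [regress_fn(b) for b in boxes]  # refine them into B^t
        detections.extend(zip(scores, boxes))
    return detections


# Toy stand-ins: the "regressor" moves each box halfway to a fixed target,
# and the "score" is the negative L1 distance to that target.
target = (10, 10, 50, 50)
regress_fn = lambda b: tuple((v + t) / 2.0 for v, t in zip(b, target))
score_fn = lambda b: -sum(abs(v - t) for v, t in zip(b, target))

detections = iterative_localization([(0, 0, 40, 40)], score_fn, regress_fn)
```

In this toy setting the box gets closer to the target at every pass, which mirrors the idea that a second regression step can still improve on the first.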

3.3. Bounding Box Voting

  • After the iterative localization, bounding box voting is performed.
  • After the last iteration T, the candidate detections {D^t_c}, from t=1 to t=T, are merged to form D_c = {(s^t_{i,c}, B^t_{i,c})}, where s is the classification score and B is the corresponding bounding box.
  • First, non-max suppression (NMS) is applied on D_c, using an IoU threshold of 0.3, to produce the detections Y_c.
  • Then a further refinement of the boxes in Y_c, based on weights, is performed.
  • The weight is w = max(0, s), where s is the classification score.
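
The voting procedure can be sketched as follows, assuming that each detection kept by NMS is replaced by the weighted average of the boxes in D_c that overlap it sufficiently. The 0.3 NMS threshold and the weight w = max(0, s) come from the text above; the 0.5 overlap threshold for selecting the voters is an assumption of this sketch, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(dets, thresh=0.3):
    """Greedy NMS over (score, box) pairs, highest score first."""
    kept = []
    for s, b in sorted(dets, key=lambda d: d[0], reverse=True):
        if all(iou(b, kb) <= thresh for _, kb in kept):
            kept.append((s, b))
    return kept


def box_voting(dets, nms_thresh=0.3, vote_thresh=0.5):
    """Refine each NMS survivor by a score-weighted average of its neighbours.

    `vote_thresh` (minimum IoU for a box to vote) is an assumed value.
    """
    voted = []
    for s, b in nms(dets, nms_thresh):
        voters = [(max(0.0, sj), bj) for sj, bj in dets if iou(b, bj) > vote_thresh]
        total = sum(w for w, _ in voters)
        if total > 0:
            b = tuple(sum(w * bj[k] for w, bj in voters) / total for k in range(4))
        voted.append((s, b))
    return voted


dets = [
    (1.0, (0, 0, 10, 10)),
    (1.0, (2, 0, 12, 10)),
    (0.5, (100, 100, 110, 110)),
]
result = box_voting(dets)
```

Here the second box is suppressed by NMS but still votes, so the surviving box moves to the average of the two overlapping boxes. Clamping the weight at zero means that boxes with negative scores do not influence the result.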

3.4. Procedures Summary

Object Localization: Candidates (Blue), Ground Truth (Green), and False Positives (Red)
  • Step 1: Initial box proposals (only the relevant ones are shown).
  • Step 2: After the first CNN bounding box regression.
  • Step 3: After the second CNN bounding box regression.
  • Step 4: Bounding boxes of those at Step 2 plus those at Step 3.
  • Step 5: Bounding boxes after voting.

4. Results

4.1. PASCAL VOC2007

PASCAL VOC2007 Test Set
  • Using the original box alone outperforms using any other single region alone, and also outperforms using the semantic-aware region alone.
PASCAL VOC2007 Test Set
  • Only single original box: 61.7% mAP.
  • MR-CNN: Using multiple regions, 66.2% mAP, which shows the effectiveness of the multi-region representation.
  • MR-CNN & S-CNN: 67.5% mAP.
  • MR-CNN & S-CNN & Loc: 74.9% mAP, which outperforms R-CNN.
PASCAL VOC2007 Test Set
  • Using 0.7 IoU threshold, MR-CNN & S-CNN & Loc still performs the best.
PASCAL VOC2007 Test Set, Trained with Extra Data

4.2. PASCAL VOC2012

PASCAL VOC2012 Test Set
  • Similar to VOC2007, MR-CNN & S-CNN & Loc performs the best with 70.7% mAP.
PASCAL VOC2012 Test Set, Trained with Extra Data

Reference

[2015 ICCV] [MR-CNN & S-CNN]
Object detection via a multi-region & semantic segmentation-aware CNN model

My Previous Reviews

Image Classification
[LeNet] [AlexNet] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [MSDNet]

Object Detection
[OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

Semantic Segmentation
[FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [PSPNet] [DeepLabv3] [DRN]

Biomedical Image Segmentation
[CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net]

Instance Segmentation
[SDS] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution
[SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [SRDenseNet]

Human Pose Estimation
[Tompson NIPS’14]