Faster R-CNN : Object Detection

Introduction

Object detection involves the classification of the objects in an image along with their localization. Localization is achieved by drawing a bounding box around each object of interest. The most popular methods for detecting objects employ either the R-CNN family or the YOLO architecture.

The original R-CNN has evolved over the years into Fast R-CNN and then into the Faster R-CNN architecture, each an improvement over its predecessor in terms of runtime speed and detection performance. The goal of this article is to cover the working and the main components of the Faster R-CNN architecture.

The Faster R-CNN paper was first published in 2015, and the architecture was devised as an improvement over its predecessor, Fast R-CNN. Fast R-CNN uses the Selective Search algorithm to propose the regions where an object could be found. In Faster R-CNN, the proposals are instead generated as part of the convolutional network itself, which improves both the efficiency and the speed of the overall object detection task.

The main components of Faster R-CNN are the Region Proposal Network and ROI pooling, coupled with a classifier and a regressor head to obtain the predicted class labels and locations.

We will now explore each of the components of the Faster R-CNN.

Figure 1 : Faster R-CNN Architecture

Anchors

Anchors are potential bounding box candidates within which an object can be detected. They are predefined before training starts, based on a combination of aspect ratios and scales, and are placed throughout the image. Faster R-CNN uses 3 aspect ratios and 3 scales, generating 3 * 3 = 9 combinations.

Let's consider an aspect ratio of 0.5 and scales of [8, 16, 32]. After passing the image through the conv-pool layers of a VGG network, the final downsampled stride is 16. The resulting combinations would then be:

Figure 2 : Anchors

Here we get 3 rectangles whose width is larger than their height. Similarly, a ratio of 1 gives 3 variations of a square, and a ratio of 2 gives 3 variations of a rectangle whose height is larger than its width. Together these form the base anchors, with a total of 9 variations.
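
A minimal NumPy sketch of this base-anchor generation, assuming a stride of 16 and the ratios/scales above (the function and variable names are illustrative, not taken from the repository linked at the end):

```python
import numpy as np

def generate_base_anchors(stride=16, ratios=(0.5, 1, 2), scales=(8, 16, 32)):
    """Return the 9 base anchors as (x1, y1, x0, y0), centred on one stride cell."""
    cx = cy = stride / 2.0
    anchors = []
    for ratio in ratios:               # ratio is interpreted as height / width
        for scale in scales:
            w = stride * scale * np.sqrt(1.0 / ratio)
            h = stride * scale * np.sqrt(ratio)
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)           # shape (9, 4)

print(generate_base_anchors().shape)   # (9, 4)
```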

Now that we have obtained the base anchor, the next step would be to generate all the possible anchors for the image.

If after passing the image through the VGG network we get a feature map of height H and width W with a stride of F, then the original image has dimensions (H x F, W x F).

The base anchor centre can be placed at every one of these stride locations: [(0, 0), (F, 0), (0, F), (F, F), ..., (H x F, W x F)]. This results in a total of H x W x 9 image anchors.

Each anchor is denoted by (x1, y1, x0, y0), where (x1, y1) is the top-left corner and (x0, y0) is the bottom-right corner of the box.
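
A rough NumPy sketch of how all H x W x 9 image anchors could be generated by shifting the base anchors over the feature-map grid (again, the names here are illustrative):

```python
def generate_image_anchors(base_anchors, feat_h, feat_w, stride=16):
    """Shift the 9 base anchors over every feature-map cell -> (H * W * 9, 4)."""
    shift_x = np.arange(feat_w) * stride
    shift_y = np.arange(feat_h) * stride
    sx, sy = np.meshgrid(shift_x, shift_y)
    shifts = np.stack([sx.ravel(), sy.ravel(), sx.ravel(), sy.ravel()], axis=1)
    # Broadcast: (H*W, 1, 4) + (1, 9, 4) -> (H*W, 9, 4)
    anchors = shifts[:, None, :] + base_anchors[None, :, :]
    return anchors.reshape(-1, 4)

all_anchors = generate_image_anchors(generate_base_anchors(), feat_h=50, feat_w=50)
print(all_anchors.shape)   # (22500, 4) == 50 * 50 * 9
```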

Region Proposal Network (RPN)

The RPN is responsible for returning a good set of proposals from the image anchors generated in the previous step. The RPN consumes the feature map produced by the image feature extractor and passes it through two parallel convolution layers to give two sets of outputs.

The first convolution layer gives the bounding box regression outputs. The purpose of these outputs is not to pinpoint the bounding box locations directly, but rather to predict the offsets and scales that are applied to the image anchors to refine the predictions.

The second convolution layer gives a classification output indicating the probability of each bounding box being foreground or background. Since there are a lot of image anchors, there needs to be a method to select the most probable boxes where an object can be detected and to discard the rest. Leveraging the fact that many anchors overlap with one another, Non-Maximum Suppression is used to achieve this.
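
A minimal PyTorch sketch of this two-branch structure, assuming a VGG feature map with 512 channels and 9 anchors per location (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Shared 3x3 conv followed by two parallel 1x1 conv branches."""
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        # 4 offsets/scales per anchor and 2 scores (background/foreground) per anchor
        self.bbox_reg = nn.Conv2d(512, num_anchors * 4, kernel_size=1)
        self.cls_score = nn.Conv2d(512, num_anchors * 2, kernel_size=1)

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.bbox_reg(x), self.cls_score(x)

# Feature map from the VGG backbone: (batch, 512, H, W)
deltas, scores = RPNHead()(torch.randn(1, 512, 50, 50))
print(deltas.shape, scores.shape)   # (1, 36, 50, 50) (1, 18, 50, 50)
```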

Non-Maximum Suppression

Non-Maximum Suppression (NMS) is applied over the predicted bounding boxes, using their predicted scores as the criterion for filtering. Before being passed to NMS, the Regions of Interest (ROIs) are preprocessed by clipping them to the image boundaries and removing those whose height or width falls below a minimum threshold. The remaining ROIs are sorted by their confidence scores and only the top ones are kept.

NMS works by taking the highest-scoring ROI and comparing it with every other remaining ROI. If the IoU of a comparison is greater than a predefined threshold, the latter ROI is removed from the list. This ensures that not too many redundant boxes crowd the image. The process then repeats with the next remaining ROI until there are no more boxes to compare.
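
A plain NumPy sketch of this greedy procedure, assuming boxes in (x1, y1, x0, y0) form; the IoU threshold of 0.7 is only an illustrative value:

```python
def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: boxes is (N, 4), scores is (N,); returns the kept indices."""
    order = scores.argsort()[::-1]            # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the kept box and all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Drop every box that overlaps the kept box more than the threshold
        order = order[1:][iou <= iou_threshold]
    return keep
```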

Only the top ROIs from the final list are passed on to the next step.

ROI Pooling and VGG-Head

ROI pooling is applied next over these selected ROIs to produce fixed-size feature maps. ROI pooling splits each ROI's region of the feature map into a fixed grid of cells and applies max-pooling within each cell. This ensures that the same feature map can be reused for all the proposals, so the image only passes through the backbone once before the pooled features are fed to the VGG-Head in the next step.

The VGG-Head predicts, for each ROI, the bounding box offsets/scales to be applied for further refinement, together with scores indicating the probability of each of the predefined class labels.
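
A rough PyTorch sketch of ROI pooling followed by a small fully connected head. It uses torchvision's roi_pool as a stand-in for a hand-written pooling layer; the channel sizes and the 21 classes (20 + background) are illustrative assumptions:

```python
import torch.nn as nn
from torchvision.ops import roi_pool

class DetectionHead(nn.Module):
    """Pools each ROI to 7x7 and predicts class scores and box deltas."""
    def __init__(self, in_channels=512, num_classes=21):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_channels * 7 * 7, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        self.cls_score = nn.Linear(4096, num_classes)
        self.bbox_reg = nn.Linear(4096, num_classes * 4)

    def forward(self, feature_map, rois):
        # rois: (N, 5) as (batch_index, x1, y1, x0, y0) in image coordinates
        pooled = roi_pool(feature_map, rois, output_size=(7, 7),
                          spatial_scale=1.0 / 16)   # 16 = stride of the VGG backbone
        x = self.fc(pooled.flatten(start_dim=1))
        return self.cls_score(x), self.bbox_reg(x)
```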

Losses

There are four losses in total for training the model. These four losses are obtained from two stages: the RPN and the ROI Pool/Head layer.

Faster R-CNN uses the Smooth L1 loss for the regressors and the Cross-Entropy loss for the classifiers. These four losses are summed and backpropagated to train the overall model.
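
A hedged sketch of how the four losses might be combined, using PyTorch's built-in smooth_l1_loss and cross_entropy. It assumes the ROI box deltas have already been gathered for each ROI's ground truth class and that anchors labelled -1 are masked out; exact weighting and sampling details differ between implementations:

```python
import torch.nn.functional as F

def faster_rcnn_loss(rpn_deltas, rpn_scores, rpn_gt_deltas, rpn_labels,
                     roi_deltas, roi_scores, roi_gt_deltas, roi_labels):
    """Sum of the two RPN losses and the two ROI-head losses."""
    # RPN classification on anchors labelled 0/1, ignoring anchors labelled -1
    valid = rpn_labels >= 0
    rpn_cls = F.cross_entropy(rpn_scores[valid], rpn_labels[valid])
    # RPN regression only on positive (foreground) anchors
    pos = rpn_labels == 1
    rpn_reg = F.smooth_l1_loss(rpn_deltas[pos], rpn_gt_deltas[pos])
    # ROI head: classification over all classes, regression on foreground ROIs
    roi_cls = F.cross_entropy(roi_scores, roi_labels)
    fg = roi_labels > 0
    roi_reg = F.smooth_l1_loss(roi_deltas[fg], roi_gt_deltas[fg])
    return rpn_cls + rpn_reg + roi_cls + roi_reg
```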

RPN

The IoU between each ground truth bounding box (GTBb) and each image anchor is calculated.

  1. Bounding Boxes Offsets/Scales:-

The predicted offsets/scales are decoded with the image anchors to get the final bounding boxes. Similarly, each image anchor is encoded with its maximum-IoU GTBb to obtain the ground truth offsets/scales.
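
The encoding and decoding follow the standard box parametrization used by Faster R-CNN (reference 2 below walks through it); a NumPy sketch:

```python
def encode(anchors, gt_boxes):
    """Offsets/scales (tx, ty, tw, th) that map each anchor onto its GT box."""
    aw, ah = anchors[:, 2] - anchors[:, 0], anchors[:, 3] - anchors[:, 1]
    ax, ay = anchors[:, 0] + aw / 2, anchors[:, 1] + ah / 2
    gw, gh = gt_boxes[:, 2] - gt_boxes[:, 0], gt_boxes[:, 3] - gt_boxes[:, 1]
    gx, gy = gt_boxes[:, 0] + gw / 2, gt_boxes[:, 1] + gh / 2
    return np.stack([(gx - ax) / aw, (gy - ay) / ah,
                     np.log(gw / aw), np.log(gh / ah)], axis=1)

def decode(anchors, deltas):
    """Apply predicted (tx, ty, tw, th) to the anchors to get refined boxes."""
    aw, ah = anchors[:, 2] - anchors[:, 0], anchors[:, 3] - anchors[:, 1]
    ax, ay = anchors[:, 0] + aw / 2, anchors[:, 1] + ah / 2
    cx, cy = ax + deltas[:, 0] * aw, ay + deltas[:, 1] * ah
    w, h = aw * np.exp(deltas[:, 2]), ah * np.exp(deltas[:, 3])
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
```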

2. Bounding Boxes Confidence Scores:-

The predicted confidence scores indicate the presence of an object, i.e. whether the anchor covers foreground or background. For the ground truth labels, anchors whose maximum IoU is above a positive threshold are marked with label 1, and those below a negative threshold with label 0. The remaining anchors are marked as -1, indicating that they are ignored when calculating the loss.
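
A small sketch of this labelling step, assuming an IoU matrix between anchors and ground truth boxes; 0.7 and 0.3 are typical threshold values:

```python
def label_anchors(ious, pos_thresh=0.7, neg_thresh=0.3):
    """ious: (num_anchors, num_gt) IoU matrix -> label of 1 / 0 / -1 per anchor."""
    max_iou = ious.max(axis=1)
    labels = np.full(ious.shape[0], -1)      # -1: ignored in the loss
    labels[max_iou < neg_thresh] = 0         # background
    labels[max_iou >= pos_thresh] = 1        # foreground
    labels[ious.argmax(axis=0)] = 1          # best anchor for each GT box is positive
    return labels
```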

ROI

The IoU between each ground truth bounding box (GTBb) and each selected ROI is calculated.

  1. Bounding Boxes Offsets/Scales:-

The VGG-Head predicts the offsets/scales for the selected ROIs. To obtain the ground truth offsets/scales, each selected ROI is encoded with its corresponding maximum-IoU GTBb.

  2. Bounding Boxes Labels:-

The VGG-Head predicts the labels for the selected ROIs. Each selected ROI has to be assigned a ground truth class label for calculating the loss. Each GTBb has already been assigned a label, so for each selected ROI its maximum-IoU GTBb is found and that GTBb's label is assigned to the ROI.
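
A sketch of this assignment step, assuming class index 0 is the background and a foreground IoU threshold of 0.5 (illustrative values):

```python
def assign_roi_labels(ious, gt_labels, fg_thresh=0.5):
    """ious: (num_rois, num_gt) IoU matrix; gt_labels: (num_gt,) class indices."""
    best_gt = ious.argmax(axis=1)              # maximum-IoU GT box for each ROI
    labels = gt_labels[best_gt].copy()         # inherit that GT box's class label
    labels[ious.max(axis=1) < fg_thresh] = 0   # low-overlap ROIs become background
    return best_gt, labels
```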

Conclusion

This concludes the article, which has highlighted the key components of the Faster R-CNN architecture. Going through the code at https://github.com/Sai-Venky/FasterRCNN next would provide a solid foundation on the overall implementation of this architecture.

Happy reading…

References

  1. https://www.alegion.com/faster-r-cnn
  2. https://leimao.github.io/blog/Bounding-Box-Encoding-Decoding/
  3. https://medium.com/@smallfishbigsea/faster-r-cnn-explained-864d4fb7e3f8