Deep Learning for Object Detection: From the start to the state-of-the-art (1/2)

Object Detection

Deep Learning has taken over nearly every domain of Computer Vision. It’s pretty much become standard to go straight to a deep Convolutional Neural Network (CNN) for most Computer Vision tasks. One task that CNNs have done particularly well on is object detection. The goal of object detection is to draw a bounding box (i.e., a rectangle) around each important object in an image and then classify that object (is it a dog, a car, a person, etc.).

In this blog post we’re going to review the most important CNNs that have been successfully applied to object detection, from the start to the state-of-the-art. In Part 1, we’ll start off by reviewing the advancements of R-CNN, Fast R-CNN, and Faster R-CNN, since those models developed many of the core ideas used in object detection CNNs. In Part 2, we’ll take a look at the current state-of-the-art models in object detection: Faster R-CNN, Single Shot Detector (SSD), and R-FCN.


R-CNN

R-CNN was the first notable CNN-based model to be trained for object detection. Shortly before its publication, AlexNet had shown that deep CNNs could be applied very successfully to large-scale image classification. If you break things down, object detection really has two main steps: (1) detect objects and draw bounding boxes around them, and (2) classify the individual object in each box. Step 2 of object detection is thus just image classification!

The authors of R-CNN proposed to use a state-of-the-art image classification network (AlexNet) to extract features from the objects in the boxes. These features are distinctive for each kind of object and thus very useful for telling objects apart during classification. Since each object already has a bounding box around it from the detection stage, all you have to do is crop those sub-images out, resize them to the network’s fixed input size, and pass them to AlexNet for feature extraction. A Support Vector Machine (SVM) model can then be trained to perform the classification. Easy-peasy!
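To make the crop step concrete, here is a minimal sketch of cutting proposal boxes out of an image and resizing each crop to one fixed size. It is only an illustration: the `crop_and_warp` helper, the toy image, the proposal boxes, and the nearest-neighbour resize are all assumptions made for this example, not the authors’ code.

```python
def crop_and_warp(image, box, out_size):
    """Crop `box` from `image` and resize the crop to out_size x out_size.

    image: 2D list of pixel values (a stand-in for a real image array).
    box: (x1, y1, x2, y2) in pixel coordinates, with x2 and y2 exclusive.
    Nearest-neighbour sampling is an illustrative choice, not R-CNN's exact warp.
    """
    x1, y1, x2, y2 = box
    crop_h, crop_w = y2 - y1, x2 - x1
    warped = []
    for row in range(out_size):
        # Map each output row/column back to a source pixel inside the box.
        src_y = y1 + (row * crop_h) // out_size
        warped.append([
            image[src_y][x1 + (col * crop_w) // out_size]
            for col in range(out_size)
        ])
    return warped


# Toy 6x6 "image" whose pixel value encodes its position (10 * row + col).
image = [[10 * r + c for c in range(6)] for r in range(6)]
proposals = [(0, 0, 4, 4), (2, 1, 6, 5)]  # hypothetical proposal boxes

# Every proposal becomes a patch of the same size, ready for a feature extractor.
patches = [crop_and_warp(image, box, out_size=2) for box in proposals]
print(patches)
```

In a real pipeline each patch would then go through the CNN on its own, which is exactly the cost that the later models set out to remove.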

R-CNN Pipeline

The steps of R-CNN are:

  1. Generate a set of proposals for bounding boxes. A technique called Selective Search is used to propose a bunch of bounding boxes (usually around 2000) that are likely to have objects in them.
  2. Run the images in the bounding boxes through a pre-trained AlexNet to extract image features that can be used for classification.
  3. Train an SVM to classify the objects using the features extracted by the classification CNN.
  4. Finally, once the object has been classified, run the bounding box through a linear regression model to output tighter coordinates for the box. The classification of the object is used as extra information to tighten the bounding box.
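Step 4 can be sketched as follows. The regressor predicts four offsets for a box, and those offsets shift the box centre and rescale its width and height. The centre/size parameterisation below follows the one commonly described for R-CNN-style regressors; the function name `apply_box_deltas` and the example numbers are made up for illustration, and the regressor that would produce the offsets is not shown.

```python
import math


def apply_box_deltas(box, deltas):
    """Refine a proposal box with predicted offsets (dx, dy, dw, dh).

    box: (x1, y1, x2, y2) corner coordinates.
    dx, dy shift the centre as a fraction of the box width/height;
    dw, dh rescale the width/height in log space.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h

    dx, dy, dw, dh = deltas
    new_cx = cx + dx * w
    new_cy = cy + dy * h
    new_w = w * math.exp(dw)
    new_h = h * math.exp(dh)

    # Convert back from centre/size to corner coordinates.
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


proposal = (10.0, 20.0, 50.0, 100.0)
# Hypothetical regressor output: nudge right by 10% of the width, halve the width.
refined = apply_box_deltas(proposal, (0.1, 0.0, math.log(0.5), 0.0))
print(refined)  # approximately (24.0, 20.0, 44.0, 100.0)
```

All-zero offsets leave the proposal unchanged, so the regressor only has to learn small corrections.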

Fast R-CNN

R-CNN was a massive breakthrough in object detection, but it does come with a couple of drawbacks:

  • It requires a forward pass of the CNN (AlexNet) for every single bounding box proposal for every single image (that’s around 2000 forward passes per image!).
  • It has to train three different models separately: the CNN that generates image features (AlexNet), the classifier that predicts the class (SVM), and the regression model that tightens the bounding boxes. This makes the pipeline extremely hard to train.

To speed things up, Fast R-CNN builds on a key insight. In R-CNN we run our image feature extraction network around 2000 times per image (i.e., once for every bounding box proposal). But we are really running that feature extraction CNN on the same image each time, just on different parts of it. So why not run our CNN only once for the entire image and then just crop out the features we need according to the bounding box proposals? That’s exactly what Fast R-CNN did with the following steps:

Fast R-CNN Steps
  1. Generate a set of proposals for bounding boxes. A technique called Selective Search is used to propose a bunch of bounding boxes (usually around 2000) that are likely to have objects in them.
  2. Get image features by running the input image through our pre-trained classification CNN. This is done only once for the current input image.
  3. The CNN features for each object are obtained by selecting the region of the CNN’s extracted feature map that corresponds to its bounding box. This is done using RoI (Region of Interest) pooling: each box proposed by Selective Search in step 1 is projected onto the feature map, split into a fixed grid of cells, and the features in each cell are pooled (usually using max pooling). Every proposal therefore yields a feature of the same size, whatever the size of its box. So all it takes us is one pass of the original image through the CNN as opposed to ~2000!
  4. The second insight of Fast R-CNN is to jointly train the CNN, classifier, and bounding box regressor in a single model. Earlier we had different models to extract image features (CNN), classify (SVM), and tighten bounding boxes (regressor). Fast R-CNN instead uses a single network to compute all three and uses the same image features for all three tasks.
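The RoI pooling in step 3 can be sketched in a few lines. This is a simplified, single-channel version written for illustration: the function name `roi_max_pool` and the toy feature map are assumptions, the RoI is assumed to be given in feature-map coordinates already, and real implementations handle many channels and fractional coordinates.

```python
def roi_max_pool(feature_map, roi, output_size):
    """Max-pool the features inside `roi` into a fixed output_size grid.

    feature_map: 2D list (height x width) for a single channel.
    roi: (x1, y1, x2, y2) in feature-map coordinates, x2 and y2 exclusive.
    output_size: (out_h, out_w) of the pooled result.
    """
    x1, y1, x2, y2 = roi
    out_h, out_w = output_size
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Floor the start and ceil the end of each bin so no bin is empty.
        ys = y1 + (i * roi_h) // out_h
        ye = y1 + -(-((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            xs = x1 + (j * roi_w) // out_w
            xe = x1 + -(-((j + 1) * roi_w) // out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Toy 4x6 feature map, computed once for the whole image.
feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]

# Two proposals of different sizes both come out as 2x2 features.
pooled_a = roi_max_pool(feature_map, (1, 0, 5, 4), (2, 2))
pooled_b = roi_max_pool(feature_map, (0, 0, 6, 3), (2, 2))
print(pooled_a)  # [[9, 11], [21, 23]]
print(pooled_b)  # [[9, 12], [15, 18]]
```

Because every proposal ends up with the same output shape, the pooled features can be fed straight into the shared classification and box regression layers.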

Faster R-CNN

Fast R-CNN runs the feature extraction CNN once per image instead of once per bounding box proposal, saving massively on computation. Once that part was taken care of, a new bottleneck could be seen: the bounding box proposal step. Selective Search, which usually computes around 2000 bounding box proposals per image, was now the slowest step in the pipeline.

The insight of Faster R-CNN was that bounding box proposals depend on features of the image that have already been calculated by the feature extraction CNN. So why not reuse those same CNN results for bounding box proposals instead of running a separate Selective Search algorithm? By sharing most of the computation between the two major steps (detection and classification), Faster R-CNN leverages the CNN to do all of the heavy lifting.

Faster R-CNN Pipeline
  1. A single CNN is used to carry out both bounding box proposals and classification. This way only one CNN needs to be trained and we get region proposals almost for free! The authors write: “Our observation is that the convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals [thus enabling nearly cost-free region proposals].”
  2. Faster R-CNN adds a small fully convolutional network on top of the features of the CNN, creating what’s known as the Region Proposal Network (RPN). At each location of the feature map, the RPN predicts how likely it is that a box there contains an object and how that box should be adjusted.
  3. The entire system can thus be trained end-to-end optimizing all of the sub-tasks together.
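One detail that the steps above leave out is what the Region Proposal Network actually scores. In the Faster R-CNN paper it is a fixed set of reference boxes, called anchors, placed at every feature-map location in several sizes and aspect ratios. The sketch below only generates such anchors; the function name `generate_anchors`, the stride of 16, and the tiny 2x3 feature map are assumptions chosen for illustration, and the network that scores and refines the anchors is not shown.

```python
import math


def generate_anchors(feat_h, feat_w, stride, scales, ratios):
    """Return (x1, y1, x2, y2) anchors centred on every feature-map cell.

    stride: assumed number of image pixels per feature-map cell.
    scales: anchor side lengths (in pixels) for a square anchor.
    ratios: height / width aspect ratios; the area stays roughly scale ** 2.
    """
    anchors = []
    for fy in range(feat_h):
        for fx in range(feat_w):
            # Centre of this feature-map cell in image coordinates.
            cx = (fx + 0.5) * stride
            cy = (fy + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


# 3 scales x 3 ratios = 9 anchors per location.
anchors = generate_anchors(feat_h=2, feat_w=3, stride=16,
                           scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0))
print(len(anchors))  # 54 anchors for a 2x3 feature map
print(anchors[1])    # square 128-pixel anchor centred on the first cell
```

For each anchor the RPN would output an objectness score and four box offsets, and the highest-scoring refined anchors take over the role that the Selective Search proposals played in Fast R-CNN.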


That concludes our review of the R-CNN models and Part 1 of this post. Check out Part 2 where we review the current state-of-the-art models in object detection!

Deep Learning for Object Detection: From the start to the state-of-the-art (1/2) was originally published in Hacker Noon on Medium.

Source: Deep Learning on Medium