Original article was published by Uri Almog on Deep Learning on Medium
Object Detection With Deep Learning: RCNN, Anchors, Non-Maximum-Suppression
Object detection in images is a common task in computer vision. Its use cases vary from missile guidance to automated production lines, console games and cleaning robots. Object detection algorithms in computer vision have been around long before their migration to deep learning:
In the 2001 paper Rapid Object Detection Using a Boosted Cascade of Simple Features by Viola and Jones, they present a method to slide rectangular hand-crafted filters over the image to extract features correlated with facial features. The extracted features are fed to a classifier which learns which features are more significant than others, but does not modify the filters themselves. In fact, this task is termed ‘image classification’ in today’s jargon, because it lacks the localization prediction of the object in the image, but it demonstrates how this process is done, and in principle — you could split the original image into many patches and run the proposed classifier on each one of them and in that manner localize the object (but, you would have to use many assumptions on the size of objects in images).
Before we dive into the first implementations of object detection with deep learning, let’s formulate the problem we’re trying to solve:
Given an image with an object contained in a box GT (short for Ground Truth) we would like our model to predict a bounding box BB such that:
IoU(BB, GT) > threshold
That task is called localization. Most detectors are also required to predict the class of the object in the bounding box. That task is called classification.
How are these tasks performed jointly?
It is assumed you are familiar with the basic concepts of object classification, convolution layers and training, so we will only briefly state the main attributes of classification networks:
The input to a classification network is assumed to contain a single object. The classifier, except its last two layers, is comprised of a structure called Feature Extractor (F.E.). There are many variations for the F.E. architecture (e.g. ResNet, DenseNet, Mobilenet etc.) but basically a F.E. is a sequence of convolution layers with activation functions producing multiple feature maps that flow down the network, and occasionally a spatial compression operation called maxpooling (for the purpose of keeping the amount of computation reasonable during the feature expansion). During training the F.E. learns to represent significant attributes in different feature maps. While the first layers are limited to representing only simple attributes like borders and simple shapes, the network learns to combine them in complex ways so that the final features of the F.E. can represent high-level attributes such as ‘this area has metallic texture’, ‘this area has wheels’, ‘this area has pointy ears’, ‘this area has furry texture’ etc.
The output of the feature extractor still preserves some spatial information, although at low resolution. The final segment of the classifier takes the final feature maps and replaces each map with its spatial-average value (an operation called global average-pool). This operation destroys the spatial information and gives the average intensity of each attribute in the original image, regardless of location. The last layer is a fully-connected (F.C.) layer that learns the best coefficients for linearly combining the attribute intensities in a way that predicts the object class. For example, the coefficients the classifier will learn for combining the ‘metallic texture’ and ‘this area has wheels’ features will be such that a strong intensity in both features will result in the class prediction: ‘car’.
One of the earliest models that addressed object detection using deep learning is the RCNN. Introduced in 2014 by Girshick et al., RCNN (Regions with CNN features) divides this task into several modules, each in charge of a simpler sub-task:
- The first module, which is non-neural, is a Selective Search algorithm (presented by Uijlings et al. in 2012). This non-trainable algorithm scans the image and generates crops which are likely to contain an object, based on multiple hand-crafted attributes. The algorithm assigns each of these Regions of Interest (RoI’s) with an “objectness score”, corresponding to the confidence that this patch indeed contains an object with a high enough IoU. These RoIs already localize the object by definition.
- A feature extractor (which was harvested from an already trained classification network) is then fed with the resized RoI patches to generate classification information for each patch separately.
- A classification head predicts the class probabilities for the object in the RoI.
At the time of its publication, RCNN was state-of-the-art in object detection metrics and scored higher than any existing (non-CNN-based) method for object detection. However, it had several drawbacks:
- The region proposal mechanism was non-trainable.
- The selective search produced thousands of RoI’s with high overlapping for each image. This means that each pixel passed through the feature extractor performing the same computations thousands of times per image. The result was a very long processing time: a single image took 47 seconds to process.
Fast-RCNN also starts with a non-trainable algorithm that generates proposals for objects. But the main achievement is that the image only passes once through the feature extractor. The patches corresponding to the RoI’s are cropped from the F.E. output and fed into two parallel branches: the first one is a classifier head, containing several convolutions followed by a global average pool and a fully connected layer, and the second branch is a small network trained with a regression loss — that predicts the scale and center of the object contained in that RoI. In a sense — this branch finetunes the coordinates proposed by the original selective search. The clever way to avoid multiple passes through the F.E. gained a speedup of x150 (0.32 seconds per image) compared with RCNN. Note that there is still some inefficiency, since the overlapping pixels in the F.E. output are processed from scratch for each RoI, but it’s nevertheless a major improvement.
We won’t elaborate on loss mechanisms, but I would like to note that the box regression process is philosophically different from classification, in at least one sense: While in classification training we feed the model with 1-hot GT vectors and it learns to reproduce this vector as best it can, in the box regression task, where the object size is, let’s say, (30, 55) pixels, the values (30, 55) appear nowhere in the ground truth, and yet the network learns to produce them in order to achieve a high IoU with the ground truth object.
At the heart of the Faster-RCNN is the understanding that the representation power of the feature extractor is strong enough, so that an exterior RoI generator is not required. The entire image is fed once to a feature extractor and its output diverges to a classification head and a Region Proposal Network (RPN) which generates proposals and objectness scores. This time the box regression head receives the F.E. output for the entire image, and not crops corresponding to RoI’s. A non-neural code loops over the proposals — for each proposal the F.E. output corresponding with that patch is sent to a classifier.
Comparing results of different architectures is a tricky business: Many test sets have been developed and updated over the years for detection challenges, and different researchers may quote their model’s results on different datasets. We can fairly compare models by using an up-to-date framework that supports the training and testing of various models. Once such framework is GluonCV (developed by Amazon). In their Detection Model Zoo, are many models that were trained and tested in comparable conditions.
As one can see, Faster RCNN models (with varying feature extractors) achieves the highest mean Average Precision. However, these architectures are extremely slow. Other architectures, like YOLO and CenterNet also achieve high accuracy but are much faster. The SSD model family (using VGG and Mobilenet feature extractors) are superseded by these model.
Now we are ready to discuss the mechanism that actually performs the bounding box prediction. The most commonly used method uses anchors, which are priors, or predefined basic box predictions.
Each output cell in the box-regression head of a detector is assigned with several (usually 3 or 5) anchors differing in shape. An anchor is a (W, H) box (W, H are different for the 5 anchors) centered at the center of the output cell.
The model learns the necessary transformations (offset from pixel center, width and height scaling relative to anchor) for those anchors in order to capture objects successfully.
The exact sizes and aspect ratios of the anchors are hyper parameters of the model, and can be determined after a statistical research of the object shape distribution in the training data (via a k-means analysis, for example).
During training the detector learns that several features that are spatially close to each other should result in a box that encompasses them (rewarding the network with a low loss). Several anchor geometries are used simultaneously to help the network capture objects with different aspect ratios, e.g. a short and wide anchor will likely learn to capture a car more easily than a toll and slim anchor.
Using N anchors enables the network to predict up to N objects centered in the same pixel (e.g. a cat napping on a couch) without having to compromise and predict a cat-girl class mix.
One drawback of using anchors is that each object gives rise to multiple bounding boxes from which we need to pick the best-fitting one.
An even more severe drawback is that since each spatial output cell is forced to predict bounding boxes even if the confidence is extremely low, we end up with hundreds or thousands of bounding boxes per image, by design.
For example — Assume an 80-class detector with input shape 224×224 and output shape 7×7 cells, each having 3 anchors. then each image will result in 7x7x3x80=11,760 bounding boxes (there is a bounding box copy per class. See the next section for the reason).
The commonly used method to sift through the bounding boxes and pick only the interesting ones is called NMS.
NMS — Non-Maximum Suppression
This process is considered a part of the post processing.
NMS is a greedy algorithm that loops over all the classes, and for each class it checks for overlaps (IoU — Intersection over Union) between all the bounding boxes. If the IoU between two boxes of the same class is above a certain threshold (usually 0.7), the algorithm concludes that they refer to the same object, and discards the box with the lower confidence score (which is a product of the objectness score and the conditional class probability).
The loop is performed over the classes, since we want to keep the ability to detect classes that aren’t with the highest score, in case the highest-scoring class copy will be discarded due to its high overlap with another box.
NMS is very expensive computationally: If there are n boxes then the algorithm performs O(n²) comparisons and quickly becomes a serious bottleneck, especially if the input image is large and the processor is weak.
Several ways have been invented to mitigate this NMS bottleneck. One is to sift through the boxes and remove the ones with very low confidence level even before the NMS. This helps get rid of about 90% of the boxes on VOC-Pascal dataset, for example.
Another approach is to abandon the anchor methodology and use a prediction architecture that inherently generates less proposals. CenterNet is such an architecture (which I’m personally fond of) and I will elaborate on it in a future post.