Deep Learning for Object Detection: From the start to the state-of-the-art (2/2)

Welcome to Part 2 of Deep Learning for Object Detection: From the start to the state-of-the-art! In Part 1 we reviewed the R-CNN based models, since they contain the core ideas behind object detection CNNs. Be sure to read that first for the background to this post.

There are a few main deep CNN architectures for object detection on which other state-of-the-art techniques are built. We're going to review them in detail here and compare and contrast their tradeoffs. They are Faster R-CNN, Single Shot Detector (SSD), and R-FCN.

Faster R-CNN

Faster R-CNN tends to be the most accurate, with the drawback of being a bit slower than the others. It works by first running the input image through a classification CNN (such as VGG, ResNet, etc.) to extract a map of high-level features. A Region Proposal Network then uses these features to generate object proposals by regressing toward the ground-truth bounding boxes. Using the proposed box coordinates, it crops the corresponding region of the feature map, pools that region to a fixed size, and passes it through fully connected layers to classify the object!
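
To make the two-stage flow concrete, here is a minimal PyTorch sketch (not the official implementation, and assuming a recent torchvision): a backbone CNN extracts a feature map, a couple of hard-coded proposals stand in for the Region Proposal Network's output, RoIAlign crops a fixed-size feature patch per proposal, and a small fully connected head classifies each crop. All layer sizes and box coordinates are illustrative assumptions.

```python
# Illustrative two-stage sketch of the Faster R-CNN idea (torchvision >= 0.13).
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_align

# Backbone: any classification CNN trimmed down to its convolutional feature maps.
backbone = torchvision.models.resnet18(weights=None)
backbone = nn.Sequential(*list(backbone.children())[:-2])   # drop avgpool + fc

image = torch.randn(1, 3, 512, 512)            # one dummy input image
features = backbone(image)                     # (1, 512, 16, 16) feature map

# In the real model a Region Proposal Network predicts these boxes by regressing
# anchor offsets; here two hand-made proposals (x1, y1, x2, y2) stand in for it.
proposals = [torch.tensor([[ 32.,  32., 256., 256.],
                           [128., 128., 480., 400.]])]

# Stage 2: crop a fixed-size feature patch per proposal, then classify each crop.
spatial_scale = features.shape[-1] / image.shape[-1]        # image -> feature coords
rois = roi_align(features, proposals, output_size=(7, 7),
                 spatial_scale=spatial_scale)                # (2, 512, 7, 7)

num_classes = 21                                             # e.g. 20 classes + background
head = nn.Sequential(nn.Flatten(), nn.Linear(512 * 7 * 7, 1024),
                     nn.ReLU(), nn.Linear(1024, num_classes))
class_scores = head(rois)                                    # (2, num_classes)
print(class_scores.shape)
```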

Single Shot Detector (SSD)

Faster R-CNN above did the detection and classification in two stages: it first regressed and predicted the positions of the bounding boxes, then passed those boxes to a separate box classifier to label the objects. The Single Shot Detector takes a slightly different approach by doing this all in one stage. Given the features extracted by the classification CNN, SSD directly predicts both the bounding boxes and their class labels (rather than first predicting boxes and then passing them to a separate part for classification). It tends to be slightly faster than Faster R-CNN while having slightly lower accuracy.
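
Here is a rough PyTorch sketch of the single-stage idea, with made-up layer sizes and anchor counts: a single convolutional head slides over the feature map and, for every anchor at every location, emits class scores and four box offsets in the same forward pass, so no separate second classification stage is needed.

```python
# Illustrative single-stage (SSD-style) prediction head, not the full SSD.
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet18(weights=None)
backbone = nn.Sequential(*list(backbone.children())[:-2])   # keep conv features only

num_classes = 21        # e.g. 20 classes + background
num_anchors = 6         # default boxes per feature-map cell (an assumption)

# One 3x3 conv predicts, for each anchor: num_classes scores + 4 box offsets.
head = nn.Conv2d(512, num_anchors * (num_classes + 4), kernel_size=3, padding=1)

image = torch.randn(1, 3, 512, 512)
features = backbone(image)                 # (1, 512, 16, 16)
preds = head(features)                     # (1, 6*(21+4), 16, 16)

# Reshape to (batch, cells*anchors, num_classes + 4): every anchor gets a
# classification and a box regression from the same single forward pass.
b, _, h, w = preds.shape
preds = preds.permute(0, 2, 3, 1).reshape(b, h * w * num_anchors, num_classes + 4)
print(preds.shape)                         # torch.Size([1, 1536, 25])
```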

Region-based Fully Convolutional Networks (R-FCN)

With Faster R-CNN, the proposal generator often produces hundreds of bounding boxes, and every one of them is passed through fully connected layers for classification. R-FCN instead pushes the cropping all the way to the last layer of its fully convolutional network: each box is cropped from shared, position-sensitive score maps right before the softmax classification. Pushing the cropping to the last layer minimizes the amount of per-region computation that must be done. R-FCN tends to be faster than Faster R-CNN while still achieving comparable accuracy.
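
A hedged sketch of that key trick, using torchvision's position-sensitive RoI pooling (the score-map sizes and proposals are invented for illustration): the last convolutional layer produces a k×k grid of position-sensitive score maps per class, each proposal is cropped from those shared maps, and averaging the bins yields per-class scores with essentially no per-region layers left.

```python
# Illustrative position-sensitive RoI pooling as used in R-FCN (shapes are assumptions).
import torch
from torchvision.ops import ps_roi_pool

num_classes = 21                 # e.g. 20 classes + background
k = 3                            # 3x3 position-sensitive grid

# Pretend this is the last conv layer's output: k*k score maps per class.
score_maps = torch.randn(1, num_classes * k * k, 16, 16)

# Two region proposals in image coordinates (x1, y1, x2, y2).
proposals = [torch.tensor([[ 32.,  32., 256., 256.],
                           [128., 128., 480., 400.]])]

# Each proposal is cropped from the shared score maps; averaging the k*k bins
# gives one score per class, with no fully connected layers per box.
pooled = ps_roi_pool(score_maps, proposals, output_size=(k, k),
                     spatial_scale=16 / 512)            # (2, 21, 3, 3)
class_scores = pooled.mean(dim=(2, 3))                  # (2, 21)
print(class_scores.shape)
```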

Conclusion

There you have it! Object detection from the start to the state-of-the-art using deep learning.

