Demystifying Object Detection using Deep Learning

Source: Deep Learning on Medium


Object detection has attracted a great deal of attention because of its wide range of applications and recent advances in deep learning. It is a subdomain of image processing and computer vision that deals with identifying and localizing objects in digital images and videos. Much of its progress is owed to two developments: convolutional neural networks (CNNs), a class of deep learning models that achieved a breakthrough in image classification, and graphics processing units (GPUs), which made training such networks practical. Together they have enabled real-world computer vision solutions such as autonomous driving, face detection and recognition, people detection and tracking, video surveillance, and security system design.

Object detection can be done using either traditional machine learning or deep learning approaches. The difference is that traditional machine learning relies on handcrafted features to identify objects in an image or video, while deep learning learns high-level semantic features on its own by training convolutional neural networks with back-propagation. Driven by the rapid progress of state-of-the-art (SOTA) models on the ImageNet image classification challenge, object detection has also become an active area of research.

Problem Definition-Object Detection

The goal of object detection is to determine where objects are located in a given image (object localization) and which category each object belongs to (object classification). The pipeline of traditional object detection models can be divided into three main stages:

  1. Informative region selection
  2. Feature extraction
  3. Classification

Limitation of Traditional Object Detection Algorithms

Inference with the classical object detection pipeline is slow, which makes it impractical for real-time object detection and classification. Much of the cost comes from the region selection stage, which has to examine a large number of candidate regions.

For instance, OpenCV's AdaBoost-based detector uses a sliding window and an image pyramid to generate candidate detection windows, and R-CNN uses Selective Search to find regions of interest.
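To make the cost of sliding-window region selection concrete, here is a minimal sketch in plain Python. It only enumerates window coordinates over an image pyramid and does not touch pixel data. The function names, the 64-pixel window, the 32-pixel stride, and the 1.5 scale factor are illustrative assumptions, not values taken from OpenCV or from this article.

```python
def pyramid_sizes(width, height, scale=1.5, min_size=64):
    """Yield the (width, height) of each pyramid level, shrinking by `scale`."""
    while width >= min_size and height >= min_size:
        yield width, height
        width, height = int(width / scale), int(height / scale)


def sliding_windows(width, height, window=64, stride=32):
    """Yield (x, y, w, h) boxes that tile one pyramid level."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield x, y, window, window


def candidate_regions(width, height, window=64, stride=32, scale=1.5):
    """Yield (level, box) pairs for every window at every pyramid level."""
    levels = pyramid_sizes(width, height, scale, min_size=window)
    for level, (w, h) in enumerate(levels):
        for box in sliding_windows(w, h, window, stride):
            yield level, box


# Every candidate region would need its own feature extraction and
# classification step, which is why this approach scales poorly.
regions = list(candidate_regions(256, 256))
print(len(regions), "candidate windows for a 256x256 image")
```

Even for a small image and a coarse stride, the number of windows grows quickly, and each one has to pass through the feature extraction and classification stages.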

Deep Learning Object Detectors

Based on the type and number of neural network architectures used in designing the detector system, the deep learning object detector can be classified into two categories:

  • Two-stage detectors: high localization and object recognition accuracy (e.g. Faster RCNN)
  • Single-stage detectors: high inference speed (e.g. YOLO, SSD, RetinaNet)

Two-stage Detectors

Two-stage detectors, as the name suggests, combine two deep neural network stages. The first network, called the Region Proposal Network (RPN), finds candidate object bounding boxes; the second extracts features from each candidate region and uses them to classify the object. Two-stage detectors achieve higher accuracy and precision, but their inference is slower because of the added network complexity.

Two-stage Detector

Single-stage Detectors

A single-stage detector predicts bounding boxes and class probabilities directly from the input image, treating detection as a regression problem. Skipping the separate proposal stage speeds up inference; the original YOLO, for example, predicts fewer than 100 bounding boxes per image. This makes single-stage detectors time-efficient and suitable for real-time applications.

Single-stage Detector

The Deep Learning Algorithms for Object Detection:

  • Faster RCNN (a two-stage detector)
  • YOLO
  • SSD
  • RetinaNet

Faster RCNN

Faster RCNN is a two-stage object detector that combines two components: Fast RCNN and a Region Proposal Network (RPN). The RPN is a fully convolutional network that generates high-quality region proposals by predicting object bounds and objectness scores; these proposals are then fed to Fast RCNN for detection. In effect, the RPN acts as an attention mechanism that tells Fast RCNN where to look. Because the RPN shares full-image convolutional features, that is, a common set of convolutional layers, with the detection network, it speeds up the generation of region proposals.

Faster RCNN

YOLO (You Only Look Once)

“A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.”

YOLO takes the whole image as input and processes it in a single pass to predict object bounding boxes and their class probabilities. The model is extremely fast: its base model can process images at up to 45 frames per second. It first divides the input image into an n × n grid. Each grid cell then predicts bounding boxes for the objects whose centers fall inside it, together with their class probabilities.
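The grid idea can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not YOLO itself: the function names are hypothetical, and the defaults (a 7 × 7 grid, 2 boxes per cell, 20 classes) are the settings reported in the original YOLO paper for PASCAL VOC rather than values stated in this article.

```python
def responsible_cell(cx, cy, img_w, img_h, s=7):
    """Return (row, col) of the grid cell that contains the box center."""
    col = min(int(cx / img_w * s), s - 1)  # clamp centers on the right edge
    row = min(int(cy / img_h * s), s - 1)  # clamp centers on the bottom edge
    return row, col


def output_shape(s=7, b=2, c=20):
    """Each cell predicts b boxes (x, y, w, h, confidence) plus c class scores."""
    return s, s, b * 5 + c


def total_boxes(s=7, b=2):
    """Number of boxes predicted for the whole image."""
    return s * s * b


# An object centered at (224, 300) in a 448x448 image.
print(responsible_cell(224, 300, 448, 448))  # (4, 3)
print(output_shape())                        # (7, 7, 30)
print(total_boxes())                         # 98
```

Under these assumed settings the network emits a single 7 × 7 × 30 tensor per image, which is why one forward pass is enough to produce every detection.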


SSD (Single Shot MultiBox Detector)

SSD is a single-stage object detector that uses a single deep neural network for detection. It discretizes the output space of bounding boxes into a set of default boxes with different aspect ratios and scales at each feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object's shape.
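As a rough sketch of how default boxes tile a feature map, the Python below generates boxes for one square feature map. The function name, the 4 × 4 map size, the scale of 0.3, and the three aspect ratios are assumptions chosen for illustration; the width and height rule (scale times or divided by the square root of the aspect ratio) follows the SSD paper.

```python
import math


def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return (cx, cy, w, h) boxes, normalized to [0, 1], for one feature map."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Center each box on its feature map cell.
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                w = scale * math.sqrt(ar)  # wider when ar > 1
                h = scale / math.sqrt(ar)  # taller when ar < 1
                boxes.append((cx, cy, w, h))
    return boxes


boxes = default_boxes(fmap_size=4, scale=0.3)
print(len(boxes))  # 4 * 4 cells * 3 aspect ratios = 48
```

In the full model this is repeated over several feature maps of different resolutions, each with its own scale, so that small and large objects are both covered. For every default box the network predicts class scores and four offsets.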

SSD is simple relative to methods that require object proposals, because it completely eliminates proposal generation and the subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component.


RetinaNet

RetinaNet is a single-stage object detector that replaces the standard cross-entropy loss with a new loss function called focal loss. Focal loss addresses class imbalance, the problem that causes one-stage detectors to lose accuracy compared with two-stage detectors. Class imbalance during training is one of the main obstacles preventing one-stage detectors from achieving state-of-the-art accuracy, and focal loss is designed to remove this barrier.

Focal Loss: The focal loss is designed for the one-stage object detection scenario, in which there is an extreme imbalance between foreground and background classes during training. The focal loss, defined as a modification of the cross-entropy (CE) loss for binary classification, is given below:

Focal Loss
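In the RetinaNet paper the binary focal loss is written as FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The following is a minimal Python sketch of that formula for a single prediction. The function names are illustrative, and the defaults γ = 2 and α = 0.25 are the values reported in the paper, not values given in this article.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for one prediction; p is P(foreground), y is 0 or 1."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy down-weighted for well-classified examples."""
    p_t = p if y == 1 else 1.0 - p                # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha    # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy background example (confidently correct) versus a hard foreground one.
easy = focal_loss(0.05, 0)
hard = focal_loss(0.10, 1)
print(f"easy: {easy:.6f}  hard: {hard:.6f}")
print(f"CE on the same easy example: {cross_entropy(0.05, 0):.6f}")
```

The factor (1 - p_t)^γ shrinks the loss of examples the model already classifies well, so the many easy background boxes contribute little and training concentrates on hard examples. With γ = 0 the expression reduces to α-weighted cross-entropy.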

Evolution Roadmap of Object Detection Algorithms:

Object detection algorithms have evolved rapidly, and in a short time several high-quality algorithms have emerged that play a major role in computer vision applications.

Road Map of Object Detection