Analysis of Deep Learning-based Object Detection

Original article was published on Deep Learning on Medium

Survey Authors: Licheng Jiao, Fan Zhang, Shuyuan Yang, Lingling Li, Zhixi Feng, Rong Qu, IEEE

Article Link:

Date published: 15th May 2020


I often forget where I last put things such as my keys or mobile phone, and I spend a good amount of time searching for them. I believe ML can help with problems like this. Object detection has become increasingly popular because it is used in many real-world applications, such as monitoring, security, autonomous driving, transportation surveillance, and face detection. Deep convolutional neural networks, together with the GPU computing power needed to run them, are the main drivers of its development. In this article, let's look at various algorithms used for object detection.

With the development of deep learning and the continuous improvement of computing power, great progress has been made in the field of general object detection. Let's go through some representative object detection architectures that help beginners get started in this domain.

There are two main kinds of object detectors: two-stage detectors, of which the most representative is Faster R-CNN, and one-stage detectors, such as YOLO and SSD.

Two-stage detectors achieve high localization and object recognition accuracy, whereas one-stage detectors achieve high inference speed. One-stage detectors predict boxes directly from the input image without a region proposal step, which makes them time-efficient and suitable for real-time detection.

Two Stage Detectors


R-CNN is a region-based CNN detector. It consists of four modules. The first module generates category-independent region proposals. The second extracts a fixed-length feature vector from each region proposal. The third is a set of class-specific linear SVMs that classify the objects in an image. The last is a bounding-box regressor for precise bounding-box prediction.

Fig. 1. Basic architecture of two-stage detectors, which consists of a region proposal network to feed region proposals into classifier and regressor.

Fast R-CNN

Fast R-CNN, as the name implies, is a faster version of R-CNN. R-CNN spends a long time on SVM classification because it performs a CNN forward pass for each region proposal without sharing computation, which also makes training expensive. Fast R-CNN solves this problem: instead of feeding each region proposal to the CNN, we feed the whole input image to the CNN once to generate a convolutional feature map.

Fast R-CNN thus extracts features from the entire input image once. A region of interest (RoI) pooling layer then converts the features inside each region proposal into a fixed-size feature map, which is fed to the fully connected layers for classification and bounding-box regression.
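To make the RoI pooling idea concrete, here is a minimal sketch in plain Python. It assumes a single-channel feature map stored as nested lists and an RoI already expressed in feature-map cells; the function name `roi_max_pool` and the simple integer binning are my own simplifications, not the exact layer from the paper.

```python
from typing import List, Tuple

FeatureMap = List[List[float]]  # single-channel H x W feature map


def roi_max_pool(feature_map: FeatureMap,
                 roi: Tuple[int, int, int, int],
                 out_h: int, out_w: int) -> FeatureMap:
    """Max-pool the region roi = (x0, y0, x1, y1) into an out_h x out_w grid.

    Coordinates are feature-map cells; x1 and y1 are exclusive.
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Row range of bin i; every bin covers at least one cell.
        ys = y0 + (i * roi_h) // out_h
        ye = max(ys + 1, y0 + ((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            # Column range of bin j.
            xs = x0 + (j * roi_w) // out_w
            xe = max(xs + 1, x0 + ((j + 1) * roi_w) // out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Two RoIs of different sizes both come out as 2 x 2 grids.
fm = [[float(y * 6 + x) for x in range(6)] for y in range(4)]
print(roi_max_pool(fm, (0, 0, 6, 4), 2, 2))
print(roi_max_pool(fm, (1, 1, 4, 3), 2, 2))
```

Whatever the size of the proposal, the output grid has the same shape, which is what lets the fully connected layers accept it.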

Fast R-CNN saves significant CNN processing time and a large amount of disk storage. Training is a single-stage, end-to-end process that uses a multi-task loss on each labeled RoI to jointly train the network. For faster detection, truncated SVD can be used to accelerate the forward pass through the fully connected layers.
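Why does truncated SVD help? A fully connected layer with a u × v weight matrix can be approximated by keeping only the top t singular values, which replaces it with two smaller layers and cuts the parameter count from u·v to t·(u + v). The sketch below only does this arithmetic; the layer sizes are illustrative assumptions, not numbers taken from the survey.

```python
def full_layer_params(u: int, v: int) -> int:
    """Weights in a single fully connected layer with a u x v matrix."""
    return u * v


def truncated_layer_params(u: int, v: int, t: int) -> int:
    """Weights after factorizing into a (t x v) layer followed by a (u x t) layer."""
    return t * (u + v)


def compression_ratio(u: int, v: int, t: int) -> float:
    """How many times fewer weights the factorized layers need."""
    return full_layer_params(u, v) / truncated_layer_params(u, v, t)


# Hypothetical layer sizes chosen only for illustration.
u, v, t = 4096, 25088, 1024
print(full_layer_params(u, v), truncated_layer_params(u, v, t))
print(round(compression_ratio(u, v, t), 2))
```

The saving only materializes when t is much smaller than min(u, v); a larger t keeps more accuracy but gives back the speed-up.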

Faster R-CNN

Fast R-CNN uses selective search to propose RoIs, which is slow and takes about as much running time as the detection network itself. Faster R-CNN replaces it with a region proposal network (RPN), a fully convolutional network that efficiently predicts region proposals with a wide range of scales and aspect ratios. The RPN shares full-image convolutional features and a common set of convolutional layers with the detection network, which speeds up region proposal generation.

To detect objects of different sizes, Faster R-CNN uses multi-scale anchors as references. The anchors greatly simplify the generation of region proposals of different sizes, with no need for multiple scales of input images or features.
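As a rough illustration of multi-scale anchors, the sketch below builds the boxes for one location from a set of scales and aspect ratios. The defaults (three scales, three ratios, so nine anchors) follow the commonly cited Faster R-CNN setting, but the function name and the height/width ratio convention are assumptions of mine.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def make_anchors(cx: float, cy: float,
                 scales: Sequence[float] = (128, 256, 512),
                 ratios: Sequence[float] = (0.5, 1.0, 2.0)) -> List[Box]:
    """Return anchors centered at (cx, cy).

    Each anchor has area scale**2 and height/width equal to ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


for box in make_anchors(100.0, 100.0):
    print(tuple(round(c, 1) for c in box))
```

The same set of anchors is reused at every position of the feature map, so a single-scale image and a single feature map are enough to cover objects of many sizes.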

Fig. 2. Comparison of different R-CNN models

One Stage Detectors


You Only Look Once (YOLO) is mainly used for real-time detection on full images and webcam streams. First, it predicts fewer than 100 bounding boxes per image, whereas Fast R-CNN relies on selective search, which produces about 2000 region proposals per image. Second, it frames detection as a regression problem: a single network extracts features from the image and directly predicts bounding boxes and class probabilities. The YOLO network runs at 45 frames per second with no batch processing.
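The "fewer than 100 boxes" figure can be checked with a little arithmetic. The sketch below assumes the settings usually quoted for the original YOLO on PASCAL VOC (a 7 × 7 grid, 2 boxes per cell, 20 classes); those values are not stated in this article, so treat them as assumptions.

```python
from typing import Tuple


def yolo_output_shape(s: int = 7, b: int = 2, c: int = 20) -> Tuple[int, int, int]:
    """Shape of the prediction tensor: an S x S grid where each cell holds
    B boxes (x, y, w, h, confidence) plus C class probabilities."""
    return (s, s, b * 5 + c)


def yolo_num_boxes(s: int = 7, b: int = 2) -> int:
    """Total number of boxes predicted for one image."""
    return s * s * b


print(yolo_output_shape())  # shape of the regression target
print(yolo_num_boxes())     # boxes per image, to compare with ~2000 proposals
```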


YOLOv2 is an improved version of YOLO. It adds a batch normalization layer ahead of each convolutional layer, which speeds up convergence and helps regularize the model. Other improvements include:

  • High Resolution Classifier: The original YOLO trains its classifier at an input resolution of 224 × 224 and then increases it to 448 × 448 for detection, so the network has to adjust to the new input resolution when switching to the detection task. To avoid this, YOLOv2 first fine-tunes the classification network at 448 × 448 for 10 epochs on the ImageNet dataset.
  • Convolutional with Anchor Boxes: YOLOv2 removes the fully connected layers and predicts class and objectness for every anchor box.
  • Dimension Clusters: K-means clustering is run on the training-set bounding boxes to automatically find good priors for the sizes and aspect ratios of the anchor boxes.
  • Fine-Grained Features: YOLOv2 concatenates higher-resolution features with low-resolution features by stacking adjacent features into different channels.
  • Multi-Scale Training: To make the network robust to images of different sizes, every 10 batches the network randomly chooses a new image dimension from {320, 352, …, 608}. This means the same network can predict detections at different resolutions.
  • Darknet-19: YOLOv2 proposes a new classification backbone, Darknet-19, with 19 convolutional layers and 5 max-pooling layers, which requires fewer operations to process an image yet achieves high accuracy.
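The dimension-cluster step above can be sketched as a small k-means over (width, height) pairs, using 1 − IoU as the distance so that large boxes do not dominate. This is a simplified illustration: the deterministic "first k boxes" initialization and the toy data are my own assumptions.

```python
from typing import List, Tuple

WH = Tuple[float, float]  # (width, height) of a box


def wh_iou(a: WH, b: WH) -> float:
    """IoU of two boxes that share the same top-left corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes: List[WH], k: int, iters: int = 50) -> List[WH]:
    """Cluster box shapes; the closest centroid is the one with highest IoU."""
    centroids = list(boxes[:k])  # naive initialization (assumption)
    for _ in range(iters):
        clusters: List[List[WH]] = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append((sum(w for w, _ in members) / len(members),
                                      sum(h for _, h in members) / len(members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return centroids


# Toy box shapes: three small and three large.
toy = [(10, 10), (100, 100), (12, 12), (110, 90), (8, 9), (90, 110)]
print(kmeans_anchors(toy, k=2))
```

The returned centroids play the role of anchor priors: one roughly "small" shape and one roughly "large" shape for this toy data.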


YOLOv3 is an improved version of YOLOv2.

  • First, it uses multi-label classification to adapt to more complex datasets that contain many overlapping labels.
  • Second, it uses feature maps at three different scales to predict bounding boxes. The last convolutional layer predicts a 3-D tensor encoding class predictions, objectness, and bounding box.
  • Third, YOLOv3 proposes a deeper and more robust feature extractor, called Darknet-53, inspired by ResNet.

Thanks to multi-scale predictions, YOLOv3 is better at detecting small objects, but it performs comparatively worse on medium and large objects.
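The multi-label idea is easiest to see by comparing a softmax with independent sigmoids. In the sketch below, the class names, logits, and the 0.5 threshold are made up for illustration; the point is only that sigmoid scores let one box carry several overlapping labels, while softmax forces the classes to compete.

```python
import math
from typing import Dict, List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multilabel_predict(logits: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep every class whose independent sigmoid score passes the threshold."""
    return [name for name, z in logits.items() if sigmoid(z) >= threshold]


# Hypothetical logits for one box with overlapping labels.
logits = {"person": 2.0, "woman": 1.5, "dog": -3.0}
print(multilabel_predict(logits))                        # several labels can pass
print([round(p, 3) for p in softmax(list(logits.values()))])  # scores compete, sum to 1
```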


SSD is a single-shot detector that directly predicts category scores and box offsets for a fixed set of default bounding boxes at each location in several feature maps of different scales, as shown in Fig. 3(a). The default bounding boxes have different aspect ratios and scales in each feature map. Across feature maps, the scales of the default boxes are regularly spaced between the lowest and highest layers, so each feature map learns to respond to objects of a particular scale. For each default box, SSD predicts both the offsets and the confidences for all object categories.
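The "regularly spaced" scales can be written down directly. The sketch below assumes the linear rule from the SSD paper, s_k = s_min + (s_max − s_min)(k − 1)/(m − 1), with the commonly used s_min = 0.2 and s_max = 0.9; the function names and the choice of six feature maps are illustrative assumptions.

```python
import math
from typing import List, Tuple


def ssd_default_scales(m: int = 6, s_min: float = 0.2, s_max: float = 0.9) -> List[float]:
    """Scale of the default boxes for each of m feature maps, lowest layer first.

    Scales are fractions of the input image size.
    """
    return [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)]


def default_box_size(scale: float, aspect_ratio: float) -> Tuple[float, float]:
    """(width, height) of a default box with the given scale and width/height ratio."""
    return (scale * math.sqrt(aspect_ratio), scale / math.sqrt(aspect_ratio))


for s in ssd_default_scales():
    print(round(s, 2), [tuple(round(v, 3) for v in default_box_size(s, a))
                        for a in (1.0, 2.0, 0.5)])
```

Lower layers get the small scales and higher layers the large ones, which matches the idea that each feature map specializes in one object size.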

Fig 3. Four methods utilizing features for different sized object prediction.


Deconvolutional Single Shot Detector (DSSD) is a modified version of SSD that adds a prediction module and a deconvolution module and adopts ResNet-101 as its backbone. The architecture of DSSD is shown in Fig. 3(b) above. The deconvolution module increases the resolution of the feature maps to strengthen the features. Each deconvolution layer is followed by a prediction module, which helps predict objects of different sizes.

There are many other detectors, such as M2Det, RefineDet, DCNv2, and NAS-FPN, that are explained in the paper.

Performance of benchmark datasets and their Metrics for Object Detection

Using challenging datasets as benchmarks is important in many areas of research because they allow a standard comparison between different algorithms and set goals for solutions. Early algorithms focused on face detection using various ad hoc datasets. Later, more realistic and challenging face detection datasets were created. Another popular challenge is pedestrian detection, for which several datasets have been created.

PASCAL VOC dataset

The PASCAL VOC dataset contains 20 object categories (such as person, bicycle, bird, bottle, and dog) spread over 11,000 images. These 20 categories fall into four main groups: vehicles, animals, household objects, and people. Under the VOC2007 criteria, interpolated average precision is used to evaluate both classification and detection. Sample images from the PASCAL VOC dataset are shown below.
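For readers who have not met interpolated average precision before, here is a minimal sketch of the 11-point version usually associated with VOC2007: precision is sampled at recall levels 0, 0.1, …, 1.0, taking at each level the best precision achieved at that recall or higher. The function name and the toy precision/recall points are my own.

```python
from typing import Sequence


def voc07_interpolated_ap(recalls: Sequence[float], precisions: Sequence[float]) -> float:
    """11-point interpolated average precision from paired recall/precision points."""
    ap = 0.0
    for i in range(11):
        level = i / 10
        # Best precision among all operating points with recall >= level.
        reachable = [p for r, p in zip(recalls, precisions) if r >= level]
        ap += max(reachable) if reachable else 0.0
    return ap / 11


# Toy curve: precision 1.0 up to recall 0.5, then 0.5 at full recall.
print(round(voc07_interpolated_ap([0.5, 1.0], [1.0, 0.5]), 4))
```

A detector that keeps high precision while recall grows gets an AP close to 1; recall levels it never reaches contribute zero.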

Fig.4. Annotated sample images from the PASCAL VOC dataset

MS COCO benchmark

The Microsoft Common Objects in Context (MS COCO) dataset is designed for detecting and segmenting objects found in everyday life in their natural environments. It contains 91 common object categories, 82 of which have more than 5,000 labeled instances. In total, the dataset has 2,500,000 labeled instances in 328,000 images.

Fig.5. MS COCO dataset with three different types of images sampled in the dataset, including iconic objects, iconic scenes and non-iconic objects.

ImageNet benchmark

The ILSVRC object detection challenge evaluates the ability of an algorithm to name and localize all instances of all target objects present in an image. ILSVRC2014 has 200 object classes, nearly 450k training images, 20k validation images, and 40k test images.

ImageNet uses a loosened threshold calculated as:

t = min(0.5, wh / ((w + 10)(h + 10)))

where w and h are the width and height of a ground-truth box, respectively. This threshold allows the annotation to extend up to 5 pixels on average in each direction around the object. A comparison between the ILSVRC object detection dataset and the PASCAL VOC dataset is shown in the table below.
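As a quick numerical check of the loosened threshold, the sketch below assumes the rule given in the ILSVRC definition, min(0.5, wh / ((w + 10)(h + 10))), with w and h in pixels. The function name is mine.

```python
def ilsvrc_iou_threshold(w: float, h: float) -> float:
    """IoU threshold for a ground-truth box of width w and height h (pixels)."""
    return min(0.5, (w * h) / ((w + 10) * (h + 10)))


# The threshold drops below 0.5 only for small boxes.
for size in (10, 20, 30, 100, 1000):
    print(size, round(ilsvrc_iou_threshold(size, size), 3))
```

For large objects the usual 0.5 threshold applies, while small objects get a more forgiving threshold because a few pixels of annotation slack matter much more for them.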

Analysis Of General Image Object Detection Methods

  • Deep neural network based object detection pipelines generally have four steps: image pre-processing, feature extraction, classification and localization, and post-processing.
  • First, raw images from the dataset can't be fed into the network directly. We need to resize them to the required sizes and make them clearer, for example by enhancing brightness, color, and contrast.
  • Data augmentation is also available, including flipping, rotation, scaling, cropping, translation, and adding Gaussian noise. In addition, Generative Adversarial Networks (GANs) can be used to generate new images.
  • Second, feature extraction prepares the image for detection. The quality of the features directly determines the upper bound of subsequent tasks such as classification and localization.
  • Third, the detector head proposes and refines bounding boxes, producing classification scores and bounding-box coordinates.
  • Finally, the post-processing step removes weak detection results.
  • To obtain precise detection results, several methods described in the paper can be used alone or in combination.
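The post-processing step is commonly implemented with non-maximum suppression (NMS). The sketch below is a plain greedy NMS, assuming boxes in (x_min, y_min, x_max, y_max) form; the thresholds and function names are illustrative choices, not values from the survey.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5, score_threshold: float = 0.0) -> List[int]:
    """Return indices of kept boxes, highest score first.

    Boxes scoring below score_threshold are dropped; a box is also dropped
    when it overlaps an already kept box by more than iou_threshold.
    """
    order = sorted((i for i in range(len(boxes)) if scores[i] >= score_threshold),
                   key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the second box heavily overlaps the first
```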


  • Object detection is widely used in many fields, both to assist people and to carry out important tasks.
  • In the security field, it is mainly used for face detection, fingerprint identification, fraud detection, anomaly detection, and so on.
  • In the military field, representative applications include remote sensing object detection, topographic survey, and flyer detection.
  • In the transportation field, license plate recognition, automatic driving, and traffic sign recognition greatly facilitate people’s lives.
  • Object detection has a wide range of application scenarios. Research in this domain covers a large variety of branches, such as highlight detection, edge detection, object detection in videos, and 2D and 3D pose detection (sample image provided below).
Fig. 6. Some examples of multi-person pose estimation.


Object detection has been growing rapidly with the continuous upgrade of powerful computing equipment, and building detectors that are both highly accurate and efficient is the ultimate goal of this task. Researchers have pursued a series of directions: constructing new architectures, extracting rich features, exploiting good representations, improving processing speed, training from scratch, anchor-free methods, handling difficult scenes (small objects, occluded objects), combining one-stage and two-stage detectors, improving the NMS post-processing method, addressing the imbalance between negatives and positives, and increasing localization accuracy to enhance classification confidence. With the increasing need for powerful object detectors in the security, military, transportation, medical, and daily-life fields, the applications of object detection are becoming ever more extensive. In addition, a variety of branches of the detection domain have emerged. Although recent achievements in this domain have been effective, there is still plenty of room for further development.

I hope my attempt to explain multiple object detection techniques, along with the comparison between them, is useful to you.