Evolution of Object Detection and Localization Algorithms

Understanding the recent evolution of object detection and localization, with an intuitive explanation of the underlying concepts.

Object detection is one of the areas of computer vision that is maturing very rapidly, thanks to deep learning. Every year, new algorithms and models keep outperforming the previous ones. In fact, one of the latest state-of-the-art software systems for object detection was released just last week by the Facebook AI team. The software is called Detectron; it incorporates numerous research projects for object detection and is powered by the Caffe2 deep learning framework.

Today, there is a plethora of pre-trained models for object detection (YOLO, RCNN, Fast RCNN, Mask RCNN, Multibox etc.). So, it only takes a small amount of effort to detect most of the objects in a video or in an image. But the objective of my blog is not to talk about the implementation of these models. Rather, it is my attempt to explain the underlying concepts in a clear and concise manner.

I recently completed Week 3 of Andrew Ng’s Convolutional Neural Networks course, in which he talks about object detection algorithms. Most of the content of this blog is inspired by that course.

Brief introduction to CNN

Before I explain the working of object detection algorithms, I want to spend a few lines on Convolutional Neural Networks, also called CNNs or ConvNets. CNNs are the basic building blocks for most computer vision tasks in the deep learning era.

Fig. 1. Convolution demo in Excel

What do we want? We want an algorithm that looks at an image, sees the pattern in it and tells us what type of object is in the image, e.g., whether it is an image of a cat or a dog.

What is an image to a computer? Just a matrix of numbers. For example, see Figure 1 above. The image on the left is a 28×28 pixel image of the handwritten digit 2 (taken from the MNIST data), represented as a matrix of numbers in an Excel spreadsheet.

How can we teach computers to recognize the object in an image? By making computers learn patterns such as vertical edges, horizontal edges, round shapes and maybe plenty of other patterns unknown to humans.

How do computers learn patterns? Convolutions!
(Look at the figure above while reading this.) Convolution is a mathematical operation between two matrices that gives a third matrix. The smaller matrix, which we call the filter or kernel (3×3 in Figure 1), slides over the matrix of image pixels; at each position, the overlapping numbers are multiplied element-wise and summed to give one number of the output matrix. Depending on the numbers in the filter matrix, the output matrix highlights specific patterns present in the input image. In the example above, the filter is a vertical edge detector, which responds to vertical edges in the input image. In the context of deep learning, the input images and their subsequent outputs are passed through a number of such filters. The numbers in the filters are learnt by the neural net, so it derives the patterns on its own.
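To make the operation concrete, here is a minimal sketch in plain Python. The toy 6×6 image and the function name `convolve2d` are my own illustrative choices, not the MNIST digit from Figure 1, and the filter values are a hand-written vertical edge detector rather than learnt weights.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    At each position, multiply overlapping numbers and sum them.
    Deep learning calls this convolution (strictly, cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Toy image (assumption): bright left half, dark right half.
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]

# Hand-written vertical edge detector.
vertical_edge = [[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]]

edges = convolve2d(image, vertical_edge)
for row in edges:
    print(row)  # each row is [0, 30, 30, 0]
```

The large values in the middle columns of the output mark where the bright region meets the dark one, which is exactly the vertical edge.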

Why do convolutions work? Because in most images, objects have consistent relative pixel intensities (magnitudes of the numbers) that convolutions can leverage.

I know that a few lines on CNNs are not enough for a reader who doesn’t know about them. But CNNs are not the main topic of this blog, and I have provided this basic intro so that the reader does not have to open 10 more links to understand CNNs before continuing further.

After reading this blog, if you still want to know more about CNNs, I would strongly suggest reading this blog by Adam Geitgey.

Categorization of computer vision tasks

Fig. 2: Common computer vision tasks

Taking the example of the cat and dog images in Figure 2, the following are the most common tasks done by computer vision modeling algorithms:

  1. Image Classification: This is the most common computer vision problem, where an algorithm looks at an image and classifies the object in it. Image classification has a wide variety of applications, ranging from face detection on social networks to cancer detection in medicine. Such problems are typically modeled using Convolutional Neural Nets (CNNs).
  2. Object classification and localization: Let’s say we not only want to know whether there is a cat in the image, but also where exactly the cat is. Object localization algorithms not only label the class of an object, but also draw a bounding box around the position of the object in the image.
  3. Multiple objects detection and localization: What if there are multiple objects in the image (3 dogs and 2 cats as in the above figure) and we want to detect them all? That would be an object detection and localization problem. A well-known application of this is in self-driving cars, where the algorithm needs to detect not only the cars, but also pedestrians, motorcycles, trees and other objects in the frame. These kinds of problems need to leverage the ideas learnt from image classification as well as from object localization.

Now, coming back to computer vision tasks: in the context of deep learning, the basic algorithmic difference among the above 3 types of tasks is just the choice of relevant inputs and outputs. Let me explain this in detail with an infographic.

1. Image Classification

Fig. 3: Steps for image classification using CNN

The infographic in Figure 3 shows what a typical CNN for image classification looks like.

1. Convolve an input image of some height, width and channel depth (940, 550, 3 in the above case) with n filters (n = 4 in Fig. 3) [if you are still confused about what exactly convolution means, please check this link to understand convolutions in deep neural networks].
2. The output of the convolution is treated with non-linear transformations, typically ReLU followed by Max Pool.
3. The above 3 operations of Convolution, ReLU and Max Pool are performed multiple times.
4. The output of the final layer is sent to a Softmax layer, which converts the numbers into values between 0 and 1 that sum to 1, giving the probability of the image belonging to each class. We minimize the loss so as to make the predictions from this last layer as close as possible to the actual values.
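The operations in steps 2 to 4 can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the feature map and class scores are made-up numbers, the pooling window is fixed at 2×2, and cross-entropy is used as the loss (the post does not name a specific loss).

```python
import math


def relu(feature_map):
    """Replace every negative value with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the largest value of each non-overlapping 2x2 block."""
    pooled = []
    for i in range(0, len(feature_map) - 1, 2):
        row = []
        for j in range(0, len(feature_map[0]) - 1, 2):
            row.append(max(feature_map[i][j], feature_map[i][j + 1],
                           feature_map[i + 1][j], feature_map[i + 1][j + 1]))
        pooled.append(row)
    return pooled


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Loss is small when the true class gets a high probability."""
    return -math.log(probs[true_index])


# Made-up output of one convolution filter (assumption).
feature_map = [[-1, 2, 0, -3],
               [4, -5, 6, 1],
               [0, 0, -2, -2],
               [3, -1, -4, 5]]
pooled = max_pool_2x2(relu(feature_map))

# Made-up final-layer scores for 3 classes (assumption).
probs = softmax([2.0, 1.0, 0.1])
loss = cross_entropy(probs, true_index=0)
print(pooled, probs, loss)
```

Training adjusts the filter weights so that this loss goes down across the whole training set.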

2. Object classification and localization

Fig. 4: Input and output for object localization problems

Now, to make our model draw the bounding box of an object, we just change the output labels from the previous algorithm, so that our model learns the class of the object and also its position in the image. We add 4 more numbers to the output layer: the centroid position of the object (x and y) and the width and height of the bounding box as proportions of the image.
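As a rough sketch of what such a label could look like, the snippet below turns a pixel-space box into the 4 extra numbers and appends them to a one-hot class vector. The function names, the ordering of the numbers in the vector, the 800×600 image size and the box coordinates are all assumptions made for illustration; Figure 4 may lay the vector out differently.

```python
def encode_box(x_min, y_min, x_max, y_max, image_w, image_h):
    """Convert pixel corners to centroid and size as fractions of the image."""
    bx = (x_min + x_max) / 2 / image_w   # centroid x, between 0 and 1
    by = (y_min + y_max) / 2 / image_h   # centroid y, between 0 and 1
    bw = (x_max - x_min) / image_w       # box width as a share of image width
    bh = (y_max - y_min) / image_h       # box height as a share of image height
    return [bx, by, bw, bh]


def localization_target(class_index, num_classes, box, image_w, image_h):
    """Class one-hot vector followed by the 4 box numbers (assumed layout)."""
    one_hot = [1.0 if i == class_index else 0.0 for i in range(num_classes)]
    return one_hot + encode_box(*box, image_w, image_h)


classes = ["cat", "dog"]
# Hypothetical cat box in a hypothetical 800x600 image.
target = localization_target(classes.index("cat"), len(classes),
                             (100, 150, 500, 450), 800, 600)
print(target)  # [1.0, 0.0, 0.375, 0.5, 0.5, 0.5]
```

Expressing the box as fractions of the image keeps the labels in the same 0 to 1 range regardless of the image resolution.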

Simple, right? Just add a bunch of output units to spit out the x, y coordinates of the different positions you want to recognize. These positions, or landmarks, would be consistent for a particular object across all the images we have. For example, for a car, the height would be smaller than the width, and the centroid would have some specific pixel intensity compared to other points in the image.

Applying the same logic, what do you think would change if there are multiple objects in the image and we want to classify and localize all of them? I would suggest you pause and ponder at this moment; you might get the answer yourself.

3. Multiple objects detection and localization

Fig. 5: Input and output for object detection and localization problems

To detect all kinds of objects in an image, we can directly use what we have learnt so far from object localization. The difference is that we want our algorithm to be able to classify and localize all the objects in an image, not just one. So the idea is to crop the image into multiple smaller images and run a CNN on each crop to detect an object.

The algorithm works as follows:

1. Make a window of size much smaller than the actual image size. Crop the image to it, pass the crop to a ConvNet (CNN) and have the ConvNet make a prediction.
2. Keep sliding the window and passing the cropped images into the ConvNet.
3. After cropping all portions of the image with this window size, repeat all the steps for a slightly bigger window size. Again pass the cropped images into the ConvNet and let it make predictions.
4. At the end, you will have a set of cropped regions that contain some object, together with the class and bounding box of the object.
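The steps above can be sketched as follows. Everything here is a toy under stated assumptions: `toy_classifier` is a hypothetical stand-in for a trained ConvNet (it just scores a crop by its mean brightness), and the 8×8 image, window sizes, stride and threshold are arbitrary values chosen for the demo.

```python
def sliding_windows(image_w, image_h, window_sizes, stride):
    """Yield (x, y, size) for every square crop, one window size at a time."""
    for size in window_sizes:
        for y in range(0, image_h - size + 1, stride):
            for x in range(0, image_w - size + 1, stride):
                yield x, y, size


def detect(image, classify_crop, window_sizes, stride, threshold):
    """Run the classifier on every crop and keep the confident ones."""
    image_h, image_w = len(image), len(image[0])
    detections = []
    for x, y, size in sliding_windows(image_w, image_h, window_sizes, stride):
        crop = [row[x:x + size] for row in image[y:y + size]]
        label, score = classify_crop(crop)
        if score >= threshold:
            detections.append((x, y, size, label, score))
    return detections


def toy_classifier(crop):
    """Hypothetical stand-in for a ConvNet: score is the mean pixel value."""
    pixels = [p for row in crop for p in row]
    return "object", sum(pixels) / len(pixels)


# Toy 8x8 image with a bright 4x4 square in the middle.
image = [[1 if 2 <= r <= 5 and 2 <= c <= 5 else 0 for c in range(8)]
         for r in range(8)]

num_windows = len(list(sliding_windows(8, 8, [4, 6], 2)))
detections = detect(image, toy_classifier, window_sizes=[4, 6],
                    stride=2, threshold=0.9)
print(num_windows, detections)
```

Even on this tiny image the classifier is called 13 times; on a real image with many window sizes and a small stride, the count grows very quickly.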

This solution is known as object detection with sliding windows. It is a very basic solution with many caveats, such as the following:

A. Computationally expensive: Cropping multiple images and passing each of them through the ConvNet is going to be computationally very expensive.

Solution: There is a simple hack to reduce the computational cost of the sliding window method. It is to replace the fully connected layers in the ConvNet with convolutional layers (such as 1×1 convolutions) and, for a given window size, pass the input image only once. So, in an actual implementation we do not pass the cropped images one at a time; we pass the complete image at once, and the computation shared by overlapping windows is done only once.
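Here is a small sketch of why a 1×1 convolution can stand in for a fully connected layer. It assumes the earlier convolution and pooling layers have already turned the whole image into a feature map in which each spatial position summarizes one window; the feature values, weights and function names are made up for illustration.

```python
def dense(features, weights, bias):
    """Fully connected layer applied to the feature vector of one window."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def conv1x1(feature_map, weights, bias):
    """1x1 convolution: the same weights applied at every spatial position."""
    out = []
    for row in feature_map:
        out_row = []
        for cell in row:
            out_row.append([
                sum(weights[k][c] * cell[c] for c in range(len(cell))) + bias[k]
                for k in range(len(weights))
            ])
        out.append(out_row)
    return out


# Made-up 2x2 feature map with 3 channels; each position stands for one window.
feature_map = [[[1, 0, 2], [0, 1, 1]],
               [[2, 2, 0], [1, 1, 1]]]
weights = [[1, -1, 0],   # weights for class score 0
           [0, 2, 1]]    # weights for class score 1
bias = [0, 1]

# Window by window: one dense call per window position.
per_window = [[dense(cell, weights, bias) for cell in row] for row in feature_map]

# All windows at once: a single 1x1 convolution over the whole map.
score_map = conv1x1(feature_map, weights, bias)
print(score_map == per_window)  # True
```

The two results match, so the network can score every window position in one pass instead of being run once per crop.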

B. Inaccurate bounding boxes: We are sliding square windows all over the image, but the object may be rectangular, or none of the squares may match the actual size of the object. Although this algorithm is able to find and localize multiple objects in an image, the accuracy of its bounding boxes is still poor.

Fig. 6. Bounding boxes from sliding window CNN

I have talked about the most basic solution to the object detection problem. But it has many caveats: it is not very accurate and it is computationally expensive. So, how can we make our algorithm better and faster?

Better solution? YOLO

It turns out that we have YOLO (You Only Look Once), which is much more accurate and faster than the sliding window algorithm. It is based on only a minor tweak on top of the algorithms we already know. The idea is to divide the image into a grid of cells. Then we change the labels of our data such that we run both the localization and the classification algorithm for each grid cell. Let me explain this with one more infographic.

Fig. 7. Bounding boxes, input and output for YOLO

YOLO in easy steps:

1. Divide the image into a grid of cells. For illustration, I have drawn a 4×4 grid in the above figure, but actual implementations of YOLO use different grid sizes (7×7 for training YOLO on the PASCAL VOC dataset).

2. Label the training data as shown in the above figure. If C is the number of unique object classes in our data and S×S is the number of grid cells into which we split the image, then our output vector will be of length S×S×(C+5): for each cell, 1 number for whether an object is present, 4 for the bounding box and C for the classes. For example, in the above case our target vector has 4×4×(3+5) = 128 numbers, as we divided our images into a 4×4 grid and are training for 3 unique object classes: Car, Light and Pedestrian.

3. Train one deep convolutional neural net with a loss function that measures the error between the output activations and the label vector. Basically, the model predicts the outputs for all the grid cells in just one forward pass of the input image through the ConvNet.

4. Keep in mind that the label for an object being present in a grid cell (P.Object) is determined by the presence of the object’s centroid in that cell. This is important to prevent one object from being counted multiple times in different cells.
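The labelling in steps 2 and 4 can be sketched like this. The per-cell layout [P.Object, bx, by, bw, bh, class one-hot], the choice to express the centroid relative to its cell, and the example car are my assumptions for illustration; the published YOLO models use their own conventions.

```python
def encode_yolo_labels(objects, grid_size, num_classes):
    """Build an S x S x (5 + C) label grid.

    objects: list of (class_index, cx, cy, w, h), all as fractions of the image.
    """
    cell_len = 5 + num_classes
    labels = [[[0.0] * cell_len for _ in range(grid_size)]
              for _ in range(grid_size)]
    for class_index, cx, cy, w, h in objects:
        # The cell that contains the centroid owns the object (step 4).
        col = min(int(cx * grid_size), grid_size - 1)
        row = min(int(cy * grid_size), grid_size - 1)
        cell = labels[row][col]
        cell[0] = 1.0                      # P.Object
        cell[1] = cx * grid_size - col     # centroid x within the cell
        cell[2] = cy * grid_size - row     # centroid y within the cell
        cell[3] = w                        # width as a fraction of the image
        cell[4] = h                        # height as a fraction of the image
        cell[5 + class_index] = 1.0        # one-hot class
    return labels


classes = ["Car", "Light", "Pedestrian"]
# One hypothetical car centred at (0.625, 0.375) of the image.
objects = [(classes.index("Car"), 0.625, 0.375, 0.25, 0.125)]
labels = encode_yolo_labels(objects, grid_size=4, num_classes=len(classes))

total_length = sum(len(cell) for row in labels for cell in row)
print(total_length)   # 128, i.e. 4*4*(3+5)
print(labels[1][2])   # [1.0, 0.5, 0.5, 0.25, 0.125, 1.0, 0.0, 0.0]
```

Only the cell at row 1, column 2 is marked as containing an object, even if the car's box spills into neighbouring cells.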

Caveats of YOLO and their solutions:

A. Can’t detect multiple objects in the same grid cell. This issue can be reduced by choosing a smaller cell size (a finer grid). But even then, the algorithm can still fail in cases where objects are very close to each other, like an image of a flock of birds.

Solution: Anchor boxes. Instead of having 5+C labels for each grid cell (where C is the number of distinct object classes), the idea of anchor boxes is to have (5+C)*A labels for each grid cell, where A is the number of anchor boxes. If one object is assigned to one anchor box in a grid cell, another object can be assigned to the other anchor box of the same cell.
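A minimal sketch of the idea, assuming two hand-picked anchor shapes (one tall, one wide) and assuming each object goes to the anchor whose shape overlaps it most. That matching rule, the anchor sizes and the function names are illustrative assumptions, not details taken from the figure.

```python
def shape_iou(w1, h1, w2, h2):
    """Overlap ratio of two boxes that share the same centre."""
    inter = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - inter
    return inter / union


def best_anchor(box_w, box_h, anchors):
    """Index of the anchor whose shape is most similar to the box."""
    return max(range(len(anchors)),
               key=lambda i: shape_iou(box_w, box_h, *anchors[i]))


def encode_cell(objects, anchors, num_classes):
    """Label vector of one grid cell, of length (5 + C) * A."""
    slot_len = 5 + num_classes
    cell = [0.0] * (slot_len * len(anchors))
    for class_index, bx, by, bw, bh in objects:
        start = best_anchor(bw, bh, anchors) * slot_len
        if cell[start] == 1.0:
            continue  # slot already taken; this sketch drops the extra object
        cell[start:start + 5] = [1.0, bx, by, bw, bh]
        cell[start + 5 + class_index] = 1.0
    return cell


anchors = [(0.2, 0.6), (0.6, 0.2)]          # (width, height): tall, then wide
objects = [(2, 0.5, 0.5, 0.15, 0.5),        # pedestrian (class 2), tall box
           (0, 0.5, 0.6, 0.5, 0.25)]        # car (class 0), wide box
cell = encode_cell(objects, anchors, num_classes=3)
print(len(cell))   # 16, i.e. (5+3)*2
print(cell[:8])    # pedestrian in the tall anchor's slot
print(cell[8:])    # car in the wide anchor's slot
```

Both objects share a grid cell here, yet each one gets its own slot in the label because their shapes match different anchors.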

Fig. 8. YOLO with anchor boxes

B. Possibility to detect one object multiple times.

Solution: Non-max suppression. Non-max suppression removes the lower-probability bounding boxes that overlap heavily with a higher-probability bounding box.
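A minimal sketch of non-max suppression, measuring overlap with intersection over union (IoU). The 0.5 threshold and the three example detections are assumptions chosen for the demo.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the best box, drop boxes that overlap it too much, repeat.

    detections: list of (score, box).
    """
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining
                     if iou(best[1], d[1]) < iou_threshold]
    return kept


detections = [
    (0.9, (10, 10, 50, 50)),      # confident box on the first object
    (0.6, (12, 12, 52, 52)),      # near-duplicate of the box above
    (0.7, (100, 100, 140, 140)),  # a separate object elsewhere
]
kept = non_max_suppression(detections)
print(kept)  # the 0.9 and 0.7 boxes survive; the 0.6 duplicate is removed
```

With several classes, this is usually run separately for each class so that overlapping objects of different classes do not suppress each other.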


As of today, there are multiple versions of pre-trained YOLO models available in different deep learning frameworks, including TensorFlow. The latest YOLO paper is “YOLO9000: Better, Faster, Stronger”. The model can detect over 9000 object classes. There are also a number of region-based CNN (R-CNN) algorithms built on region proposals, which I haven’t discussed. Detectron, the software system developed by Facebook AI, also implements a variant of R-CNN, Mask R-CNN.


  1. You Only Look Once: Unified, Real-Time Object Detection
  2. YOLO9000: Better, Faster, Stronger
  3. Convolutional Neural Networks by Andrew Ng (deeplearning.ai)

Source: Deep Learning on Medium