A Note on various Object Detection Algorithms

Original article was published by Himadri Sankar Chatterjee on Deep Learning on Medium


A Note on various Object Detection Algorithms

With the advancement in the world of technology, automation in every aspect of job has gained a lot of attention. Starting from chatbots to self-driving cars, this field of research and application is taking a huge leap. This post is on the application of deep neural network architectures in the field of object detection. I first thought of going into details on the various algorithms available, but that would turn out to be a huge resource sounding more like a boring lecture. This one will contain a general overview of some of the most applied algorithms, viz. Sliding Window, R-CNN and YOLO, and get you started with the basics. This is still going to be a long post, I am sorry.

Given an image, a Convolution Neural Network can classify it into different classes. But the real problem lies in detecting multiple objects in a single image. When there are more than one object in an image, the challenge lies in identifying each object and locating them in the image. Earlier, a CNN was being applied on closely cropped images of the objects or on images containing only one object. But certainly, to train the network for both object detection and localization, we need to train it on images with one or more than one objects and a clear boundary marking its position in the image. Let’s define the output from the network, i.e., what we want the network to predict. We define our output vector yₑₓₚ , from the network that is trained to detect N classes of objects, as:
Yₑₓₚ:
[Pᵢ,
L ₓ, / B
L ᵧ, / B
L ₕ, / B ₕ
L ₗ, / B ₗ
C₁,
C₂,
C₃,
…,
…,
Cₙ]
Pᵢ → A probability value that defines whether the given image has any object that belongs to at least one of the N classes of image. Thus, it is 1, if there is an object in the image or 0, if the image is just a scenery, or one with no particular object of interest.
Lₓ / Bₓ → Denotes the x coordinate of the center of the bounding box around the object location.
Lᵧ / Bᵧ → Denotes the y coordinate of the center of the bounding box around the object location.
Lₕ / Bₕ → Denotes the height of the bounding box.
Lₗ / Bₗ → Denotes the breadth of the bounding box.
C₁, C₂, …., Cₙ → They represent the different classes of object, the network is being trained to recognize. So, if there is a car in the image and C₂ is used to denote a car then the value of C₂ shall be 1, while the other parameters assumes the value 0.

An example of what the values of the output denote in a given image. Note: bₗ is denoted by b𝓌 here.

Now, to blindly train the neural network on a set of well-labelled images won’t produce satisfactory results. We need a smarter implementation of algorithms that does some efficient operation on the image and returns better results. First, the Sliding Window Detection Algorithm.

SLIDING WINDOW DETECTION ALGORITHM:

The Sliding Window Detection Algorithm is carried out as follows:
First, we train a ConvNet to classify among classes of objects, from closely cropped images of those objects. So, if the network is trying to identify pictures of humans, we are going to train it on images that contain only a person. The extra background of each of the images is removed. So, with given dataset, the output will be 1 if it is the image of a person and 0 otherwise.

How the detection window slides through the input matrix.

Next comes the job of detecting the object. For this we choose a square window of some particular size (, smaller in size compared to the size of the image). We now place this window on the input image and crop out that portion of the image which is under focus, by the window. We feed this cropped image into the ConvNet for the task of classification. Now, this window is first placed on the top left corner of the image and slid to the other side with some chosen stride(or jumps), then again downwards following the same stride. Thus the name Sliding Window Detection.

An example of Sliding Window over the complete image.

Each time, it returns a particular portion of the image to the ConvNet. The ConvNet classifies that portion of the image into the various classes (,if an object is present).
One downfall of the algorithm is that often the size of the object in the image might be larger than the size of the window chosen. That would return a negative result, even though the object might be present. One remedy is to use a few window of different sizes. We can also resize our image by some scale every time to ensure that the sliding window detects the object at least once. Although this might return correct results, but they are computationally unfavorable. The other option at hand is to implement Sliding Window Detection algorithm convolutionally. Here is the link to the video that gives a great explanation of this implementation.
Implementing Sliding Window Detection using Convolution

R-CNN or Region based CNN:

Next comes R-CNN or Region Based Convolution Neural Network. It is based on the concept of dividing the image into regions. We already have a ConvNet trained to recognize various types of objects. For the purpose of detecting the objects in the image, the following steps are performed by the algorithm.
It basically has to propose nearly 2k category-independent regions of interest based on selective search. The idea is that these regions might contain various objects of different sizes. This task of region proposal is achieved by Segmentation.

An example of Image Segmentation

The proposed regions of the image are then converted into a format acceptable by the previously trained ConvNet and fed into it. The ConvNet then performs the task of classification on these focused portions of the image.
Here are the steps applied in R-CNN:

This is the basic concept. Apart from that, this network also features a SVM (Support Vector Machine)that can perform image classification based on the features extracted by the ConvNet. It also applies a Linear Regression model to fine tune the region bounding box. Quite a lot of stuff, right? You do not need to focus much on them if you don’t need to dive deep into their implementation. As Andrew Ng said, “Don’t worry about it if you don’t understand.”

In spite of all these steps, it was still slow in real-time object detection from video footages. To enhance its performance there have been various updates to the version of this algorithm. Starting from Fast R-CNN to Faster R-CNN, and now Mask R-CNN.
The Fast R-CNN is based on sharing of computation. In other words, like we have done in the convolutional implementation of the sliding window, similarly, here we pass on all the information to the CNN for classification and for bounding box regression. This decreases the total computation cost that was earlier incurred in training over each region separately. Once you understand the convolutional implementation of the Sliding Window Detection Algorithm, you can easily determine the approach being taken here.
The Faster R-CNN was an intuitive modification of integrating the process of region proposal into the ConvNet. While the Mask R-CNN is an extended implementation of the Faster R-CNN to pixel-level image segmentation.
Here are the links to those papers, in case you need to know more.
R-CNN
Fast R-CNN
Faster R-CNN
Mask R-CNN

YOLO OBJECT DETECTION ALGORITHM (Finally):

YOLO or You Only Look Once algorithm takes a completely different approach from the others. While other algorithms perform classification and localization for each of the proposed regions in the image. YOLO both of the tasks in a single pass of the image through the network. What it does is, it divides the image into certain no. of regions and predicts the result for each of the region if the center of some object lies in that region. First let us see what it should output.
Bounding Box: We expect the network to predict a bounding box around the object in the image. We define the output as before:

Yₑₓₚ:
[Cₛ, Pᵢ,
L ₓ, / B
L ᵧ, / B
L ₕ, / B ₕ
L ₗ, / B ₗ
C₁,
C₂,
C₃,
…,
…,
Cₙ]
Cₛ → Known as the Confidence Score, it is given by the product of the probability (Pᵢ, of the presence of the object) with IoU (result, expected). I shall define IoU later.
Pᵢ → It defines whether the given image has any object that belongs to at least one of the N classes of image.
Lₓ / Bₓ → Denotes the x coordinate of the center of the bounding box around the object location.
Lᵧ / Bᵧ → Denotes the y coordinate of the center of the bounding box around the object location.
Lₕ / Bₕ → Denotes the height of the bounding box.
Lₗ / Bₗ → Denotes the breadth of the bounding box.
C₁, C₂, …., Cₙ → They represent the different classes of object, the network is being trained to recognize.
Here is just an example of what a trained network might be able to return for the given image after dividing it into 3 X 3 grids. This does not show the Confidence Score yet.

It can be easily understood from the image, how the output varies for each grid cell. The output vales for the bounding boxes are an approximation.
Here is another example. Suppose the network is trained to detect various street signs, applicable to autonomous driving application. The red color box denotes the bounding box (Bₒ) predicted by the network and the green color denotes the expected bounding box (Bₑ).

So, what does the algorithm does?
Firstly, as usual, we have a CNN (Convolutional Neural Network) trained to classify objects into different classes.
Next, it divides the image into S x S grid cells or regions. It trains the network to return a result for each grid cell, if it finds that the center of some object lies in that grid cell. It can be explained by the following images:

Here we have the given image with three objects. The image is divided into certain no. of cells and the network predicts bounding boxes for the cells that might contain the center of a object. Notice that the number of predicted bounding box is very large. How do we choose the correct one? Lets clear the idea on a few topics as we move on:

IoU (Intersection Over Union):

So, during training we have two values for the output: First, the truth value or the expected output. Second, the result we get after passing the image through the network
So, the expected output defines some bounding box (Bₑ) for a object, while the network predicts some other bounding box (Bₒ) for the same object.
IoU is defined as the ratio of the area covered by the intersection of Bₑ and Bₒ by the area covers by the union of Bₑ and Bₒ.

Thus, for an image we might get a lot of bounding boxes predicted by the network. We choose only the results that has the Confidence Score over a certain threshold value. Generally, this threshold value is taken to be 0.6 and above, but it depends on the developers discretion.
Here is an example of its application in detecting cars. Notice how the predicted and the expected bounding boxes almost overlap. Thus, it has a high IoU.

Non-Max Suppression:
During application of the network, while performing the detection for objects in the image, the network might return many bounding boxes for an object with confidence score greater than the threshold value. Which one should we choose.
We consider the bounding box with the maximum IoU. Out of all the results we suppress the boxes with less IoU and consider the one with the highest IoU. Thus, it is known as Non-Max Suppression.

In the above image, say the network predicted three bounding boxes for the same image with IoU surpassing the threshold value (say, 0.75). Non-Max Suppression ensures that the bounding box with the highest IoU (here 0.9) is chosen.

Anchor Boxes:
Another important concept is that of the Anchor Boxes. Suppose we have two differently shaped objects with their center lying in the same grid cell. In that case the network won’t predict correctly for one of the objects. Here we use Anchor Boxes. This is just a concept representing the possible shape or configuration of various objects.

The image above aptly shows how anchor boxes are useful when the two objects share the same cell for their center. The output for the cell containing the centers of both the objects will be:
Yₑₓₚ:

[0.82,
1,

Lₓ¹, / Bₓ¹
Lᵧ¹, / Bᵧ¹
Lₕ¹, / Bₕ¹
Lₗ¹, / Bₗ¹

0,
1,
0,
0.75,
1,
Lₓ², / Bₓ²
Lᵧ², / Bᵧ²
Lₕ², / Bₕ²
Lₗ², / Bₗ²

1,
0,
0]
Thus, suppose we have the CNN trained to detect three classes of objects, viz. pedestrians, cars and street signs. Let C₁ denotes pedestrians, C₂ denotes cars and C₃ denotes street signs. There is some Confidence Score for both of the bounding boxes B1(here, 0.82) and B2(here, 0.75). The Probability of the existence of both a car and a pedestrian is 1. The boundaries of their bounding boxes are defined by the parameters Lₓ¹/Lₓ², Lᵧ¹/Lᵧ², Lₕ¹/Lₕ², Lₗ¹/Lₗ². Similarly, for bounding box B1, we expect it to denote a car, so C₂ is 1 and the others are 0. While for the other bounding box B2 we have C₁ is 1 and the others are 0.
Next, we train our network based on the above-mentioned algorithm and obtain the results. Here is how a standard network implementation of YOLO shall look like. Notice that the output is a 7 x 7 x 30 vector. Thus the image was divided into 7 x 7 = 49 regions and the output vector for each grid cell contains 30 parameters for predictions of various objects.

How do we define the loss function during training? This is quite difficult and varies differently with the implementation. However this image provides a general concept about the function. (Read at your own risk)

Originally YOLO was trained on PASCAL VOC dataset, that could classify between 20 classes of objects. This has been proved to be useful for faster object detection. It has been successfully applied in real-time object detection in video footage. Over time, there have been a lot of up-gradation to the algorithm that led to YOLOv2 or YOLO9000 and similarly YOLO v3. Here are the links to the corresponding papers, in case you are interested.
YOLO
YOLO v2
YOLO v3
And here is a table comparing the performance of various algorithms on a given dataset. I have decided to cover YOLO extensively in a different blogpost.

Hope you found the information useful. In case of any suggestion, feel free to express in the comment. I am still in the stage of learning these topics and any suggestion for the improvement of the article is welcome.

Live Long and Code.