DeepLearning series: Object detection and localization - YOLO algorithm, R-CNN

In the previous blog I explained the theory behind Convolutional Neural Networks and how they work for a classification task. Here I will go a step further and touch on techniques used for object detection and localization, such as the YOLO algorithm and Region-based Convolutional Neural Networks (R-CNN).


With object localization the network identifies where the object is, putting a bounding box around it.

This is what is called “classification with localization”. Later on, we’ll see the “detection” problem, which takes care of detecting and localizing multiple objects within the image.

But first things first.

For an object localization problem, we start off using the same network we saw in image classification. So, we have an image as an input, which goes through a ConvNet that results in a vector of features fed to a softmax to classify the object (for example with 4 classes: pedestrian, car, bike, background). Now, if we want to localize those objects in the image as well, we change the neural network to have a few more output units that describe a bounding box. In particular, we add four more numbers, which identify the x and y coordinates of the center of the box and its height and width (bx, by, bh, bw).

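The neural network will now output these four numbers, plus pc, the probability that an object is present, and one indicator per object class (c1, c2, c3; the background case is covered by pc = 0). Therefore, the target label will be:

y = [pc, bx, by, bh, bw, c1, c2, c3]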

Here pc is the confidence that there is an object in the image; it answers the question “is there an object?” If there is an object, c1, c2, c3 tell us whether it belongs to class 1, 2 or 3, that is, which object it is. Finally, bx, by, bh, bw describe the bounding box around the detected object.

For example, if an image has a car (class 2), the target label will be:

y = [1, bx, by, bh, bw, 0, 1, 0]

In case there is no object in the image, the target label is simply:

y = [0, ?, ?, ?, ?, ?, ?, ?]

The question marks stand for “don't care” positions that carry no meaning in this case. The network will still output some numbers there, but they are ignored when computing the loss.
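To make the label layout concrete, here is a minimal Python sketch. The class order (pedestrian, car, bike), the helper names, and the squared-error loss are my own assumptions for illustration; the article only fixes which quantities the label contains.

```python
CLASSES = ["pedestrian", "car", "bike"]  # assumed order for c1, c2, c3


def make_target(box=None, class_name=None):
    """Build y = [pc, bx, by, bh, bw, c1, c2, c3].

    box is (bx, by, bh, bw); None means "no object in the image".
    """
    if box is None:
        # pc = 0; the other seven entries are "don't care" (the question marks).
        return [0] + [None] * 7
    one_hot = [1 if name == class_name else 0 for name in CLASSES]
    return [1, *box] + one_hot


def squared_error_loss(y_true, y_pred):
    """Assumed loss: squared error that skips the "don't care" entries."""
    if y_true[0] == 0:
        # No object: only the pc prediction is penalized.
        return (y_pred[0] - y_true[0]) ** 2
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true))


car = make_target(box=(0.5, 0.7, 0.3, 0.4), class_name="car")
empty = make_target()
```

With no object in the image, whatever the network puts in the last seven positions does not change the loss, which is what the question marks express.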

This technique is also used for “Landmarks detection”. In this case, the output will be even bigger since we ask the network to output the x and y coordinates of important points within an image. For example, think about an application for detecting key landmarks of a face. In this situation, we could identify points along the face that denote, for example, the corners of the eyes, the mouth, etc.


Object detection can be performed using a technique called “sliding window detection”. We train a ConvNet to detect objects within an image and use windows of different sizes that we slide on top of it. For each window, we perform a prediction.

This method gives pretty good results (I have used it for a project related to self-driving cars, and it achieved great outcomes).

The big downside is the computational cost, which is very high since we may have to evaluate a lot of windows. The solution is to compute the sliding window detection convolutionally.

Instead of sliding a small squeegee to clean a window, we now have a big one that fits the entire window and magically cleans it completely without any movement.

Let’s check this out!

The first step to build up towards the convolutional implementation of sliding windows is to turn the Fully Connected layers in a neural network into convolutional layers. See example below:

Great, now to simplify the representation, let’s re-sketch the final network in 2D:

If our test image is of dimension 16x16x3 and we performed the “regular” sliding window with a stride of 2, we would have to crop 4 different windows of size 14x14x3 out of the original test image and run each one through the ConvNet.

This is computationally expensive and a lot of this computation is duplicated. We would like, instead, these four passes to share computation.

So, with the convolutional implementation of sliding windows we run the ConvNet, with the same parameters and same filters on the test image and this is what we get:

Each of the 4 cells of the 2x2x4 output is essentially the result of running the ConvNet on a 14x14x3 region at one of the four positions in the initial 16x16x3 image.
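Here is a toy check of that claim in plain Python. It is only a sketch under my own assumptions: a single-channel 8x8 image, a 6x6 window, one 3x3 convolution with ReLU, a “fully connected” layer written as a 4x4 filter, and a stride of 1. It is not the 14x14x3 network from the example, but the mechanism is the same: one pass over the whole image gives the same scores as cropping every window and running the network on each crop.

```python
import random


def conv2d(x, kernel, stride=1):
    """Valid 2D cross-correlation of a square map with a square kernel."""
    n, k = len(x), len(kernel)
    out_size = (n - k) // stride + 1
    return [[sum(x[i * stride + a][j * stride + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(out_size)]
            for i in range(out_size)]


def relu(fmap):
    return [[max(0.0, v) for v in row] for row in fmap]


def tiny_net(x, conv_kernel, fc_kernel):
    """Conv + ReLU, then the fully connected layer written as a convolution."""
    hidden = relu(conv2d(x, conv_kernel))
    return conv2d(hidden, fc_kernel)


def crop(x, top, left, size):
    return [row[left:left + size] for row in x[top:top + size]]


rng = random.Random(0)


def rand_map(n):
    return [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]


WINDOW = 6
conv_kernel = rand_map(3)  # a 6x6 window becomes a 4x4 feature map
fc_kernel = rand_map(4)    # FC layer over that 4x4 map == one 4x4 filter
image = rand_map(8)        # test image larger than the window

# Convolutional implementation: one pass, a 3x3 grid of scores.
full_pass = tiny_net(image, conv_kernel, fc_kernel)

# Regular sliding window: crop each 6x6 window and run the net on it.
per_window = [[tiny_net(crop(image, i, j, WINDOW), conv_kernel, fc_kernel)[0][0]
               for j in range(3)]
              for i in range(3)]

max_diff = max(abs(full_pass[i][j] - per_window[i][j])
               for i in range(3) for j in range(3))
```

The saving comes from the first layer: in the single pass its feature map is computed once and reused by all windows, while the crop-by-crop version recomputes the overlapping parts again and again.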

You might be wondering if this works on other examples too, and it does.

Think about an input image of 28x28x3. Going through the network, we arrive at the final output of 8x8x4. In this one, each of the 64 cells (8 per dimension) corresponds to running the 14x14x3 window at one of 8x8 positions, with a stride of 2 in the original image.
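The number of window positions in both examples follows from simple arithmetic. A small sketch (the function name is my own):

```python
def windows_per_side(image_size, window_size, stride):
    """Number of positions a window can take along one dimension."""
    if window_size > image_size:
        raise ValueError("window is larger than the image")
    return (image_size - window_size) // stride + 1


small = windows_per_side(16, 14, 2)  # 16x16 image, 14x14 window, stride 2
large = windows_per_side(28, 14, 2)  # 28x28 image, 14x14 window, stride 2
total_small = small * small
total_large = large * large
```

For the 16x16 image this gives 2 positions per side (4 windows in total), and for the 28x28 image 8 per side (64 windows), matching the 2x2 and 8x8 outputs.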

One of the weaknesses of this implementation is that the position of the bounding box we get around the detected object is not very accurate, since the windows only sit at fixed positions and have a fixed shape.

We will soon see that the YOLO algorithm is the solution to that.

_ _ _ _ _ _ _ _

YOLO ALGORITHM: (You-Only-Look-Once)

Bounding box prediction

We start by placing a grid on top of the input image. Then, for each of the grid cells, we run the classification and localization algorithm we saw at the beginning of the blog. The labels for training, for each grid cell, will be similar to what we saw earlier, with an 8-dimensional output vector: y = [pc, bx, by, bh, bw, c1, c2, c3]

For each cell, we will get a result whether there is an object or not. For example:

An object is “assigned” to the cell that contains its center.

If we have a 3×3 grid, then the target output volume will have a dimension of 3x3x8 (where 8 is the number of labels in y). So, in this case, we will run the input image through a ConvNet to map to an output of 3x3x8 volume.

So we have a convolutional implementation for the whole grid at once (not 9 separate runs, one per cell), as we saw earlier. We therefore combine the classification with localization algorithm with the convolutional implementation.

The advantage of this algorithm is that it outputs precise positions of bounding boxes, as the values bx, by, bh, bw are computed relative to the cell. So, the finer the grid, the more precision we can obtain, and the lower the chance of having multiple objects within a cell.
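As an illustration, this is how a 3x3x8 training target could be built. It is a sketch under my own assumptions: box centers and sizes are given as fractions of the image, bx and by are measured from the top-left corner of the cell, and bh and bw are expressed in units of the cell size (so they can be larger than 1).

```python
def encode_yolo_target(objects, grid_size=3, num_classes=3):
    """objects: list of (cx, cy, w, h, class_id) in image fractions (0..1).

    Returns a grid_size x grid_size x (5 + num_classes) nested list where
    each cell holds [pc, bx, by, bh, bw, c1, ..., cN].
    """
    depth = 5 + num_classes
    target = [[[0.0] * depth for _ in range(grid_size)]
              for _ in range(grid_size)]
    for cx, cy, w, h, class_id in objects:
        # The cell that contains the center of the object is responsible.
        col = min(int(cx * grid_size), grid_size - 1)
        row = min(int(cy * grid_size), grid_size - 1)
        cell = [0.0] * depth
        cell[0] = 1.0                      # pc
        cell[1] = cx * grid_size - col     # bx, relative to the cell
        cell[2] = cy * grid_size - row     # by, relative to the cell
        cell[3] = h * grid_size            # bh, in units of cell height
        cell[4] = w * grid_size            # bw, in units of cell width
        cell[5 + class_id] = 1.0           # one-hot class
        # Without anchor boxes a cell holds one object: a later one overwrites.
        target[row][col] = cell
    return target


# One car (class index 1) centered in the middle of the image.
target = encode_yolo_target([(0.5, 0.5, 0.3, 0.6, 1)])
center_cell = target[1][1]
```

Every other cell keeps pc = 0, so the network learns that there is nothing to report there.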

Intersection over Union (IoU)

This is a way of measuring if the object detection algorithm is working well.

It computes the area of the intersection of the detected bounding box and the correct one, divided by the area of their union.


We pick a threshold and consider a detection correct if its IoU is at or above that value (e.g. IoU >= 0.5).

Clearly, the higher the IoU value, the more accurate results we have.
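A minimal IoU function could look like the sketch below. I am assuming boxes are given by their corners as (x1, y1, x2, y2), which is a convenient format for this computation and not the (bx, by, bh, bw) encoding used in the labels.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - intersection)
    return intersection / union if union > 0 else 0.0


same = iou((0, 0, 2, 2), (0, 0, 2, 2))
partial = iou((0, 0, 2, 2), (1, 1, 3, 3))
disjoint = iou((0, 0, 1, 1), (2, 2, 3, 3))
```

Identical boxes give 1, boxes that do not touch give 0, and the two 2x2 boxes shifted by one unit share an area of 1 out of a union of 7.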

Non-max suppression

This technique makes sure that our YOLO algorithm reports each object only once.

In fact, YOLO could detect an object multiple times, since it’s possible that many grid cells detect the object. To avoid that, we take the following steps:

First, we assign a probability to each detection, then we take the box with the largest probability. We then look at the boxes that overlap the most with this box and remove the ones that have a high IoU with it (a big area of intersection), since they most likely detect the same object. Repeating this on the boxes that are left gives the final detections.

Step by step: each prediction comes with a value pc, the probability of the detection. We first discard all the boxes with a low pc, for example pc <= 0.6.

Then, while there are any remaining boxes:

  • pick the box with the largest pc. Output that as prediction.
  • discard any remaining box with IoU>=0.5 with respect to the box output in the previous step.

If we have multiple classes (objects), then we implement non-max suppression independently for each one.
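The steps above can be sketched in a few lines of Python. The data layout is my own assumption: each detection is a (pc, box) pair with the box given by its corners (x1, y1, x2, y2), and the thresholds are the example values from the text.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(detections, score_threshold=0.6, iou_threshold=0.5):
    """detections: list of (pc, box). Returns the boxes that survive."""
    # Discard low-confidence boxes (pc <= threshold), sort the rest by pc.
    remaining = sorted((d for d in detections if d[0] > score_threshold),
                       key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # box with the largest pc
        kept.append(best)                # output it as a prediction
        # Drop every box that overlaps too much with the one just output.
        remaining = [d for d in remaining
                     if iou(best[1], d[1]) < iou_threshold]
    return kept


def nms_per_class(detections_by_class, **thresholds):
    """Run non-max suppression independently for each class."""
    return {name: non_max_suppression(dets, **thresholds)
            for name, dets in detections_by_class.items()}


detections = [
    (0.9, (0, 0, 2, 2)),
    (0.8, (0.1, 0.1, 2.1, 2.1)),   # near-duplicate of the first box
    (0.7, (5, 5, 7, 7)),           # a different object
    (0.5, (5, 5, 7, 7)),           # too uncertain, dropped up front
]
kept = non_max_suppression(detections)
```

In this toy input the second box is suppressed by the first one, the fourth is removed by the pc threshold, and two detections remain.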

Anchor boxes

One of the problems with object detection as we have seen it so far is that each grid cell can only detect one object. If we have multiple objects in the same cell, the techniques we have used so far won’t help to tell them apart. Anchor boxes will help us overcome this issue.

The idea here is to predefine a few box shapes (called anchor boxes) and associate a prediction with each one of them. Our output label will now contain 8 dimensions for each of the anchor boxes we predefined.

If we choose two anchor boxes, then the target label will be:

y = [pc, bx, by, bh, bw, c1, c2, c3, pc, bx, by, bh, bw, c1, c2, c3]

Previously, each object in the training image was assigned to the grid cell that contained that object’s midpoint (for a 3×3 grid, the output was 3x3x8). Now, each object in the training image is assigned to the grid cell that contains that object’s midpoint and, within that cell, to the anchor box that has the highest IoU with the object’s shape (for a 3×3 grid and 2 anchor boxes, the output is 3x3x16).

What this approach cannot handle well is the case where two objects in the same cell match the same anchor box. Also note that the shapes of the anchor boxes are ours to choose, and we can refine them to fit the objects we expect.
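Picking the anchor box for an object can be sketched as follows. This assumes, for illustration only, that anchors and objects are compared by shape alone, as (width, height) pairs placed on the same center; the anchor shapes below are made up.

```python
def shape_iou(shape_a, shape_b):
    """IoU of two (width, height) boxes that share the same center."""
    (wa, ha), (wb, hb) = shape_a, shape_b
    intersection = min(wa, wb) * min(ha, hb)
    return intersection / (wa * ha + wb * hb - intersection)


def best_anchor(object_shape, anchors):
    """Index of the anchor box with the highest IoU with the object."""
    return max(range(len(anchors)),
               key=lambda k: shape_iou(object_shape, anchors[k]))


# Hypothetical anchors: one tall (pedestrian-like), one wide (car-like).
ANCHORS = [(0.3, 0.9), (0.9, 0.3)]
LABEL_SIZE = 8  # pc, bx, by, bh, bw, c1, c2, c3

pedestrian_anchor = best_anchor((0.25, 0.8), ANCHORS)
car_anchor = best_anchor((0.8, 0.35), ANCHORS)

# Anchor k fills entries [k * 8, k * 8 + 8) of the 16-dimensional label.
car_slot = (car_anchor * LABEL_SIZE, (car_anchor + 1) * LABEL_SIZE)
```

A pedestrian and a car whose midpoints fall in the same cell end up in different halves of the 16-dimensional label, so both can be detected.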

Putting it all together for YOLO

Quick tips when implementing the YOLO algorithm:

  • Decide the grid size and the number of anchor boxes (these two choices drive the dimension of the output volume y).
  • Train the ConvNet on the training images.
  • Run non-max suppression on the predictions.
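For the first tip, the dimension of the output volume follows directly from the grid size, the number of anchor boxes and the number of classes. A tiny sketch (the function name is mine):

```python
def yolo_output_shape(grid_size, num_anchors, num_classes):
    """Shape of y: one (pc, bx, by, bh, bw, classes...) block per anchor."""
    values_per_anchor = 5 + num_classes
    return (grid_size, grid_size, num_anchors * values_per_anchor)


no_anchor_shape = yolo_output_shape(grid_size=3, num_anchors=1, num_classes=3)
two_anchor_shape = yolo_output_shape(grid_size=3, num_anchors=2, num_classes=3)
```

These are the 3x3x8 and 3x3x16 volumes we saw in the sections above.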



The R-CNN algorithm tries to pick just a few regions of the image on which it makes sense to run the classifier, since running the ConvNet classifier on regions that contain no objects is wasted work.

So first we need a way to find out where objects might be. We can do so by running a segmentation algorithm, which identifies blobs around objects. Then, we place a bounding box around each blob and run the classifier for each of these bounding boxes. It is a pretty slow algorithm, as it proposes some regions and classifies them one at a time.

To speed it up, the “fast R-CNN” algorithm was proposed. For this one, we still have the first step, which proposes the regions, but then it uses the convolutional implementation of sliding windows to classify all the proposed regions.

Well, the first step is still a bit annoyingly slow, right?

Why not a “faster R-CNN”? Yes, it exists.

This one uses a convolutional network, instead of the segmentation algorithm, to propose the regions.

Uh, what a journey!

This blog is based on Andrew Ng’s lectures at

DeepLearning series: Object detection and localization - YOLO algorithm, R-CNN was originally published in Machine Learning bites on Medium.

Source: Deep Learning on Medium