A guide to the object detection exercise using YOLO model

Original article was published on Deep Learning on Medium

Object detection is an emerging technique in the field of Computer Vision that enables us to detect and recognize objects in an image or video. Object detection can be used to count objects in a scene and track their precise locations using localization method.


How does deep learning help build a robust Computer Vision framework?

As with other computer vision tasks, deep learning has proven to be a powerful approach to object detection. In this article, I will demonstrate the combination of image classification and localization, which not only detects the class of an object but also its corresponding location within a given image.

Let’s dive deep into the model architecture before proceeding to the detection phase.

What is the YOLO model?

“You Only Look Once” (YOLO) is a popular algorithm because it achieves high accuracy while also being able to run in real time. This algorithm “only looks once” at the image in the sense that it requires only one forward propagation pass through the network to make predictions. After non-max suppression, it then outputs the recognized objects together with their bounding boxes.

Source: https://arxiv.org/pdf/1506.02640v5.pdf

In the figure above, the detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the feature space from the preceding layers. The convolutional layers are pretrained on the ImageNet classification task at half the resolution (224 × 224 input images), and the resolution is then doubled for detection.

Inputs and outputs

  • The input is a batch of images of shape (m, 608, 608, 3)
  • The output is a list of bounding boxes along with the recognized classes. Each bounding box is represented by 6 numbers (pc, bx, by, bh, bw, c). If you expand c into an 80-dimensional vector, each bounding box is then represented by 85 numbers. The variable c represents the class the model detects (e.g. car, truck, traffic light).

Since the model uses 5 anchor boxes, each of the 19 × 19 cells thus encodes information about 5 boxes. Anchor boxes are defined only by their width and height.

For simplicity, the model flattens the last two dimensions of the (19, 19, 5, 85) encoding, so the output of the deep CNN is (19, 19, 425).
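This reshaping can be sketched with NumPy; the array below is a zero-filled placeholder standing in for real network activations, and the variable names are illustrative rather than taken from any particular YOLO implementation:

```python
import numpy as np

m = 1                      # batch size
grid, anchors, classes = 19, 5, 80
box_params = 5             # (pc, bx, by, bh, bw)

# One (pc, bx, by, bh, bw) plus 80 class scores per anchor box, per cell
encoding = np.zeros((m, grid, grid, anchors, box_params + classes))
print(encoding.shape)      # (1, 19, 19, 5, 85)

# Flatten the last two dimensions, as described above
flattened = encoding.reshape(m, grid, grid, anchors * (box_params + classes))
print(flattened.shape)     # (1, 19, 19, 425)
```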

Non-Max suppression

In the figure above, bounding boxes are plotted for the objects to which the model assigned a high probability, but this is still too many boxes. You’d like to reduce the algorithm’s output to a much smaller number of detected objects.

To do so, you’ll use non-max suppression. Specifically, you’ll carry out these steps:

  • Get rid of boxes with a low score (meaning, the box is not very confident about detecting a class; either due to a low probability of any object being present, or a low probability of this particular class).
  • Select only one box when several boxes overlap with each other and detect the same object.

To remove those overlapping boxes, the model uses an evaluation metric called Intersection over Union (IoU), which measures the accuracy of an object detector on a particular dataset.

IoU: This metric is the ratio of the area of intersection to the area of union between the ground-truth box and the predicted box.

IoU calculation

In the figure above, B1 and B2 are the predicted and ground-truth boxes used to calculate the accuracy metric.
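A minimal IoU calculation might look like this, assuming each box is given by its (x1, y1, x2, y2) corner coordinates:

```python
def iou(box1, box2):
    """Intersection over Union for boxes given as (x1, y1, x2, y2) corners."""
    # Coordinates of the intersection rectangle
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)

    # Union area = area1 + area2 - intersection area
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.1429
```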

In a nutshell, the YOLO pipeline looks like this:

  • Input image (608, 608, 3)
  • The input image goes through a CNN, resulting in a (19,19,5,85) dimensional output.
  • After flattening the last two dimensions, the output is a volume of shape (19, 19, 425):
  • Each cell in a 19×19 grid over the input image gives 425 numbers.
  • 425 = 5 × 85 because each cell contains predictions for 5 boxes, corresponding to 5 anchor boxes.
  • 85 = 5 + 80, where 5 refers to (pc, bx, by, bh, bw) and 80 is the number of classes we’d like to detect
  • You then select only a few boxes based on:
  • Score-thresholding: throw away boxes that have detected a class with a score less than the threshold
  • Non-max suppression: Compute the Intersection over Union and avoid selecting overlapping boxes
  • This gives you YOLO’s final output.
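The score-thresholding and non-max suppression steps above can be sketched as follows. This is a simplified single-class version with hypothetical names; a real YOLO implementation applies this per class and uses batched tensor operations:

```python
def iou(b1, b2):
    # b = (x1, y1, x2, y2); intersection area over union area
    xi1, yi1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    xi2, yi2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter)

def filter_boxes(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Score-thresholding followed by greedy non-max suppression."""
    # 1. Score-thresholding: drop boxes below the confidence threshold,
    #    then sort the survivors by descending score
    candidates = sorted(
        ((s, b) for s, b in zip(scores, boxes) if s >= score_threshold),
        key=lambda sb: sb[0], reverse=True)

    # 2. Non-max suppression: keep a box only if it does not overlap
    #    an already-kept box by more than the IoU threshold
    kept = []
    for score, box in candidates:
        if all(iou(box, kb) < iou_threshold for _, kb in kept):
            kept.append((score, box))
    return kept

boxes = [(0, 0, 2, 2), (0.1, 0.1, 2, 2), (3, 3, 5, 5)]
scores = [0.9, 0.8, 0.7]
print(filter_boxes(boxes, scores))  # keeps the 0.9 and 0.7 boxes
```

The 0.8 box is suppressed because it overlaps the higher-scoring 0.9 box almost entirely, while the 0.7 box survives because it covers a different region.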


Detection of multiple object classes: