YOLO V1- An Intuitive Guide

Original article was published on Deep Learning on Medium


Real-time object detection finds use in a number of applications like security, surveillance, and video analytics. The YOLO v1 algorithm, proposed in 2016, took the world by storm as one of the fastest and most efficient algorithms for real-time object detection. The following blog post provides an insight into the algorithm and aims to give you an intuitive understanding of it before you dive into all the math behind it.
The post assumes that you have an understanding of how neural networks and convolutional neural networks function.

The Problem

The problem of object detection consists of two parts:

  • Object Localization: locating a particular object in the image/video.
  • Object Recognition: assigning a label to the object located in the image/video.

Previous Work and its Limitations

Deformable Parts Model (DPM)

This model involved a complex pipeline consisting of separate modules for feature extraction, region classification and bounding-box prediction. As can be noted, the model had too many modules and complexities. Also, the image was divided into fixed-size windows and the feature extractor was slid over all the windows to extract features from the entire image, which was in turn computationally expensive and not adequate for real-time object detection.

Region-based Convolutional Neural Network (R-CNN)

This model again involved a very complex pipeline. It relied on region proposals, i.e. regions most likely to contain an object: an algorithm called selective search was used to generate candidate bounding-box co-ordinates. A CNN was then used for feature extraction, and support vector machines were used on top to generate a confidence score. This system was very slow given its complexity.

Deep Multibox

This algorithm was pretty similar to R-CNN. The only difference was that instead of selective search, a CNN was trained to predict region proposals. The drawback of this model was that it could make a single-class prediction but could not detect general objects, i.e. different types of objects.


How YOLO Works

YOLO stands for You Only Look Once. It signifies that, unlike sliding a feature extractor over the image multiple times, the algorithm looks at the image only once to detect objects in it. YOLO involves dividing the image into an S x S grid.

If the center of an object falls into a grid cell, that particular cell is considered responsible for detecting that object. Each grid cell predicts B bounding boxes and a confidence score for each of them, signifying how strongly the algorithm believes that there is an object in the bounding box. Boxes with confidence scores over a particular threshold (0.25) are considered final. The confidence score has the formula:

Pr(Object) × IOU(truth, pred)

Here, Pr(Object) stands for the probability that an object exists in the predicted bounding box. IOU stands for Intersection Over Union. As the name suggests, it is the area of intersection between the actual bounding box and the predicted bounding box, divided by the area of the union of the actual and the predicted boxes.
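As a concrete illustration, here is a minimal IOU computation in Python (the function name and the corner-format box representation are my own choices for the sketch, not notation from the paper):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A box compared with itself has IOU 1.0
print(iou((0, 0, 10, 10), (0, 0, 10, 10)))   # 1.0
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143
```

A perfect prediction thus gets confidence Pr(Object) × 1, while a box that misses the object entirely scores 0 regardless of Pr(Object).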

Predicting a bounding box means predicting 5 values: x, y, w, h and the confidence score (described above). (x, y) are the co-ordinates of the center of the predicted box relative to its corresponding grid cell; w and h are the width and height of the box relative to the image. Finally, the algorithm also predicts conditional class probabilities, i.e. the labels of the objects predicted in each grid cell. These probabilities are conditional since they depend on an object actually being present in that grid cell.
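To make this encoding concrete, here is a small sketch (the function name and the exact normalisation conventions are illustrative assumptions) of how a ground-truth box, given as a center and size normalised to the image, is assigned to its responsible grid cell:

```python
def encode_target(cx, cy, w, h, S=7):
    """Encode a ground-truth box (center cx, cy and size w, h, all in
    [0, 1] relative to the image) into YOLO-style targets.
    Returns the responsible grid cell and the (x, y, w, h) values."""
    col = min(int(cx * S), S - 1)   # grid column containing the center
    row = min(int(cy * S), S - 1)   # grid row containing the center
    # x, y are offsets of the center within its cell, in [0, 1)
    x = cx * S - col
    y = cy * S - row
    # w, h stay relative to the whole image
    return (row, col), (x, y, w, h)

# An object centered in the middle of the image lands in cell (3, 3)
print(encode_target(0.5, 0.5, 0.2, 0.4))   # ((3, 3), (0.5, 0.5, 0.2, 0.4))
```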

Network Design

The YOLO CNN consists of 24 convolutional layers and 2 fully connected layers. The detailed structure can be seen in the original paper. It makes use of 1 x 1 convolutional layers to reduce the number of feature channels coming in.

The output is a tensor (a 3-D matrix) of dimensions 7 x 7 x 30. This is because the model is trained on the Pascal VOC dataset, which has 20 object classes to predict (C = 20). The image is divided into a 7 x 7 grid (S = 7). Each cell is responsible for 2 bounding boxes (B = 2). Each box predicts 5 values, and each cell additionally predicts C class probabilities. Hence the output size is
S x S x (B*5 + C), which on substitution becomes 7 x 7 x 30.
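The arithmetic is easy to check:

```python
S, B, C = 7, 2, 20           # grid size, boxes per cell, Pascal VOC classes
depth = B * 5 + C            # 5 values per box plus C class probabilities
print((S, S, depth))         # (7, 7, 30)
print(S * S * depth)         # 1470 values in the full output tensor
```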


The convolutional layers of the YOLO network were pre-trained on the ImageNet dataset for about a week. Since the ImageNet dataset has 1,000 classes of objects, the objective was to extract as many general features as possible.
The network was then made to perform detection and optimise a loss function designed for the task.

Rather than delving into the math of it: simply put, the loss function takes three losses into account:

  • Classification loss: the difference between the predicted class probabilities and the actual class probabilities.
  • Localization loss: the difference between the predicted bounding box values (x, y, w and h) and the actual bounding box values.
  • Confidence loss: the error in the confidence score.
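The three terms can be sketched in code. Below is a simplified, hypothetical version using NumPy with one box per cell; the λ weights of 5 and 0.5 are the values used in the paper, but everything else (array layout, function name) is an illustrative assumption:

```python
import numpy as np

# Weights from the paper: boost localization, damp confidence in empty cells
LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5

def yolo_loss(pred, target, obj_mask):
    """Simplified YOLO loss for one box per cell.
    pred/target: arrays of shape (S, S, 5 + C) holding x, y, w, h,
    confidence and class probabilities; obj_mask: (S, S) booleans
    marking cells that contain an object's center."""
    xy_err = np.sum(obj_mask[..., None] *
                    (pred[..., 0:2] - target[..., 0:2]) ** 2)
    # The paper compares square roots of w, h so that errors in small
    # boxes weigh more than the same errors in large boxes
    wh_err = np.sum(obj_mask[..., None] *
                    (np.sqrt(pred[..., 2:4]) - np.sqrt(target[..., 2:4])) ** 2)
    conf_err_obj = np.sum(obj_mask * (pred[..., 4] - target[..., 4]) ** 2)
    conf_err_noobj = np.sum(~obj_mask * (pred[..., 4] - target[..., 4]) ** 2)
    cls_err = np.sum(obj_mask[..., None] *
                     (pred[..., 5:] - target[..., 5:]) ** 2)
    return (LAMBDA_COORD * (xy_err + wh_err)
            + conf_err_obj + LAMBDA_NOOBJ * conf_err_noobj + cls_err)
```

In the real network the loss also selects, per cell, which of the B predicted boxes is "responsible" (the one with the highest IOU with the ground truth); that selection is omitted here for brevity.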

To delve deeper into the loss function I suggest going through this amazing post by Jonathan Hui.



Speed

Compared to previous detection systems, YOLO turned out to be very fast: rather than a sliding-window approach, it involves passing the image through the network only once. It was able to run at 45 frames per second on a Titan X GPU without any kind of batch processing.


Single Network

Unlike a lot of the other models, YOLO involves a single CNN. Hence no complex pipelines are required, which makes the model easy to understand and fast to execute. All tasks, like feature extraction, bounding box prediction and object classification, are taken care of by a single network.

Fewer Bounding Boxes

YOLO predicts fewer bounding boxes than other models. While R-CNN proposes around 2000 boxes per image, YOLO proposes only around 98.
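The count of 98 follows directly from the grid parameters above (S = 7, B = 2):

```python
S, B = 7, 2                 # grid size and boxes per cell from the paper
print(S * S * B)            # 98 bounding boxes per image
```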


Limitations

While YOLO v1 was highly successful, it had its own limitations. It struggled to detect small objects that appear in groups, and it did not generalize well to objects with unusual aspect ratios. Also, the loss treated errors in small and large boxes equally, even though a small error matters much more in a small box than in a large one. That being said, the YOLO algorithm has been through a number of revisions, the latest being YOLO v4. I hope the post was able to give you an understanding of how this algorithm functions. I encourage you to go through the paper and also try it out on your own system.

Link to the original paper: https://arxiv.org/abs/1506.02640

Link to the ImageAI API to try out YOLO on your own system: https://imageai.readthedocs.io/en/latest/detection/