The beginner’s guide to implementing YOLO (v3) in TensorFlow 2.0 (part-1)

Source: Deep Learning on Medium


Tutorial Overview

What is this post about?

Over the past few years, machine learning has seen dramatic progress in object detection. There are several object detection models, but in this post I want to talk specifically about one called “You Only Look Once”, or YOLO for short. It was introduced by Joseph Redmon, Santosh Divvala, Ross Girshick and Ali Farhadi (2015) and has gone through 3 versions so far. In this post, we’re going to focus on the latest one, YOLOv3. I’ll be sharing how to implement it in TensorFlow 2.0, the newest version of TensorFlow, which Google released in September 2019. For more information on how to install TensorFlow 2.0, you can follow my previous tutorial here.

Before we continue, here are the links to the original YOLO papers. Check them out below:

Who is this tutorial for?

When I started learning YOLO a few years ago, I found it really difficult to understand both the concept and the implementation. There are tons of blog posts and GitHub repos about it, and they do a great job, but most of them present complex architectures in an advanced programming style.

Back then, I had to push myself to study them one after another, and I ended up debugging the code line by line in order to grasp the core of YOLO’s concept. Fortunately, I didn’t give up. After spending a lot of time, I finally made it work.

Based on that experience, I’ve tried to make this tutorial easy and useful for beginners who have just started with deep learning, especially object detection. Without using a complicated coding style, this tutorial aims to be a gentle explanation of YOLOv3’s implementation in TensorFlow 2.0, and I hope it helps those of you who are just getting started.


To follow along, you should:

  • Be familiar with Python 3.
  • Understand object detection and Convolutional Neural Networks (CNNs).
  • Know basic TensorFlow usage.

What will you get after completing this tutorial?

This tutorial is broken into 4 parts:

  1. Part-1 (this part) gives a brief introduction to YOLOv3 and how the algorithm works.
  2. Part-2 discusses how to parse YOLOv3’s configuration file (yolov3.cfg) and create the YOLOv3 network from it.
  3. Part-3 looks at how to load YOLOv3’s pre-trained weights file (yolov3.weights) and convert it into the TensorFlow 2.0 weights format.
  4. Part-4, the last part of this tutorial, explains the encoding process of YOLOv3’s bounding boxes and how to get rid of unnecessary detected boxes using non-maximum suppression (NMS).
    Finally, to complete this tutorial, we’re going to test the implementation on images and videos.

Now it’s time to get started with a brief overview of everything we’ll see in this post. For those of you who don’t have a lot of prior experience with this topic, I’m going to give a brief introduction to YOLOv3 and how the algorithm actually works.

What is YOLO?

As its name (You Only Look Once) suggests, YOLO runs a single forward pass of a neural network over the whole image and predicts the bounding boxes together with their class probabilities. This makes YOLO a very fast, real-time object detection algorithm. As described in the original paper, YOLOv3 uses a feature extractor with 53 convolutional layers, called Darknet-53, as you can see in the following figure.

How does YOLO work?

YOLOv3’s network divides an input image into an S x S grid of cells and predicts bounding boxes as well as class probabilities for each grid cell. Each grid cell is responsible for predicting B bounding boxes, each with C class probabilities, for objects whose centers fall inside the grid cell. Bounding boxes are the regions of interest (ROI) of the candidate objects. “B” is the number of anchors used. Each bounding box has (5 + C) attributes. The “5” refers to 5 bounding box attributes: the center coordinates (bx, by), the shape (bh, bw), and one confidence score. “C” is the number of classes. The confidence score reflects how confident the model is that a box contains an object, and it lies in the range 0–1. We’ll be talking about this confidence score in more detail in the section Non-Maximum Suppression (NMS).

Since we have an S x S grid of cells, after running a single forward pass of the convolutional neural network over the whole image, YOLOv3 produces a 3-D tensor with the shape [S, S, B * (5 + C)].

The following figure illustrates the basic principle of YOLOv3, where the input image is divided into a 13 x 13 grid of cells (the 13 x 13 grid is used for the first scale; YOLOv3 actually uses 3 different scales, which we’re going to discuss in the section Prediction across scale).

YOLOv3 was trained on the COCO dataset with C = 80 and B = 3. So, for the first prediction scale, after a single forward pass of the CNN, YOLOv3 outputs a tensor with the shape [13, 13, 3 * (5 + 80)], that is, [13, 13, 255].
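
To make the arithmetic concrete, here is a minimal sketch in plain Python that computes the output shape of one detection scale. The function name yolo_output_shape is my own, chosen only for illustration; it is not part of TensorFlow or Darknet.

```python
def yolo_output_shape(grid_size, num_anchors, num_classes):
    """Return the [S, S, B * (5 + C)] shape of one detection scale."""
    # 4 box coordinates + 1 confidence score + C class probabilities per box
    attrs_per_box = 5 + num_classes
    return (grid_size, grid_size, num_anchors * attrs_per_box)


# COCO setting: C = 80 classes, B = 3 anchors, first scale S = 13
print(yolo_output_shape(grid_size=13, num_anchors=3, num_classes=80))  # (13, 13, 255)
```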

Anchor Box Algorithm

Basically, one grid cell can detect only one object, the one whose mid-point falls inside the cell. But what if a grid cell contains the mid-points of more than one object? That happens when multiple objects overlap. To handle this case, YOLOv3 uses 3 different anchor boxes for every detection scale.

The anchor boxes are a set of pre-defined bounding boxes of a certain height and width that are used to capture the scale and aspect ratio of the specific object classes we want to detect.

Since predictions are made across 3 scales, there are 9 anchor boxes in total, given as (width×height): (116×90), (156×198), (373×326) for the first scale (13 x 13), (30×61), (62×45), (59×119) for the second scale (26 x 26), and (10×13), (16×30), (33×23) for the third scale (52 x 52). The largest anchors go with the coarsest grid, because that is the scale that detects large objects.

A clear explanation of the anchor box concept can be found in Andrew Ng’s video here.

Prediction across scale

YOLOv3 makes detections at 3 different scales in order to accommodate different object sizes, using strides of 32, 16 and 8. This means that if we feed in an input image of size 416 x 416, YOLOv3 will make detections on grids of 13 x 13, 26 x 26, and 52 x 52.

For the first scale, YOLOv3 downsamples the input image to 13 x 13 and makes a prediction at the 82nd layer. This 1st detection scale yields a 3-D tensor of size 13 x 13 x 255.

After that, YOLOv3 takes the feature map from layer 79 and applies one convolutional layer before upsampling it by a factor of 2 to a size of 26 x 26. This upsampled feature map is then concatenated with the feature map from layer 61. The concatenated feature map is passed through a few more convolutional layers until the 2nd detection is made at layer 94. The second prediction scale produces a 3-D tensor of size 26 x 26 x 255.

The same design is applied one more time to predict the 3rd scale. The feature map from layer 91 is passed through one convolutional layer, upsampled by a factor of 2 to a size of 52 x 52, and then concatenated with the feature map from layer 36. The final prediction is made at layer 106, yielding a 3-D tensor of size 52 x 52 x 255.

Once again, YOLOv3 predicts at 3 different scales, so if we feed in an image of size 416 x 416, it produces 3 output tensors with different shapes: 13 x 13 x 255, 26 x 26 x 255, and 52 x 52 x 255.
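
The relation between the input size, the strides and the grid sizes is a simple integer division. Here is a small sketch of it; the helper name grid_sizes is hypothetical, and it assumes the input size is a multiple of 32.

```python
def grid_sizes(input_size, strides=(32, 16, 8)):
    """Grid size at each detection scale for a square input image."""
    # Each scale sees the image downsampled by its stride.
    return [input_size // stride for stride in strides]


print(grid_sizes(416))  # [13, 26, 52]
```

Feeding in a larger image only changes the grids, not the design: for example, a 608 x 608 input would give grids of 19, 38 and 76 cells per side.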

Bounding box Prediction

For each bounding box, YOLO predicts 4 coordinates: tx, ty, tw, th. tx and ty give the bounding box’s center relative to the grid cell that contains it, and tw and th give the bounding box’s shape, width and height respectively.

The final output of the bounding box predictions needs to be refined based on this formula:

pw and ph are the anchor’s width and height, respectively. The figure below describes this transformation in more detail.

YOLO’s algorithm returns bounding boxes in the form (bx, by, bw, bh). bx and by are the center coordinates of the box, and bw and bh are the box shape (width and height). Generally, to draw boxes, we use the top-left coordinate (x1, y1) and the box shape (width and height). To do this, simply convert them using this relation:
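
As given in the YOLOv3 paper, the refinement can be written as below. The sigmoid function σ and the grid cell offset (cx, cy), measured from the top-left corner of the image in grid units, come from the paper and are not defined in the text above, so treat them as assumptions here:

```latex
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h}
\end{aligned}
```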
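
To see the transformation in action, here is a minimal sketch in plain Python for a single box. It follows the YOLOv3 paper: the center is sigmoid(t) plus the grid cell offset, and the size is the anchor scaled by exp(t). The function names, the anchor values, and the final multiplication by the stride (to convert the center from grid units to pixels) are my own assumptions for illustration, not code from a later part of this tutorial.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, anchor, stride):
    """Turn raw predictions (tx, ty, tw, th) into (bx, by, bw, bh) in pixels."""
    tx, ty, tw, th = t
    cx, cy = cell      # column and row index of the grid cell
    pw, ph = anchor    # anchor width and height, assumed to be in pixels
    # Sigmoid keeps the center inside its cell; the stride maps cells to pixels.
    bx = (sigmoid(tx) + cx) * stride
    by = (sigmoid(ty) + cy) * stride
    # The anchor is scaled exponentially by the predicted size terms.
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh


# Hypothetical values: all-zero predictions in cell (6, 6) of the 13 x 13 grid
box = decode_box(t=(0.0, 0.0, 0.0, 0.0), cell=(6, 6), anchor=(100.0, 50.0), stride=32)
print(box)  # (208.0, 208.0, 100.0, 50.0)
```

With all-zero predictions the box sits exactly in the middle of its cell and has the same size as the anchor, which is a handy sanity check.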
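
In other words, the top-left corner is the center shifted back by half the width and half the height: x1 = bx - bw / 2 and y1 = by - bh / 2. A minimal sketch (the function name center_to_top_left is hypothetical):

```python
def center_to_top_left(bx, by, bw, bh):
    """Convert a (center x, center y, width, height) box to (x1, y1, width, height)."""
    x1 = bx - bw / 2.0
    y1 = by - bh / 2.0
    return x1, y1, bw, bh


print(center_to_top_left(208.0, 208.0, 100.0, 50.0))  # (158.0, 183.0, 100.0, 50.0)
```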

Total Class Prediction

Using the COCO dataset, YOLOv3 predicts 80 different classes. YOLO outputs bounding boxes together with their class predictions. If we split an image into a 13 x 13 grid of cells and use 3 anchor boxes, the total number of predicted boxes is 13 x 13 x 3, or 169 x 3 = 507. However, YOLOv3 uses 3 different prediction scales, which split an image into (13 x 13), (26 x 26) and (52 x 52) grids of cells, with 3 anchors for each scale. So, the total number of predicted boxes is [(13 x 13) + (26 x 26) + (52 x 52)] x 3 = 10,647.
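
The total is easy to verify with a few lines of Python. This is just the arithmetic above written as code; the function name total_predictions is hypothetical.

```python
def total_predictions(grid_sizes=(13, 26, 52), anchors_per_scale=3):
    """Number of bounding boxes predicted over all detection scales."""
    # Every cell of every grid predicts one box per anchor.
    return sum(size * size * anchors_per_scale for size in grid_sizes)


print(total_predictions())  # 10647
```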

Non-Maximum Suppression

After a single forward pass of the CNN, the YOLO network usually suggests multiple bounding boxes for the same detected object. The problem is how to decide which of these bounding boxes is the right one. Fortunately, a method called non-maximum suppression (NMS) solves this problem. Basically, what NMS does is clean up these detections. The first step of NMS is to suppress all the predicted boxes whose confidence score is under a certain threshold value. Let’s say the confidence threshold is set to 0.5, so every bounding box whose confidence score is less than or equal to 0.5 will be discarded.

Yet this step alone is not sufficient to choose the proper bounding boxes, because it cannot eliminate all the unnecessary ones, so a second step of NMS is applied. The remaining boxes are sorted by confidence score from highest to lowest. The box with the highest score is selected as a proper bounding box, and then we find all the other bounding boxes that have a high IOU (intersection over union) with this selected box. Let’s say we’ve set the IOU threshold to 0.5: every bounding box whose IOU with the selected box is greater than 0.5 is removed, because such a high overlap means it corresponds to the same object. This allows us to output only one proper bounding box per detected object. Repeat this process for the remaining bounding boxes, always selecting the one with the highest score, until every bounding box has been either selected or removed.
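
The two steps can be sketched in plain Python as follows. This is only an illustration under a few assumptions of mine: boxes are given as corner coordinates (x1, y1, x2, y2), all boxes belong to the same class, and the function names and sample values are hypothetical. It is not the implementation used later in this tutorial.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    """Return the indices of the boxes to keep, highest score first."""
    # Step 1: discard boxes whose confidence is at or below the threshold.
    candidates = [i for i, s in enumerate(scores) if s > score_threshold]
    # Step 2: go through the survivors from highest to lowest score.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    while candidates:
        best = candidates.pop(0)
        keep.append(best)
        # Remove every remaining box that overlaps the selected one too much.
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 300, 300), (15, 15, 115, 115)]
scores = [0.9, 0.8, 0.7, 0.3]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

Here the last box is dropped by the confidence threshold, the second box is dropped because it overlaps the best box almost completely, and the third box survives because it does not overlap anything. In practice, NMS is usually run separately for each class, and TensorFlow provides a ready-made version as tf.image.non_max_suppression.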

End Notes

Here’s a brief summary of what we have covered in this part:

  • YOLO applies a single neural network to the whole image and predicts the bounding boxes together with their class probabilities, which makes it a very fast, real-time object detection algorithm.
  • YOLO divides an image into an S x S grid of cells. Every cell is responsible for detecting an object whose center falls inside it.
  • To handle overlapping objects whose centers fall in the same grid cell, YOLOv3 uses anchor boxes.
  • To make predictions across scales, YOLOv3 uses three different grid sizes: (13×13), (26×26), and (52×52).
  • Non-maximum suppression is used to eliminate the overlapping boxes and keep only the most accurate one.

If I missed something or you have any questions, please don’t hesitate to let me know in the comments section.

So, this is the end of part-1. After this brief introduction, it’s time to jump into practice. Let’s move on to part-2.

