All about YOLOs — Part4 — YOLOv3, an Incremental Improvement

Source: Deep Learning on Medium


The most accurate of all the YOLO algorithms so far, while still being extremely fast.

This 5-part series aims to explain everything there is to know about YOLO: its history, how it is versioned, its architecture, its benchmarking, its code, and how to make it work for custom objects.

Here are the links for the series.

All about YOLOs — Part1 — a little bit of History

All about YOLOs — Part2 — The First YOLO

All about YOLOs — Part3 — The Better, Faster and Stronger YOLOv2

All about YOLOs — Part4 — YOLOv3, an Incremental Improvement

All about YOLOs — Part5 — Up and Running



First, during training, the YOLOv3 network is fed input images and predicts 3D tensors (the last feature maps) at 3 scales, as shown in the middle of the above diagram. The three scales are designed for detecting objects of various sizes. Take the 13×13 scale as an example: the input image is divided into 13×13 grid cells, and each grid cell corresponds to a 1×1×255 voxel inside the 3D tensor. Here, 255 comes from 3×(4+1+80): 3 boxes per cell, each with 4 bounding box coordinates, 1 objectness score, and 80 class confidences. These values in the 3D tensor are shown on the right of the diagram.
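The depth of each output tensor can be sanity-checked directly from those numbers, along with the output shapes at the three scales for a 416×416 input:

```python
# Depth of each YOLOv3 output tensor: anchors * (box coords + objectness + classes)
num_anchors_per_scale = 3
num_box_coords = 4        # tx, ty, tw, th
num_objectness = 1
num_classes = 80          # COCO

depth = num_anchors_per_scale * (num_box_coords + num_objectness + num_classes)
print(depth)  # 255

# Output tensor shapes at the three scales for a 416x416 input
shapes = [(g, g, depth) for g in (13, 26, 52)]
print(shapes)  # [(13, 13, 255), (26, 26, 255), (52, 52, 255)]
```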

Second, if the center of the object’s ground-truth bounding box falls in a certain grid cell (i.e. the red one on the bird image), this grid cell is responsible for predicting the object’s bounding box. The corresponding objectness score is “1” for this grid cell and “0” for the others. Each grid cell is assigned 3 prior boxes of different sizes. What it learns during training is to choose the right box and calculate precise offsets/coordinates. But how does the grid cell know which box to choose? The rule is that it chooses the box that overlaps the ground-truth bounding box the most.
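That matching rule can be sketched in a few lines. This is a minimal illustration, not the Darknet implementation; the anchor values below are hypothetical, and the overlap is computed on widths/heights only (centers aligned), which is how ground-truth boxes are typically matched to priors:

```python
def iou_wh(wh1, wh2):
    """IoU of two boxes compared by width/height only (centers aligned)."""
    inter = min(wh1[0], wh2[0]) * min(wh1[1], wh2[1])
    union = wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter
    return inter / union

def best_anchor(gt_wh, anchors):
    """Index of the prior box that overlaps the ground-truth box the most."""
    return max(range(len(anchors)), key=lambda i: iou_wh(gt_wh, anchors[i]))

# Hypothetical anchor sizes (w, h) in pixels and one ground-truth box
anchors = [(10, 13), (16, 30), (33, 23)]
print(best_anchor((15, 28), anchors))  # 1 — the (16, 30) prior fits best
```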

Lastly, how are the initial sizes of those 3 prior boxes chosen? The author uses k-means clustering to group all the bounding boxes from the COCO dataset into 9 clusters before training. This yields 9 sizes, 3 for each of the 3 scales. This prior information helps the network learn to compute box offsets/coordinates precisely because, intuitively, a bad choice of box sizes makes learning harder and slower.
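A rough sketch of that clustering step, assuming (as in the paper) that the distance metric is 1 − IoU between box shapes rather than Euclidean distance; the toy boxes below are illustrative, not COCO data:

```python
import random

def iou_wh(wh, anchor):
    """IoU of two boxes compared by width/height only (centers aligned)."""
    inter = min(wh[0], anchor[0]) * min(wh[1], anchor[1])
    return inter / (wh[0] * wh[1] + anchor[0] * anchor[1] - inter)

def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs using 1 - IoU as the distance, returning
    k anchor sizes sorted by area (small -> large)."""
    random.seed(seed)
    centers = random.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for wh in boxes:
            i = max(range(k), key=lambda j: iou_wh(wh, centers[j]))
            clusters[i].append(wh)
        centers = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centers, key=lambda wh: wh[0] * wh[1])

# Toy ground-truth box sizes; in practice these come from the whole dataset
boxes = [(10, 12), (11, 13), (30, 35), (32, 33), (80, 90), (85, 88)]
anchors = kmeans_anchors(boxes, k=3)
print(anchors)
```

On COCO, running this with k=9 over all annotated boxes gives the 9 anchor sizes, split 3 per detection scale.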

The above explanation is credited to

Let’s take a closer look at the improvements.

Improvements in YOLOv3

Bounding boxes

Bounding box prediction is similar to YOLOv2: the network predicts offsets for the box center (x, y) and for the width and height relative to the anchor priors.
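Concretely, the raw outputs (tx, ty, tw, th) are decoded against the grid-cell offset (cx, cy) and the anchor prior size (pw, ph), using the same transform as YOLOv2. A minimal sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """YOLOv2/v3 box decoding:
    bx = sigmoid(tx) + cx,  by = sigmoid(ty) + cy   (center, in grid units)
    bw = pw * exp(tw),      bh = ph * exp(th)       (size, scaled anchor prior)
    """
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh

# A raw prediction of all zeros lands the box center in the middle of
# cell (6, 6) with exactly the anchor's size.
print(decode_box(0, 0, 0, 0, cx=6, cy=6, pw=3.5, ph=2.0))  # (6.5, 6.5, 3.5, 2.0)
```

The sigmoid keeps the predicted center inside its own grid cell, and the exponential keeps widths and heights positive.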

More bounding boxes per image

For an input image of the same size, YOLO v3 predicts more bounding boxes than YOLO v2. For instance, at its native resolution of 416 x 416, YOLO v2 predicted 13 x 13 x 5 = 845 boxes. At each grid cell, 5 boxes were detected using 5 anchors.

On the other hand, YOLO v3 predicts boxes at 3 different scales. For the same 416 x 416 image, the number of predicted boxes is 10,647. This means YOLO v3 predicts more than 10x the number of boxes YOLO v2 predicts, which is why it is slower than YOLO v2. At each scale, every grid cell predicts 3 boxes using 3 anchors. Since there are three scales, 9 anchor boxes are used in total, 3 for each scale.
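Both counts follow directly from the grid sizes. For a 416×416 input, the three scales are 13×13, 26×26, and 52×52 (strides 32, 16, and 8):

```python
# Total predictions for a 416x416 input: 3 scales, 3 anchors per grid cell
scales = (13, 26, 52)       # grid sizes at strides 32, 16, 8
anchors_per_cell = 3

yolov3_boxes = sum(g * g * anchors_per_cell for g in scales)
yolov2_boxes = 13 * 13 * 5  # single scale, 5 anchors per cell

print(yolov3_boxes)  # 10647
print(yolov2_boxes)  # 845
print(round(yolov3_boxes / yolov2_boxes, 1))  # 12.6
```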

Class Predictions

For class predictions, softmax is not used as in YOLOv2; instead, independent logistic classifiers with a binary cross-entropy loss are used. This supports multilabel classification, where labels can overlap (e.g. Woman and Person), in other more complex domain datasets.
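The difference is easy to see numerically: softmax forces the class scores to compete (they sum to 1), while independent sigmoids let several labels be confident at once. The logits below are hypothetical:

```python
import math

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def independent_sigmoids(logits):
    """One logistic classifier per class, as in YOLOv3."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

# Hypothetical logits for overlapping labels, e.g. "person", "woman", "car"
logits = [3.0, 2.5, -2.0]

print([round(p, 2) for p in softmax(logits)])              # forced to sum to 1
print([round(p, 2) for p in independent_sigmoids(logits)]) # first two both > 0.9
```

With sigmoids, both "person" and "woman" can clear a detection threshold for the same box, which softmax cannot express.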

Prediction Across Scales

Unlike YOLO and YOLOv2, which predict the output only at the last layer, YOLOv3 predicts boxes at 3 different scales, as illustrated in the image below.

YOLO struggled with small objects. With YOLOv3, however, we see better performance on small objects, thanks to its shortcut connections and multi-scale predictions.

Features are extracted from these scales, as in a Feature Pyramid Network. The last layer predicts the bounding boxes, objectness, and class predictions. The feature map from 2 layers back is upsampled by 2x, and a feature map from earlier in the network is merged with the upsampled features by concatenation, much like an encoder-decoder architecture. This provides more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map.
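A shape-level sketch of one such merge, assuming nearest-neighbour upsampling and illustrative channel counts (the exact channel widths depend on the layer configuration):

```python
import numpy as np

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return fmap.repeat(2, axis=0).repeat(2, axis=1)

# Toy shapes mirroring one YOLOv3 merge: a deep 13x13 map is upsampled
# and concatenated with a 26x26 map from earlier in the network.
deep = np.zeros((13, 13, 256))      # semantically rich, spatially coarse
earlier = np.zeros((26, 26, 512))   # finer-grained, from an earlier layer

merged = np.concatenate([upsample2x(deep), earlier], axis=-1)
print(merged.shape)  # (26, 26, 768)
```

The merged map feeds the 26×26 detection head; the same pattern repeats once more to produce the 52×52 head.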

K-means clustering is used in YOLOv3 as well to find better bounding box priors: 9 anchor boxes in the case of the COCO dataset.

Feature Extractor

Instead of Darknet-19 as in YOLOv2, YOLOv3 uses Darknet-53, which is much deeper and better. According to the author, it outperforms ResNet-101 and matches ResNet-152 while being faster.


Here is a diagram of YOLOv3’s network architecture. It is a feature-learning-based network that adopts 75 convolutional layers as its most powerful tool. No fully connected layers are used, which makes it possible to handle images of any size. No pooling layers are used either; instead, convolutional layers with stride 2 downsample the feature maps, passing size-invariant features forward. In addition, ResNet-like and FPN-like structures are key to its accuracy improvements.
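The stride-2 downsampling replaces pooling outright: a 3×3 convolution with stride 2 and padding 1 halves the spatial dimensions while learning its own weights. A naive single-channel sketch (real layers operate over many channels and use learned kernels):

```python
import numpy as np

def conv2d_stride2(x, kernel):
    """Naive 3x3 convolution with stride 2 and padding 1, illustrating
    how YOLOv3 downsamples feature maps without pooling layers."""
    x = np.pad(x, 1)
    h = (x.shape[0] - 3) // 2 + 1
    w = (x.shape[1] - 3) // 2 + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[2*i:2*i+3, 2*j:2*j+3] * kernel)
    return out

fmap = np.random.rand(416, 416)
kernel = np.random.rand(3, 3)
print(conv2d_stride2(fmap, kernel).shape)  # (208, 208)
```

Applying such a layer five times takes a 416×416 input down to the 13×13 grid of the coarsest detection scale.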


YOLOv3 is much better than SSD and FPN-based two-stage Faster R-CNN variants, and has performance similar to DSSD.