YOLO V3 Explained

Original article was published by Uri Almog on Artificial Intelligence on Medium


In this post we’ll discuss the YOLO detection network and its versions 1, 2 and especially 3.

In 2016 Redmon, Divvala, Girshick and Farhadi revolutionized object detection with a paper titled: You Only Look Once: Unified, Real-Time Object Detection. In the paper they introduced a new approach to object detection: the feature extraction and object localization were unified into a single monolithic block, and the localization and classification heads were united as well. Their single-stage architecture, named YOLO (You Only Look Once), results in a very fast inference time. The frame rate for 448×448-pixel images was 45 fps (0.022 s per image) on a Titan X GPU, while achieving state-of-the-art mAP (mean average precision). Smaller and slightly less accurate versions of the network reached 150 fps. This new approach, together with other detectors built on Google’s light-weight MobileNet backbone, brought the vision (pun intended) of running detection networks and other CV tasks on edge devices ever closer to reality. The original YOLO project can be found here.

By the way, Redmon seems to be a very colorful guy. His YOLO project site can be found here, where you can also find his resume.

Examining the GluonCV model zoo chart I presented in the previous post, one can see that in a fair comparison the different versions of YOLO-V3 (the red dots), as trained by GluonCV, achieve excellent accuracy, second only to the much slower Faster-RCNNs.

Detector performance chart. Source: GluonCV Model Zoo

The idea behind YOLO is this: There are no classification/detection modules that need to sync with each other and no recurring region proposal loops as in previous 2-stage detectors (see my post on early object detectors like RCNN). It’s basically convolutions all the way down (with the occasional maxpool layer). Instead of cropping out areas with high probability for an object and feeding them to a network that finds boxes, a single monolithic network needs to take care of feature extraction, box regression and classification. While previous models had two output layers — one for the class probability distribution and one for box predictions, here a single output layer contains everything in different features.

Yolo-V1 Architecture

YOLO V1 architecture. Later versions removed the FC layer. Source: https://arxiv.org/pdf/1506.02640.pdf

Yolo-V1 was the first appearance of the 1-stage detector concept. The architecture employed batch normalization (BN) and leaky ReLU activations, which were relatively new at the time. I’m not going to elaborate on V1, since it’s pretty outdated and lacks some of the strong features that were introduced later.

Yolo-V2 Architecture

Yolo-V2 Contains 22 convolutions and 5 maxpool operations. Feature map height represents spatial resolution. The 125-feature output is for VOC PASCAL dataset with 20 classes and 5 anchors. Source: Uri Almog

In Yolo-V2 the authors, among other changes, removed the fully-connected layer at the end. This made the architecture truly resolution-independent (i.e., the same network parameters fit any input resolution). That doesn’t necessarily mean the network will perform well at any resolution; a resolution augmentation routine was employed during training for that. Redmon created multiple flavors of Yolo-V2, including smaller, faster (and less accurate) versions, such as Tiny-Yolo-V2.

Tiny-Yolo-V2 has an extremely simple architecture, since it lacks the strange bypass-and-rearrange operation of its big brother. The tiny version is just a nice, long chain of convolutions and maxpools.

The configuration files describing these architectures can be found in the cfg section of the darknet github.

YOLO-V3 Architecture

Inspired by the ResNet and FPN (Feature-Pyramid Network) architectures, the YOLO-V3 feature extractor, called Darknet-53 (it has 52 convolutions), contains skip connections (like ResNet) and 3 prediction heads (like FPN), each processing the image at a different spatial compression.

YOLO-V3 architecture. Source: Uri Almog

Like its predecessor, Yolo-V3 boasts good performance over a wide range of input resolutions. In GluonCV’s model zoo you can find several checkpoints, one for each input resolution, but in fact the network parameters stored in those checkpoints are identical. Tested with input resolution 608×608 on the COCO-2017 validation set, Yolo-V3 scored 37 mAP (mean Average Precision). This matches the score of GluonCV’s trained version of Faster-RCNN-ResNet50 (a Faster-RCNN architecture that uses ResNet-50 as its backbone), but at 17 times the speed. In that model zoo the only detectors fast enough to compete with Yolo-V3 (the Mobilenet-SSD architectures) scored an mAP of 30 and below.

Feature Pyramid Network (FPN): Dancing At Two Weddings

A Feature-Pyramid is a topology developed in 2017 by FAIR (Facebook A.I. Research) in which the feature map gradually decreases in spatial dimension (as is the usual case), but is later expanded again and merged with earlier feature maps of corresponding sizes (by element-wise addition in the original FPN; YOLO-V3 merges by concatenation). This procedure is repeated, and each merged feature map is fed to a separate detection head.
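In NumPy terms, the expand-and-merge step can be sketched roughly as follows. This is an illustration under assumptions of mine: nearest-neighbor 2× upsampling and channel concatenation (YOLO-V3 concatenates; the original FPN paper merges by addition).

```python
import numpy as np

def upsample2x(fmap):
    # Nearest-neighbor 2x upsampling of a (channels, H, W) feature map.
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def fpn_merge(coarse, fine):
    # Expand the low-resolution map to the fine map's spatial size and
    # concatenate along the channel axis (YOLO-V3-style merge).
    return np.concatenate([upsample2x(coarse), fine], axis=0)

# Example: merge a 19x19 map into a 38x38 map.
coarse = np.zeros((256, 19, 19))
fine = np.zeros((128, 38, 38))
merged = fpn_merge(coarse, fine)  # shape: (384, 38, 38)
```

The merged map carries both the deep, wide-context channels and the shallow, high-resolution ones, which is exactly the point of the pyramid.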

Feature Pyramid Network. Source: Feature Pyramid Networks for Object Detection

Referring to the YOLO-V3 illustration above, the FPN topology allows YOLO-V3 to learn objects at different sizes: the 19×19 detection block has a broader context and a coarser resolution compared with the other detection blocks, so it specializes in detecting large objects, whereas the 76×76 block specializes in detecting small objects. Each of the detection heads has a separate set of anchor scales.

Unlike SSD (Single-Shot Detector) architectures, in which the 38×38 and 76×76 blocks would receive only the high-resolution, partly processed activations from the middle of the feature extractor (the top 2 arrows in the diagram), in FPN architecture those features are concatenated with the low-resolution, fully processed features at the end of the feature extractor.

This enables the network to dance at two weddings, as they say in Yiddish, and utilize both the highly-processed but narrow-context features and the partly-processed but wide-context features, for its predictions.

The output scheme for YOLO-V3 is the same as in V2; both differ from the older V1.

YOLO-V2/V3 Output Scheme — A Single Layer Breakdown:

YOLO V2 and YOLO V3 output layer. Wout and Hout are spatial dimensions of the output feature map. For each anchor, the features are arranged in the described order. Source: Uri Almog

Each cell in the output layer’s feature map predicts one box per anchor: 3 boxes in the case of Yolo-V3 (per detection head) and 5 boxes in YOLO-V2. Each box prediction consists of:

  1. 2 values for box center offsets (in x and y, relative to the cell center),
  2. 2 values for box size scales (in x and y, relative to the anchor dimensions),
  3. 1 value for objectness score (between 0 and 1),
  4. number-of-classes values for class score (between 0 and 1).

(To be precise, the box size values are ‘residual values’. At postprocessing they are used to calculate the box width by

box_width = anchor_width * exp(residual_value_of_box_width) )
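Assuming the standard Darknet formulation (raw center offsets squashed through a sigmoid, residual scales through exp), decoding a single box from the raw features might look like this; the function and argument names are mine:

```python
import math

def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h, stride):
    # Center: the sigmoid squashes the raw offset into (0, 1) inside the cell,
    # and the cell index places it on the grid (Darknet-style formulation).
    bx = (cell_x + 1.0 / (1.0 + math.exp(-tx))) * stride
    by = (cell_y + 1.0 / (1.0 + math.exp(-ty))) * stride
    # Size: the residual scales multiply the anchor dimensions through exp().
    bw = anchor_w * math.exp(tw)
    bh = anchor_h * math.exp(th)
    return bx, by, bw, bh
```

With all-zero raw values the decoded box sits at the cell center with exactly the anchor dimensions, which is why the anchors act as priors.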

The YOLO-V2 illustration above is designed for the 20-class VOC PASCAL dataset and has 5 anchors. The 125-feature output is arranged as follows: for each spatial cell there are 125 features. Feature 0 is the objectness score, features 1–2 are the x and y scales of the box, features 3–4 are the x and y offsets of the box center (relative to the cell coordinate itself), and features 5–24 are the 20 class scores. All this is for the first anchor. Features 25–49 repeat the same feature allocation, this time for the second anchor, and so forth. Note that the anchor dimensions are not expressed directly in the features; rather, the scale values in those features pass to the postprocessing stage, where they are combined with the corresponding anchor dimensions for the box decoding.
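To make the layout concrete, here is a small sketch that slices one cell’s 125 features into per-anchor groups, following the ordering described above (the actual memory layout in Darknet may differ):

```python
import numpy as np

NUM_ANCHORS, NUM_CLASSES = 5, 20
FEATS_PER_ANCHOR = 1 + 4 + NUM_CLASSES  # objectness + box + classes = 25

def split_anchor_features(cell_features):
    # Split one cell's 125 features into per-anchor dicts, using the
    # ordering described in the text (objectness, scales, offsets, classes).
    assert cell_features.shape == (NUM_ANCHORS * FEATS_PER_ANCHOR,)
    anchors = []
    for a in range(NUM_ANCHORS):
        f = cell_features[a * FEATS_PER_ANCHOR:(a + 1) * FEATS_PER_ANCHOR]
        anchors.append({
            "objectness": f[0],
            "scale_xy": f[1:3],
            "offset_xy": f[3:5],
            "class_scores": f[5:25],
        })
    return anchors

# A dummy cell whose feature values equal their indices, for inspection.
cell = np.arange(125, dtype=np.float32)
parsed = split_anchor_features(cell)
```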

The objectness is a new concept and is worth a discussion, which will take place in a few paragraphs. For now let’s think of it as the network’s confidence that some object exists in a given box, while the class score is the conditional probability of a class given that there is an object in the box (i.e. the probability of class x given an object exists in this box). The total confidence score for each class is thus the product of the objectness and the class score.

The output of the network then goes through the NMS and a confidence threshold to give the final predictions, as explained in my post on detector basics.
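A rough sketch of that post-processing stage, assuming corner-format boxes and illustrative thresholds (Darknet’s exact pipeline differs in its details):

```python
def box_iou(a, b):
    # Intersection over Union for corner-format boxes (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.45, score_thresh=0.005):
    # Greedy NMS sketch: drop low-confidence boxes, then keep only the
    # highest-scoring box of each overlapping cluster.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if scores[i] < score_thresh:
            continue  # confidence threshold: discard near-zero objectness
        if all(box_iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep  # indices of the surviving boxes
```

The scores fed to this stage would be the objectness times the class score, and the procedure is run once per class.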

Why Does YOLO Perform Better Than Previous Architectures?

If there’s one thing I learned while working with YOLO and its smaller versions, it’s that looks can deceive: YOLO has such a small and simple topology, how complicated can it be? Well. What enables it to have such a simple, or more precisely compact, structure is the fact that its loss function is very complicated. It is the loss that gives the features their meaning, so a carefully and thoughtfully crafted loss function can pack a lot of information into a small feature map.

In the next section we will discuss the most important features of the YOLO loss function.

YOLO Training and Loss Mechanism

This section is based on research I did on the training flow of the Darknet framework (the framework developed by Redmon), when I was working on an independent TensorFlow implementation of that framework.

Input Resolution Augmentation

As a fully-convolutional network, not containing fully-connected layers for the classification task as previous detectors did, it can process input images of any size. But since a network trained at a single resolution will not automatically perform well at other resolutions, the network is trained with resolution augmentation: the authors used 10 input resolution steps between 384×384 and 672×672 pixels that alternate randomly every few training batches, enabling the network to generalize its predictions to different resolutions.
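A possible sketch of such a schedule; the switch interval and the seeding scheme are my assumptions, not Darknet’s exact logic:

```python
import random

# The 10 input resolutions described in the text: multiples of the
# 32-pixel network stride between 384 and 672.
RESOLUTIONS = list(range(384, 672 + 1, 32))  # [384, 416, ..., 672]

def pick_resolution(batch_idx, switch_every=10, seed=0):
    # Hypothetical schedule: re-draw a random resolution once every
    # `switch_every` batches, deterministically per window for reproducibility.
    rng = random.Random(seed + batch_idx // switch_every)
    return rng.choice(RESOLUTIONS)
```

Every batch inside the same window gets the same resolution, and a new one is drawn when the window rolls over.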

Loss Coefficients — Divide and Conquer

Different boxes are treated differently by the loss function.

As explained in my previous post, each spatial cell in the network’s output layers predicts multiple boxes (3 per detection head in YOLO-V3, 5 in YOLO-V2), all centered in that cell, via a mechanism called anchors.

The YOLO loss for each box prediction is comprised of the following terms:

  1. Coordinate loss, due to a box prediction not exactly covering an object,
  2. Objectness loss, due to a wrong box-object IoU prediction,
  3. Classification loss, due to deviations from predicting ‘1’ for the correct class and ‘0’ for all the other classes for the object in that box,
  4. A special loss that we’ll elaborate on two sections down.

YOLO-V1 loss function. The lambdas are loss coefficients. The top 3 lines are the loss contributed by the ‘best boxes’ (boxes that best capture GT objects in each spatial cell) while the 4th is due to the boxes that did not capture objects. In YOLO V2 and V3 the direct width and height predictions and the square root were replaced with a residual scale prediction, to make the loss argument proportional to the relative rather than the absolute scale error. Source: You Only Look Once: Unified, Real-Time Object Detection

The quality of a box prediction is measured by its IoU (Intersection over Union) with the object it tries to predict (more precisely — with its ground truth box). IoU values range from 0 (the box completely misses the object) to 1.0 (a perfect fit).
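A minimal IoU helper for axis-aligned boxes, assuming a (x_min, y_min, x_max, y_max) corner format:

```python
def iou(box_a, box_b):
    # Intersection rectangle of the two boxes.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    # Union = sum of areas minus the intersection, counted once.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Two disjoint boxes give 0, a perfect overlap gives 1.0, and a partial overlap lands somewhere in between.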

For each spatial cell, among the box predictions centered in that cell, the loss function finds the one with the best IoU with the object centered in that cell. This mechanism, distinguishing between best boxes and all the other boxes, is at the heart of the YOLO loss.

The best boxes, and they alone, incur coordinate loss (due to a less-than-perfect fit with the object) and classification loss (due to classification errors). This pushes the network parameters associated with those boxes to improve the box scale and location, as well as the classification. These boxes also incur objectness loss, which will be explained shortly. All the other boxes incur only the objectness loss.
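The best-box logic can be sketched for a single cell as follows. This is a simplified illustration, not Darknet’s actual code: the coefficients are made up, and the objectness target for the non-best boxes is assumed to be 0.

```python
import numpy as np

def iou(a, b):
    # IoU for corner-format boxes (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

L_COORD, L_OBJ, L_NOOBJ, L_CLS = 5.0, 1.0, 0.5, 1.0  # illustrative coefficients

def cell_loss(pred_boxes, pred_obj, pred_cls, gt_box, gt_onehot):
    ious = np.array([iou(b, gt_box) for b in pred_boxes])
    best = int(np.argmax(ious))
    loss = 0.0
    for i, box in enumerate(pred_boxes):
        if i == best:  # only the best box gets coordinate + classification loss
            loss += L_COORD * np.sum((box - gt_box) ** 2)
            loss += L_OBJ * (pred_obj[i] - ious[i]) ** 2   # objectness targets the IoU
            loss += L_CLS * np.sum((pred_cls[i] - gt_onehot) ** 2)
        else:          # every other box: objectness loss only, pushed toward 0
            loss += L_NOOBJ * pred_obj[i] ** 2
    return loss
```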

Objectness Loss — Knowing One’s Worth

Each box prediction has an associated value called ‘objectness’. It takes the place of the confidence that a region proposal contains an object in previous detectors like RCNN, because it is multiplied by the class score to give an absolute class confidence. However, contrary to expectation, this prediction is actually an IoU prediction, i.e. how well the network thinks the box covers the object. The objectness loss term teaches the network to predict a correct IoU, while the coordinate loss teaches the network to predict a better box (which eventually pushes the IoU toward 1.0). All the box predictions contribute to the objectness loss, but only the best-fitting boxes in each spatial cell also contribute a coordinate and a classification loss.

Why do we need the objectness loss?

At inference we usually have multiple boxes with varying coverage for each object. We want the post-processing algorithm to choose the box that covers the object in the most exact way. We also want to choose the box that gives the correct class prediction for the object. How can the algorithm know which box to choose?

Firstly, the objectness tells us how good the coverage is, so boxes with very small objectness (< 0.005) are discarded and don’t even make it to the NMS block (see the explanation in my previous post). This helps remove about 90% of the boxes, which are just artifacts of the architecture and not real detections. (Of course, this depends on the task and the data: if your task is to detect bottle caps in a box full of bottle caps, you can expect a large number of real detections.)

A detection result with NMS disabled. The objectness score tells the NMS which boxes to keep and which to drop. Image from the COCO 2017 dataset.

Secondly, NMS is done for each class separately, so the class score is scaled by the box objectness for a meaningful comparison. If we have two boxes with a high overlap, the first with objectness 0.9 and person probability 0.8 (weighted score 0.72), and the second with objectness 0.5 and person probability 0.3 (weighted score 0.15), the first box will persist and the second will be dropped by the NMS, because the first box’s objectness made it more trustworthy.

Why are the ‘best boxes’ treated differently during training?

I didn’t see any explanation by Redmon on the subject, but my intuition is this: Think of a teacher who has the following strategy: on the first assignment — she looks for the students that do well and puts effort in checking and grading their homework so that they can excel in that subject. For focus, she doesn’t bother to correct the assignments of the less successful students. Instead, she gives them a chance to excel in another subject, in the next assignment.

The reason that only the best boxes are pushed to improve coverage and class is focus. We want the training to converge, and converge well. The network is rich with parameters and there’s plenty of work for all of them, so there’s no rush in optimizing them all at once. Pushing the parameters of all the boxes to catch the same object, rewarding all of them for approximately catching it, may result in a very long and noisy trail down the loss landscape, or worse, getting stuck in suboptimal minima (the boxes might not learn to detect objects with different characteristics, and may remain stuck without being able to learn different behavior). It’s better to exploit the relative success of some boxes by pushing only them to succeed on this type of object, while letting the parameters corresponding to the less successful boxes explore other options (in a way that will shortly be explained).

On the other hand, we want all the boxes to experience the objectness loss. Why? We want all boxes, bad boxes included, to learn to tell whether they’re good or bad, even if they haven’t learned anything else in their entire life (or at least in the training), because the NMS depends on it. (Even so, YOLO gives the ‘best boxes’ a higher objectness loss coefficient than the other boxes.)

Contraction Loss — Exploit Vs. Explore

An interesting mechanism I call anchor contraction is not mentioned explicitly in the paper but I found it in the code. Each box prediction incurs a small loss proportional to its deviation from its original anchor shape and location. This action persists weakly but steadily during the first training epochs, after which the loss coefficient is scheduled to vanish.

While this loss term is nonzero, it generates a weak force that acts to contract each box prediction back to its anchor form.
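Since this mechanism is not documented in the paper, the following is only my schematic reading of it; the function name, the coefficient and the epoch schedule are hypothetical:

```python
import numpy as np

def contraction_loss(pred_boxes, anchor_boxes, epoch, warmup_epochs=2, coeff=0.01):
    # Hypothetical sketch: a weak pull of every predicted box back toward
    # its anchor prior, active only during the first training epochs.
    if epoch >= warmup_epochs:
        return 0.0  # the coefficient is scheduled to vanish
    return coeff * float(np.sum((pred_boxes - anchor_boxes) ** 2))
```

The small coefficient keeps the pull weak relative to the coordinate and classification losses, so the best boxes barely feel it.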

Contraction loss: During training the two boxes are compared with the object centered at their cell (bird). The red box has the best IoU and will contribute coordinate loss and classification loss that will push it to better cover the bird and predict its category. The blue box will be pushed back to its anchor shape that may put it in a better position to capture caterpillars. Source: Uri Almog Photography.

This clever mechanism has the following effect: the boxes that did not succeed in capturing an object (those not included in the ‘best box’ group mentioned above) are pushed back toward their original anchor shape. Since the anchors were designed to be the best priors for capturing objects in the data set, this increases the chance that the weights associated with those boxes will generate more successful boxes in future attempts. Meanwhile, the successful boxes (the ‘best box’ group) also experience this loss! But the coordinate and classification losses are much greater (their coefficients are larger), and they dominate the direction in which the parameters associated with those boxes shift.

After a few epochs, it is assumed that the network has already learned to predict boxes reasonably well, and the anchor contraction stops, allowing the network parameters to fine-tune on the actual ground truth.