Final Layers and Loss Functions of Single Stage Detectors (Part 1)

Source: Deep Learning on Medium

1.1 Motivation

Most deep object detectors consist of a feature extraction CNN (usually pre-trained on ImageNet and fine-tuned for detection) connected to a final layer that reshapes the features into the detector-specific output tensor. Swapping the feature CNN changes both speed and accuracy, as noted in [1], [2], [3]. In many cases, however, training an ImageNet CNN from scratch is impractical in terms of memory and compute power. Often we use an open-sourced, prebuilt model and adjust the last layers and the loss function to accomplish our task. The loss functions of one-stage object detectors, where a single CNN produces both the bounding box and class predictions, can be somewhat unusual because the prediction tensors are used to construct the truth tensor.

As part of the Oracle Machine Learning team, we have been reading the literature on such object detectors and writing up explanations in the mathematical language that we prefer, along with diagrams, pseudo-code and formulas that capture our interpretation of what the authors meant but left out. This is the write-up of a presentation we gave at an Oracle ML reading group. We originally wrote this in LaTeX, but have converted our figures and equations to images to distribute on Medium.

1.2 Object Detection and PascalVOC

Given an image, the task of an object detector is to return the bounding box coordinates and name (class) of each object that we care about in the image. Since it is difficult to talk about algorithms without concrete inputs, we take the PascalVOC dataset [4] as an example.

As shown in Figure 1, for each image, PascalVOC provides an annotation file containing the bounding box coordinates of objects in one of 20 classes. PascalVOC encodes bounding boxes by the top-left (x_min, y_min) and bottom-right (x_max, y_max) corner coordinates, but some object detection algorithms encode boxes using the center xy-coordinates with the width and height.

To feed an image into a convolutional neural network, the image is resized to be square. Because the PascalVOC bounding box coordinates are given in pixels and therefore depend on the image width W and height H, we normalize the box coordinates. The box encoding provided by PascalVOC is not the only way to encode bounding boxes, so the normalization for both the corner-style and center-style encodings is shown in Equation 1.
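Since the equation image is not reproduced here, the following is a minimal Python sketch of the normalization as we read it: corner coordinates are divided by the image width and height, and the center-style box is derived from the corners. The function names and the sample box are our own illustrative choices, not part of PascalVOC.

```python
def normalize_corners(x_min, y_min, x_max, y_max, img_w, img_h):
    """Scale corner-style pixel coordinates into [0, 1]."""
    return (x_min / img_w, y_min / img_h, x_max / img_w, y_max / img_h)


def corners_to_center(x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert corner-style pixel coordinates to a normalized center-style box."""
    x_c = (x_min + x_max) / 2.0 / img_w  # center x as a fraction of image width
    y_c = (y_min + y_max) / 2.0 / img_h  # center y as a fraction of image height
    w = (x_max - x_min) / img_w          # box width as a fraction of image width
    h = (y_max - y_min) / img_h          # box height as a fraction of image height
    return (x_c, y_c, w, h)


# Hypothetical 500 x 400 image with one annotated box.
corner_box = normalize_corners(100, 50, 300, 250, img_w=500, img_h=400)
center_box = corners_to_center(100, 50, 300, 250, img_w=500, img_h=400)
print(corner_box)
print(center_box)
```

Because both encodings are expressed as fractions of the image size, they remain valid after the image is resized to a square.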

Optimization requires numbers, so the class name of each bounding box becomes an integer 𝕔 ∈ {1, …, 20}, since there are 20 classes in PascalVOC. For our purposes, to get from a class name to 𝕔, we find the index of the name in the list of PascalVOC classes in this order: [aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor].
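The lookup can be sketched in a few lines of Python. We assume 1-based ids so that the result lands in the range 1 to 20 stated above; the names `VOC_CLASSES` and `class_id` are ours.

```python
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle",
    "bus", "car", "cat", "chair", "cow",
    "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]


def class_id(name):
    """Return the 1-based class id for a PascalVOC class name."""
    # list.index is 0-based, so shift by one (assumption: ids start at 1).
    return VOC_CLASSES.index(name) + 1


print(class_id("aeroplane"), class_id("cat"), class_id("tvmonitor"))
```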

Thus, for each image, the label is a list of objects, each represented by a normalized 4-dimensional bounding box b and an integer class id 𝕔.

2.1 The Yolo Detector

Since we are primarily interested in analyzing loss functions, all we really need to know about the Yolo CNN (Figure 2a) is that it takes an RGB image (448 × 448 × 3) and returns a cube (7 × 7 × 30), which is interpreted in Figure 2b.

2.2 Yolo v1 bounding box encoding

To begin understanding the interpretation of the 7 × 7 × 30 output, we need to construct the Yolo-style label. Recall that the PascalVOC label for one image is a list of objects, each object being represented by a bounding box and a classification. The goal now is to convert the PascalVOC labels for one image into a form equivalent to the 7 × 7 × 30 tensor Yolo outputs. First, we need to convert from the center-normalized PascalVOC bounding box encoding to the Yolo bounding box encoding.

Instead of predicting the width and height directly, Yolo predicts their square roots, because a given deviation matters more for a small box than for a large one. The square root mapping stretches smaller numbers apart, e.g. any number in [0, 0.25] gets mapped to [0, 0.5].

Instead of predicting the center of the bounding box normalized by the width and height of the image, Yolo predicts xy-offsets relative to a cell in a 7 × 7 grid. Once the image is divided into a 7 × 7 grid, for each object, we locate the grid cell (gx,gy) containing the object’s center. Having assigned the “responsibility” of predicting the object to a grid cell, we describe the center of the bounding box as offsets from that cell, as shown in Figure 3, thus completing the construction of the Yolo-style bounding box.
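The two steps above (square roots of the size, cell-relative offsets for the center) can be sketched as follows. This is our reading rather than the authors' code: we assume the offsets are measured from the cell's top-left corner in units of one cell, and the function name `yolo_encode` is hypothetical.

```python
import math

S = 7  # number of grid cells per side


def yolo_encode(x_c, y_c, w, h, grid=S):
    """Map a normalized center-style box to its grid cell and Yolo-style box."""
    # Grid cell containing the center; clamp so that x_c == 1.0 stays in range.
    gx = min(int(x_c * grid), grid - 1)
    gy = min(int(y_c * grid), grid - 1)
    # Offsets of the center from the cell's top-left corner, in cell units.
    x_off = x_c * grid - gx
    y_off = y_c * grid - gy
    # Width and height are stored as square roots.
    return (gx, gy), (x_off, y_off, math.sqrt(w), math.sqrt(h))


cell, yolo_box = yolo_encode(0.4, 0.375, 0.4, 0.5)
print(cell, yolo_box)
```

For this example the center falls in cell (2, 2), with offsets of roughly 0.8 and 0.625 inside that cell.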

2.3 Assign truth box to predicted box by max IoU

Having assigned the object to a grid cell, we can now construct the truth vector y_(gx,gy) ∈ [0,1]³⁰, which requires the predictions ŷ_(gx,gy) located at that grid cell in the 7 × 7 × 30 tensor output by the Yolo CNN. As seen in Figure 2, each grid cell predicts two bounding boxes with their respective object existence probabilities P(Object) and a single class probability distribution. Each cell therefore predicts only one object; at prediction time, we select the bounding box with the higher value of P(Object), which is the probability that the box contains an object.

To make explanations clearer, we denote b as the true object bounding box and b̂₁ and b̂₂ as the predicted bounding boxes, all of which are in the Yolo encoding style described in Equation 2. We use the object class 𝕔 to construct the true class probability vector p ∈ [0,1]²⁰, in which all elements are zero except at index 𝕔, so p[𝕔] = 1. We define p̂ to be the predicted class probability vector.

We denote by ĉ₁ and ĉ₂ the “confidence” that box 1 and box 2, respectively, contain an object (P(Object) for the respective boxes). We assign b to box 1 or box 2 based on which predicted bounding box has the higher Intersection over Union (IoU), also known as the Jaccard index, with b. For reference, we define the procedure to compute the IoU of two rectangles in Algorithm 1. We set the true confidence c to be the maximum IoU, effectively using the IoU as a proxy for the confidence of assigning the object to the predicted box. This process results in the truth vector y_(gx,gy), an example of which is depicted in Figure 4.
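A minimal sketch of the IoU computation and the assignment step is given below. Several things here are assumptions on our part: the IoU is computed on corner-style boxes (decoding from the Yolo encoding is omitted), the 30-vector is laid out as [box 1 (4), c₁, box 2 (4), c₂, class probabilities (20)], and the entries of the unassigned box are left at zero. The figures define the real layout, so treat the indices as illustrative.

```python
import numpy as np


def iou(a, b):
    """Intersection over Union of two corner-style boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def build_truth(b_enc, cls, iou1, iou2):
    """Build the 30-dim truth vector for the responsible cell (assumed layout)."""
    y = np.zeros(30)
    slot = 0 if iou1 >= iou2 else 5   # start index of the assigned box
    y[slot:slot + 4] = b_enc          # Yolo-encoded truth box
    y[slot + 4] = max(iou1, iou2)     # true confidence c = max IoU
    y[10 + cls - 1] = 1.0             # one-hot class vector, 1-based class id
    return y


truth = (0.2, 0.2, 0.6, 0.6)
pred1 = (0.3, 0.3, 0.7, 0.7)
pred2 = (0.0, 0.0, 0.3, 0.3)
iou1, iou2 = iou(truth, pred1), iou(truth, pred2)
y = build_truth((0.8, 0.8, 0.63, 0.63), cls=8, iou1=iou1, iou2=iou2)
print(iou1, iou2, y[:10])
```

In this example the first prediction overlaps the truth box more, so the truth box and its confidence are written into the first five entries.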

2.4 Yolo v1 loss function

Having used the object encoded as (b, 𝕔) and the prediction from grid cell (gx,gy) to construct y_(gx,gy), we can now formulate the loss L_(gx,gy) for the grid cell responsible for predicting the object. The loss function in the Yolo paper [1] is essentially a weighted sum of squared errors, much like weighted least-squares regression, so once we construct the 30 × 30 weight matrix M_(gx,gy), we can compute the loss algebraically.

Following [1], we denote the weight on the bounding box coordinates as λ_coord, which is set to 5 in [1], and the weight on ĉ₁ and ĉ₂ when the corresponding boxes do not contain objects as λ_noobj, which is set to 0.5 in [1]. To simplify the equations, let Λ_coord be the 4 × 4 matrix of all zeros except for λ_coord repeated on the diagonal, and let 0_4×4 be the 4 × 4 matrix of all zeros. Then we can define the 5 × 5 matrices Q_obj and Q_noobj to weight the bounding box predictions.

Given these definitions, we construct the three possible cases for M_(gx,gy): when b is assigned to b̂₁, when b is assigned to b̂₂, and when there is no object (noobj) assigned to grid cell (gx,gy). For convenience, we define 𝕀 to be the 20 × 20 identity matrix because the class probabilities are not weighted.
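Because the matrix definitions were published as images, here is a NumPy sketch of how we would assemble them. It assumes the same 30-vector layout of [box 1 (4), c₁, box 2 (4), c₂, class probabilities (20)], a quadratic-form loss, a confidence weight of 1 for the assigned box, and an all-zero class block when no object is present, as in [1]. The helper `block_diag` and all variable names are ours.

```python
import numpy as np

LAMBDA_COORD = 5.0  # weight on box coordinates, from [1]
LAMBDA_NOOBJ = 0.5  # weight on confidence of boxes without an object, from [1]


def block_diag(*blocks):
    """Place square blocks along the diagonal of a zero matrix."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        k = b.shape[0]
        out[i:i + k, i:i + k] = b
        i += k
    return out


# 5 x 5 weights for one (box, confidence) slot.
Q_OBJ = np.diag([LAMBDA_COORD] * 4 + [1.0])
Q_NOOBJ = np.diag([0.0] * 4 + [LAMBDA_NOOBJ])
I20 = np.eye(20)

M_BOX1 = block_diag(Q_OBJ, Q_NOOBJ, I20)    # b assigned to box 1
M_BOX2 = block_diag(Q_NOOBJ, Q_OBJ, I20)    # b assigned to box 2
M_NOOBJ = block_diag(Q_NOOBJ, Q_NOOBJ, np.zeros((20, 20)))  # no object


def cell_loss(y, y_hat, M):
    """Weighted squared error (y - y_hat)^T M (y - y_hat) for one grid cell."""
    d = y - y_hat
    return float(d @ M @ d)


y_hat = np.zeros(30)
y_hat[4] = y_hat[9] = 1.0  # both boxes fully confident
print(cell_loss(np.zeros(30), y_hat, M_NOOBJ))
```

Every matrix is diagonal, so the quadratic form is just a per-entry weighted sum of squared differences.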

In the first two cases in Equation 6, the matrices are defined for a grid cell that has been assigned an object. Most of the grid cells for any given image, however, have no object labels and thus fall into the third case. When the grid cell has no object assigned, there are no labels with which to compute y_(gx,gy), so we set y_(gx,gy) = 0. While the generalized matrix form of the loss above still applies, the no-object loss can be reduced to scalar products.
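To make the reduction concrete, here is our own sketch of it in LaTeX. It assumes the loss is the quadratic form (y − ŷ)ᵀ M (y − ŷ) and that the no-object weight matrix, written M_noobj here, is zero everywhere except for λ_noobj at the two confidence positions (so the class probabilities do not contribute, as in [1]). Cell subscripts on y and ŷ are omitted for brevity.

```latex
L_{(g_x,g_y)}
  = (y - \hat{y})^{\top} M_{\text{noobj}} \, (y - \hat{y})
  = \hat{y}^{\top} M_{\text{noobj}} \, \hat{y}
  = \lambda_{\text{noobj}} \left( \hat{c}_1^{\,2} + \hat{c}_2^{\,2} \right)
```

In words, a cell without an object is penalized only for the confidence it places on its two predicted boxes.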

With L_(gx,gy) defined for all cases of grid cells, to get the loss L for the whole image, we sum the losses from all the grid cells.
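As a sketch of this summation, assuming the per-cell loss is the quadratic form (y − ŷ)ᵀ M (y − ŷ) and that the truth vectors, predictions and weight matrices have been stacked into arrays of shape 7 × 7 × 30, 7 × 7 × 30 and 7 × 7 × 30 × 30, the whole-image loss is a single contraction. The function name `image_loss` and the toy inputs are ours.

```python
import numpy as np


def image_loss(Y, Y_hat, M):
    """Sum over all grid cells of (y - y_hat)^T M (y - y_hat).

    Y, Y_hat: arrays of shape (7, 7, 30); M: array of shape (7, 7, 30, 30).
    """
    D = Y - Y_hat
    return float(np.einsum("ijk,ijkl,ijl->", D, M, D))


# Toy check: identity weights everywhere, every entry off by exactly one.
Y = np.zeros((7, 7, 30))
Y_hat = np.ones((7, 7, 30))
M = np.broadcast_to(np.eye(30), (7, 7, 30, 30))
print(image_loss(Y, Y_hat, M))  # 7 * 7 * 30 unit errors
```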

We can see that this is equivalent to the formula of the loss for one image as presented in [1] and shown in Figure 2. (Note that while we expressed the loss in linear algebra because it is the language we prefer, in practice we do not implement it this way.)


  1. Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
  2. Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. CoRR, abs/1612.08242, 2016.
  3. Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. CoRR, abs/1611.10012, 2016.
  4. Mark Everingham, Luc Van Gool, C. K. I. Williams, J. Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) challenge, 2010.
  5. Joseph Redmon. YOLO CVPR 2016 talk and slides. Google slides: presentation/d/1kAa7NOamBt4calBU9iHgT8a86RRHz9Yz2oh4-GTdX6M/edit#slide=id.p, Youtube: Accessed: 2018–10–12.