YOLOv3 Object Detection in TensorFlow 2.x

Original article was published by Anushka Dhiman on Deep Learning on Medium

You only look once (YOLO) is a state-of-the-art, real-time object detection system that is both fast and accurate. In this article, we introduce the concept of object detection, explain the YOLO algorithm, and implement such a system in TensorFlow 2.x.

Object Detection

Object Detection is a computer vision technique for locating instances of objects within images or videos. It is a key technology behind applications like surveillance systems, image retrieval systems, and advanced driver assistance systems.

These systems involve not only recognizing and classifying every object in an image, but also localizing each one by drawing the appropriate bounding box around it. This makes object detection a significantly harder task than its traditional computer vision predecessor, image classification.

There are different algorithms for object detection, and they can be divided into two groups:

  1. Algorithms based on classification: These algorithms work in two stages. First, we select regions of interest (ROIs) from the image. Then, we classify those regions using convolutional neural networks. This process is slow because we have to run a prediction for every selected region. Examples of this type are the Region-based Convolutional Neural Network (R-CNN) and its successors Fast R-CNN and Faster R-CNN.
  2. Algorithms based on regression: Instead of selecting regions of interest (ROIs) from the image, we predict classes and bounding boxes for the whole image in a single run of the algorithm. An example of this type is YOLO (You Only Look Once).

YOLO

The YOLO model was first described by Joseph Redmon et al. in the 2015 paper titled “You Only Look Once: Unified, Real-Time Object Detection.”

At the time of its first publication (2016), YOLO achieved state-of-the-art mAP (mean Average Precision) among real-time detectors, while systems like R-CNN and Faster R-CNN were more accurate but much slower. On the other hand, YOLO struggles to localize objects precisely, although it learns a very general representation of the objects. Newer versions bring improvements in both speed and accuracy.

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities.

You Only Look Once: Unified, Real-Time Object Detection, 2015.

Simply put, we take an image as input, pass it through a convolutional neural network, and get a vector of bounding boxes and class predictions as output.

Interpreting the prediction vector

The input image is divided into an S x S grid of cells. For each object present in the image, one grid cell is said to be “responsible” for predicting it: the cell into which the center of the object falls.

Each grid cell predicts B bounding boxes as well as C class probabilities. Each bounding box prediction has five components: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box, relative to the grid cell location, and are normalized to fall between 0 and 1. The (w, h) box dimensions are also normalized to [0, 1], relative to the image size.

The fifth component of the bounding box prediction is the confidence score.

Formally we define confidence as Pr(Object) * IOU(pred, truth) . If no object exists in that cell, the confidence score should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

You Only Look Once: Unified, Real-Time Object Detection, 2015.
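
Since both the confidence target and the choice of a “responsible” predictor rely on IOU, here is a minimal Python sketch of the IOU computation for two boxes given in (x1, y1, x2, y2) corner format (the corner format is an assumption for illustration; YOLO itself predicts center coordinates plus width and height):

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # The intersection area is zero when the boxes do not overlap
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143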

Each grid cell makes B of those predictions, so there are in total S x S x B * 5 outputs related to bounding box predictions.

It is also necessary to predict the class probabilities, Pr(Class(i) | Object). This probability is conditioned on the grid cell containing an object: if no object is present in the grid cell, the loss function will not penalize it for a wrong class prediction. The network only predicts one set of class probabilities per cell, regardless of the number of boxes B, which makes S x S x C class probabilities in total.

Finally, adding the class predictions to the output vector, we get an S x S x (B * 5 + C) tensor as output.
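
For example, with the PASCAL VOC settings used in the original paper (S = 7, B = 2, C = 20), this gives a 7 x 7 x 30 output tensor:

S, B, C = 7, 2, 20                 # grid size, boxes per cell, classes (PASCAL VOC values from the paper)
output_shape = (S, S, B * 5 + C)   # each cell: B boxes * 5 values + C class probabilities
print(output_shape)                # (7, 7, 30)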

The Network

The network structure looks like a normal CNN, with convolutional and max-pooling layers, followed by 2 fully connected layers at the end.
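
As a rough illustration, here is a minimal Keras sketch of this kind of architecture. The layer counts, filter sizes, and the 448 x 448 input are simplified placeholders, not the exact 24-convolutional-layer network from the paper:

import tensorflow as tf
from tensorflow.keras import layers, models

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes

# Heavily simplified YOLO-v1-style network: a conv / max-pool feature extractor
# followed by fully connected layers that emit the S x S x (B * 5 + C) tensor.
model = models.Sequential([
    layers.Input(shape=(448, 448, 3)),
    layers.Conv2D(64, 7, strides=2, padding="same", activation=tf.nn.leaky_relu),
    layers.MaxPool2D(2),
    layers.Conv2D(192, 3, padding="same", activation=tf.nn.leaky_relu),
    layers.MaxPool2D(2),
    layers.Conv2D(256, 3, padding="same", activation=tf.nn.leaky_relu),
    layers.MaxPool2D(2),
    layers.Conv2D(512, 3, padding="same", activation=tf.nn.leaky_relu),
    layers.MaxPool2D(2),
    layers.Flatten(),
    layers.Dense(4096, activation=tf.nn.leaky_relu),
    layers.Dense(S * S * (B * 5 + C)),
    layers.Reshape((S, S, B * 5 + C)),
])
model.summary()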

Loss Function

Since the YOLO algorithm predicts multiple bounding boxes for each grid cell, we only want one of those boxes to be responsible for each object in the image. To achieve this, the loss function only computes the localization loss for the predictor that is “responsible” for a ground-truth object.

The first term of the loss function (reproduced after the list below) computes the loss related to the predicted bounding box position (x, y). It sums over each bounding box predictor (j = 0 .. B) of each grid cell (i = 0 .. S²). 𝟙 obj is defined as:

  • 1, if an object is present in grid cell i and the jth bounding box predictor is “responsible” for that prediction
  • 0, otherwise.
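
For reference, this position term from the original paper’s loss function is:

λ coord Σ_{i=0..S²} Σ_{j=0..B} 𝟙 obj_ij [ (x_i − x̂_i)² + (y_i − ŷ_i)² ]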

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth.

You Only Look Once: Unified, Real-Time Object Detection, 2015.

As for the other terms in the equation: (x, y) is the predicted bounding box position, and (x̂, ŷ) is the actual position from the training data.

The second part of the equation (reproduced after the quote below) is the loss related to the predicted box width and height.

Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

You Only Look Once: Unified, Real-Time Object Detection, 2015.
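
For reference, this width/height term from the paper is:

λ coord Σ_{i=0..S²} Σ_{j=0..B} 𝟙 obj_ij [ (√w_i − √ŵ_i)² + (√h_i − √ĥ_i)² ]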

The third part of the equation (reproduced below) computes the loss associated with the confidence score for each bounding box predictor. C is the predicted confidence score and Ĉ is the intersection over union of the predicted bounding box with the ground truth. 𝟙 obj is equal to 1 when there is an object in the cell, and 0 otherwise; 𝟙 noobj is the opposite.
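
For reference, these confidence terms from the paper are:

Σ_{i=0..S²} Σ_{j=0..B} 𝟙 obj_ij (C_i − Ĉ_i)²  +  λ noobj Σ_{i=0..S²} Σ_{j=0..B} 𝟙 noobj_ij (C_i − Ĉ_i)²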

The λ parameters are used to weight the different parts of the loss function. This is necessary to increase model stability. The highest penalty is for coordinate predictions (λ coord = 5) and the lowest for confidence predictions when no object is present (λ noobj = 0.5).

The last part of the loss function is the classification loss (reproduced below). It looks like a normal sum-squared error for classification, except for the 𝟙 obj term, which ensures that the loss is only computed for cells that actually contain an object.
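
For reference, this classification term from the paper is:

Σ_{i=0..S²} 𝟙 obj_i Σ_{c ∈ classes} (p_i(c) − p̂_i(c))²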

YOLOv3

YOLOv3 is a real-time, single-stage object detection model that builds on YOLOv2 with several improvements. These include a new backbone network, Darknet-53, that utilizes residual connections (or, in the words of the author, “those newfangled residual network stuff”), some improvements to the bounding box prediction step, and the use of three different scales from which to extract features (similar to an FPN).

As mentioned in the original paper, YOLOv3 uses a backbone with 53 convolutional layers, called Darknet-53, which is mainly composed of convolutional and residual structures. It should be noted that the last three layers (average pooling, fully connected, and softmax) are only used for classification training on the ImageNet dataset. When Darknet-53 is used to extract features from an image, these three layers are not used.

Compared with ResNet-101, Darknet-53 is about 1.5 times faster; ResNet-152 has similar performance, but Darknet-53 is more than 2 times faster. In addition, Darknet-53 achieves the highest measured floating-point operations per second, which means the network structure makes better use of the GPU, making it more efficient and faster.

YOLOv3 makes detections at 3 different scales in order to accommodate objects of different sizes, using strides of 32, 16, and 8.

Because YOLOv3 predicts detections at 3 different scales, if we feed it an image of size 416 x 416, it produces 3 output tensors of different shapes: 13 x 13 x 255, 26 x 26 x 255, and 52 x 52 x 255.

In other words, the 416 x 416 input image is split into 3 branches after passing through the Darknet-53 backbone. These branches undergo a series of convolution, upsampling, and merging operations, and three feature maps of different sizes are finally obtained, with shapes of [13, 13, 255], [26, 26, 255], and [52, 52, 255].
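
The 255 channels come from the 3 anchor boxes predicted per grid cell, each with 4 box coordinates, 1 objectness score, and 80 class probabilities for the COCO dataset used by the pre-trained weights, while the grid sizes come from dividing the input size by the strides. A quick sanity check:

input_size = 416
strides = [32, 16, 8]                           # one stride per detection scale
num_anchors = 3                                 # anchor boxes predicted per grid cell
num_classes = 80                                # COCO classes used by the pre-trained weights

channels = num_anchors * (4 + 1 + num_classes)  # 3 * (4 box coords + 1 objectness + 80 classes) = 255
for stride in strides:
    grid = input_size // stride
    print((grid, grid, channels))               # (13, 13, 255), (26, 26, 255), (52, 52, 255)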

Residual connections are used to alleviate the vanishing gradient problem caused by increasing the depth of the neural network, thereby making the network easier to optimize.
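
To make this concrete, here is a minimal Keras sketch of a Darknet-53-style residual block. The filter counts and input size are placeholders for illustration, not the exact configuration of the full 53-layer backbone:

import tensorflow as tf
from tensorflow.keras import layers

def darknet_conv(x, filters, kernel_size):
    # Convolution -> batch normalization -> LeakyReLU: the basic Darknet-53 building block
    x = layers.Conv2D(filters, kernel_size, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU(alpha=0.1)(x)

def residual_block(x, filters):
    # 1x1 bottleneck followed by a 3x3 convolution, added back onto the input (skip connection)
    shortcut = x
    x = darknet_conv(x, filters // 2, 1)
    x = darknet_conv(x, filters, 3)
    return layers.Add()([shortcut, x])

inputs = tf.keras.Input(shape=(416, 416, 3))
x = darknet_conv(inputs, 64, 3)
x = residual_block(x, 64)
model = tf.keras.Model(inputs, x)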

Implementation in TF2.0

This code implementation was inspired by Rokas Balsys and one of his great articles about implementing YOLOv3 on his PyLessons site.

The code, designed to run on Python 3.7 and TensorFlow 2.0, can be found in my GitHub repository.

In my repo, you will find a notebook (.ipynb file) that performs detection on images and video.

First, clone my repository:
git clone https://github.com/pythonlessons/TensorFlow-2.x-YOLOv3.git

Next, install the required python packages:
pip install -r ./requirements.txt

Now, download the pre-trained yolov3.weights:
wget -P model_data https://pjreddie.com/media/files/yolov3.weights

Now, test YOLOv3 detection:
python detection_demo.py

Detection by the Pre-trained Model

Detection by the Custom Trained Model

References:

  1. You Only Look Once: Unified, Real-Time Object Detection — Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi
  2. YOLOv3: An Incremental Improvement — Joseph Redmon, Ali Farhadi
  3. https://pylessons.com/ — Rokas Balsys