Source: Deep Learning on Medium
This article is part of an object detection series and summarizes the YOLO algorithm and its updates.
In the previous article https://medium.com/@shaktimaan/rcnn-77b2aee7aa75 we discussed RCNN, a two-stage detector network used for object localization. Further improvements were made on top of the algorithm, such as Fast-RCNN and Faster-RCNN, which considerably increased the speed of object detection. Despite these improvements, inference has not yet reached real-time speed because of the two-stage process of region proposals followed by detection. To improve on this, a few one-shot algorithms have been proposed that eliminate the region-proposal step and treat the problem as a combined classification and regression problem, directly predicting the bounding boxes of detected objects and their corresponding classes in a single step. YOLO, one of the best-known single-shot detection algorithms (hence the name You Only Look Once: no two stages of looking), is the topic of this article.
The input image is divided into S×S cells as shown in the image below. The objects in the image are manually annotated with bounding boxes, and if the center of a bounding box falls within a particular cell of the grid, that cell is responsible for predicting the bounding box of that object. Each cell predicts B bounding boxes of different sizes and shapes where an object can be present. Similarly, each cell gives C class probabilities, one for each of the C kinds of objects the model is being trained on; the ground-truth class should ideally have the highest probability. Although B bounding boxes are predicted, each cell predicts only one object: the bounding box with the highest IoU against the ground truth is considered the responsible one.
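To make the grid assignment concrete, here is a minimal sketch (the function name and the S=7, 448×448 defaults are my own choices for illustration) that maps a ground-truth box center to the cell responsible for it:

```python
def responsible_cell(box_center_x, box_center_y, img_w, img_h, S=7):
    """Return (row, col) of the grid cell whose region contains the box center."""
    # Scale the center into grid units, then truncate to get the cell index.
    col = int(box_center_x / img_w * S)
    row = int(box_center_y / img_h * S)
    # Clamp in case the center lies exactly on the right/bottom image edge.
    return min(row, S - 1), min(col, S - 1)

# A box centered at (300, 150) in a 448x448 image falls in cell (2, 4).
print(responsible_cell(300, 150, 448, 448))
```

Only this one cell's predictions are matched against the object during training; every other cell treats the object as background.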
YOLO is implemented as a CNN with 24 convolutional layers and 2 fully connected layers, as can be seen in the figure below. To improve the speed of detection even further, the authors also developed a smaller model called Fast YOLO, which has 9 convolutional and 2 fully connected layers.
The first 20 convolutional layers are pre-trained on the ImageNet 1000-class competition dataset with an image size of 224×224. Once the convolutional layers are trained for classification, another 4 convolutional layers are added along with 2 fully connected layers, all with random weights. The input size is also increased from 224×224 to 448×448, the reasoning being that detection requires more fine-grained visual information and hence a higher resolution.
The final layer in YOLO predicts both the class probabilities and the bounding box coordinates for all grid cells. The bounding box width and height are normalized by the width and height of the image so that the values stay between 0 and 1. The bounding box x and y coordinates are predicted as offsets within a particular cell so that they are also bounded between 0 and 1.
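The encoding described above can be sketched as follows (a minimal illustration; the function name and S=7 default are assumptions, not the paper's code):

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Encode an absolute box (center cx, cy and size w, h in pixels) into
    YOLO targets: the responsible cell plus four values, all in [0, 1]."""
    # Width and height are normalized by the full image size.
    tw, th = w / img_w, h / img_h
    # x and y become offsets of the center within its grid cell.
    gx, gy = cx / img_w * S, cy / img_h * S
    col, row = int(gx), int(gy)
    tx, ty = gx - col, gy - row
    return row, col, tx, ty, tw, th
```

For example, a 112×56 box centered at (224, 224) in a 448×448 image encodes to cell (3, 3) with offsets (0.5, 0.5) and normalized size (0.25, 0.125).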
Leaky ReLU is used as the activation function for all layers except the final layer, which uses a linear activation. Sum-of-squared error is used as the loss, but it has the issue that it gives equal weight to errors in predicting bounding boxes and errors in predicting the class of the object present in a cell. Also, since the objects in an image are fairly limited, the cells that don't contain an object far outnumber the cells that do, so the confidence loss from the empty cells can skew the total loss and cause training to diverge. To deal with this, different weights are applied: λcoord = 5 and λnoobj = 0.5, giving more weight to the cells containing an object and to their bounding box prediction loss.
Below is the monstrous looking loss function of YOLO.
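For readers without the figure handy, the loss can be written out (reconstructed here from the YOLOv1 paper's notation) as:

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
    + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2
 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
    \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
```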
The indicator 1_i^obj denotes objectness, indicating whether an object is present in cell i. 1_ij^obj means that the jth bounding box in the ith cell is the one responsible for predicting the object. The 1st term in the loss function accounts for the x and y offsets predicted by the cell. The 2nd term is the error in the predicted width and height of the bounding box; square roots are used so that errors in large boxes are penalized less than the same absolute errors in small boxes. The 3rd and 4th terms cover the confidence loss of predicting whether an object is present in the bounding box. The 5th term is the loss in the predicted class probabilities.
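The weighting of these terms can be illustrated with a toy NumPy version of one cell's contribution, assuming a single responsible box per cell (a sketch with made-up vector layout, not the paper's implementation):

```python
import numpy as np

L_COORD, L_NOOBJ = 5.0, 0.5

def yolo_loss_cell(pred, truth, has_obj):
    """Loss contribution of one cell with one responsible box.
    pred/truth layout: [x, y, w, h, confidence, class probs...]."""
    x, y, w, h, c = pred[:5]
    tx, ty, tw, th, tc = truth[:5]
    if not has_obj:
        # Empty cells contribute only a down-weighted confidence loss.
        return L_NOOBJ * (c - tc) ** 2
    # Coordinate loss, up-weighted; sqrt dampens errors on large boxes.
    coord = L_COORD * ((x - tx) ** 2 + (y - ty) ** 2
                       + (np.sqrt(w) - np.sqrt(tw)) ** 2
                       + (np.sqrt(h) - np.sqrt(th)) ** 2)
    conf = (c - tc) ** 2                      # confidence loss
    cls = np.sum((pred[5:] - truth[5:]) ** 2) # class-probability loss
    return coord + conf + cls
```

A perfect prediction in an object cell yields zero loss, while a cell with no object and a spurious confidence of 1 contributes only 0.5, reflecting the λnoobj down-weighting.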
Each cell in the image can predict only one object, so if multiple small objects fall in the same cell the model won't be able to detect all of them.
YOLOv2 improves on the previous YOLOv1 model by using a variety of techniques inspired by contemporary developments in neural network architectures.
Batch normalization is added to the convolutional layers, which improves training and achieves better accuracy. It also allowed removing dropout, which was originally used to address overfitting.
YOLOv1 took a model pre-trained for classification on 224×224 images and fine-tuned it on the object detection task. YOLOv2 also uses a pre-trained model, but it first trains the classification network for 10 epochs on 448×448 images before fine-tuning for detection, which improves the accuracy of the model.
YOLOv1 directly predicts the bounding boxes, which can be a daunting task since the size and shape of an object can vary greatly depending on the dataset. To counter this, YOLOv2 uses pre-computed anchor boxes for each cell and predicts offsets from the anchor boxes, which reduces the complexity of prediction. The anchor box sizes and shapes are determined by running K-means clustering on all ground-truth boxes, and k is picked based on experiments; k = 5 is chosen as a good tradeoff between model complexity and recall. Also, each bounding box can now contain a different object, which means a maximum of k objects can be predicted per cell, compared to 1 for YOLOv1.
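The anchor-generation step can be sketched as k-means over box widths and heights using 1 − IoU (rather than Euclidean distance) as the distance measure, as YOLOv2 does; the helper names and the deterministic initialization below are my own simplifications:

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between boxes and anchors, compared as if centered at the same point.
    boxes: (N, 2), anchors: (k, 2); each row is (width, height)."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = ((boxes[:, 0] * boxes[:, 1])[:, None]
             + (anchors[:, 0] * anchors[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100):
    # Deterministic init: spread initial anchors across boxes sorted by area.
    order = np.argsort(boxes[:, 0] * boxes[:, 1])
    anchors = boxes[order[np.linspace(0, len(boxes) - 1, k).astype(int)]]
    for _ in range(iters):
        # Assign each box to the nearest anchor under the 1 - IoU distance.
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        # Move each non-empty cluster's anchor to its mean width/height.
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = boxes[assign == j].mean(axis=0)
    return anchors
```

Running this over all normalized ground-truth box sizes in the training set yields the k anchor shapes the detector then refines with predicted offsets.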
YOLOv2 is good enough to detect large objects in the image but can miss smaller objects due to the loss of fine-grained information at the final layers. Faster-RCNN and SSD deal with a similar issue by making predictions from feature maps at multiple resolutions. YOLO handles it differently: instead of predicting at multiple resolutions, a skip connection is added from one of the earlier, higher-resolution layers to pass that information directly to the final layer.
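This passthrough connection works by reorganizing 2×2 spatial blocks of the earlier feature map into channels, so a 26×26×512 map becomes 13×13×2048 and can be concatenated with the final 13×13 map. A NumPy sketch (the exact channel ordering in the Darknet implementation may differ):

```python
import numpy as np

def space_to_depth(x, block=2):
    """Rearrange (H, W, C) features into (H/block, W/block, C*block*block)."""
    h, w, c = x.shape
    x = x.reshape(h // block, block, w // block, block, c)
    # Move the two intra-block axes next to the channel axis, then flatten them.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // block, w // block, c * block * block)

features = np.zeros((26, 26, 512))
print(space_to_depth(features).shape)  # (13, 13, 2048)
```

No information is lost in the rearrangement; the fine spatial detail is simply made available to the final detection layer as extra channels.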
Dataset used: http://pjreddie.com/media/files/VOCtrainval_06-Nov-2007.tar
The above dataset has 20 classes of annotated objects. The first task is to generate anchor boxes for the dataset, and then to train the model using them.