Understanding the YOLO Series of Algorithms

Original article can be found here (source): Deep Learning on Medium

1. Introduction To YOLO V1:

YOLO stands for You Only Look Once. It is a real-time yet accurate object detection algorithm proposed by Redmon et al. The name comes from the fact that the YOLO system looks at an image only once and makes its final detection and recognition predictions from that single look.

Firstly, YOLO is extremely fast: the base version processes images at 45 fps and a lighter version at an astounding 150+ fps, well above what real-time processing requires.

YOLO re-frames detection as a regression problem, in contrast to most previously proposed systems, which use classifiers to perform detection. It has a simple single-pass pipeline, which makes it computationally efficient and allows it to achieve the real-time speeds mentioned above.

Secondly, the YOLO system focuses less on pixel values and instead learns the shapes, sizes and aspect ratios of objects well. This allows it to detect objects accurately in artwork datasets, which differ from natural images at the pixel level. YOLO is therefore said to learn more general representations of objects than R-CNN-type models.

Finally, as mentioned above, the entire image is fed to the YOLO system at once, which lets it encode contextual information about the object classes. As a result it makes fewer background errors than Fast R-CNN, which feeds patches of the image to its system rather than the entire image.

2. Detection:

2.1) The YOLO system divides the input image into an S × S grid. If the center of an object falls in a grid cell, that grid cell is responsible for detecting that object.
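To make the grid assignment concrete, here is a minimal Python sketch. The function name `responsible_cell`, the default S = 7 and the 448 × 448 image size in the usage note are illustrative assumptions, not values stated above.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the object center.

    cx, cy       : object center in pixels
    img_w, img_h : image size in pixels
    S            : number of grid cells per side (assumed value)
    """
    col = int(cx / img_w * S)
    row = int(cy / img_h * S)
    # Clamp so a center lying exactly on the right or bottom edge
    # still maps to a valid cell.
    return min(row, S - 1), min(col, S - 1)
```

For a 448 × 448 image, an object centered at (224, 224) falls in cell (3, 3), so only that cell is responsible for detecting it.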

2.2) Each cell predicts B bounding boxes and a confidence score for each box. The boxes are characterized by:

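Height and width relative to the whole image, i.e. lying between 0 and 1.

Coordinates x and y of the box center relative to the bounds of the cell, also lying between 0 and 1.

Confidence score, given as Pr(Object) * IOU(pred, truth), i.e. the IOU between the predicted box and the ground-truth box, weighted by the probability that an object is present.

Pr(Object) is the probability that an object is present in the cell.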
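The parameterization above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `encode_box` and `confidence_target` are hypothetical, S = 7 is an assumed grid size, and Pr(Object) is treated as 1 or 0 as it would be when building training targets.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Encode a pixel-space box (center cx, cy and size w, h).

    Returns (row, col, x, y, w_rel, h_rel) where x, y are offsets of the
    center inside its grid cell and w_rel, h_rel are fractions of the image.
    """
    gx = cx / img_w * S              # center in grid units
    gy = cy / img_h * S
    col = min(int(gx), S - 1)        # cell that contains the center
    row = min(int(gy), S - 1)
    x = gx - col                     # offset within the cell, in [0, 1]
    y = gy - row
    return row, col, x, y, w / img_w, h / img_h


def confidence_target(object_present, iou):
    """Pr(Object) * IOU, with Pr(Object) taken as 1 or 0 (assumption)."""
    return (1.0 if object_present else 0.0) * iou
```

For example, a 112 × 224 box centered at (224, 112) in a 448 × 448 image lands in row 1, column 3, with x = 0.5, y = 0.75, relative width 0.25 and relative height 0.5.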

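Intersection over Union (IOU):

IOU is an evaluation metric that measures how well a predicted bounding box matches the actual ground-truth box. It is the area of overlap between the two boxes divided by the area of their union, so it ranges from 0 (no overlap) to 1 (identical boxes).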
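A short Python sketch of the computation, assuming boxes are given in corner format (x1, y1, x2, y2) with x1 < x2 and y1 < y2. The function name `iou` and the box format are choices made for this example only.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Guard against degenerate boxes with zero total area.
    return inter / union if union > 0 else 0.0
```

Two 2 × 2 boxes offset by one unit in each direction overlap in a 1 × 1 square, so their IOU is 1 / (4 + 4 - 1) = 1/7.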