All about YOLOs — Part2— The First YOLO

Source: Deep Learning on Medium

Before YOLO, the two major object detection frameworks were DPM (Deformable Parts Model) and R-CNN. Both repurpose classifiers for detection: DPM slides a classifier over the image at many locations and scales, while R-CNN first proposes candidate regions and then passes each region to a more powerful classifier. Either way, the image is looked at thousands of times to perform detection. YOLO started as a project to streamline this approach by building a single neural network that takes an image and returns the detections and their classes in a single pass. Hence the name “You Only Look Once.”

This 5-part series aims to explain everything there is to know about YOLO: its history, how it is versioned, its architecture, its benchmarks, its code, and how to make it work for custom objects.

Here are the links for the series.

All about YOLOs — Part1 — a little bit of History

All about YOLOs — Part2 — The First YOLO

All about YOLOs — Part3 — The Better, Faster and Stronger YOLOv2

All about YOLOs — Part4 — YOLOv3, an Incremental Improvement

All about YOLOs — Part5 — Up and Running

Approach

Take an image and imagine a grid overlaid on top of it. Each cell in the grid is responsible for predicting a few different things.

First, each cell predicts some number of bounding boxes, along with a confidence value for each box (how likely it is that the box contains an object).

Note: Some grid cells have no objects nearby. They still predict bounding boxes, but the confidence for those boxes will be very low.

Note: The thickness of the line indicates the confidence value

Once every cell in the grid has predicted its bounding boxes, we get a map of all the objects in the image, with boxes ranked by their confidence values. This map shows where the objects are in the image, but it does not yet tell us what the objects are.
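To make the grid idea concrete, here is a minimal sketch of how a point in the image maps to a grid cell. It assumes the 7×7 grid and 448×448 input used in the original YOLO paper, where the cell containing an object's center is the one responsible for detecting it. The function name `responsible_cell` is hypothetical and only for illustration.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the S x S grid cell that contains the point (cx, cy).

    cx, cy are pixel coordinates of an object's center; S=7 is an assumption
    taken from the original YOLO paper.
    """
    # Scale the coordinates to grid units, then clamp so that a point lying
    # exactly on the right or bottom edge still falls in the last cell.
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


# An object centered in the middle of a 448x448 image lands in the middle cell.
print(responsible_cell(224, 224, 448, 448))  # (3, 3)
```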

Next, each cell predicts class probabilities. Note that this probability does not say that the grid cell contains an object of that class. It is a conditional probability: if there is an object in the cell, then this is the probability that the object belongs to that class.

In the next step, we multiply these conditional probabilities by the confidence of each bounding box. This weights every bounding box by the actual probability that it contains an object of a given class. The resulting map shows a bunch of detections for the classified objects, many of which have pretty low confidence values.
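The multiplication described above can be sketched with NumPy broadcasting. This is an illustration rather than the actual YOLO code: the array layouts `(S, S, B)` for box confidences and `(S, S, C)` for class probabilities, and the function name `class_scores`, are assumptions made for this example.

```python
import numpy as np


def class_scores(box_conf, class_probs):
    """Combine box confidences with per-cell conditional class probabilities.

    box_conf:    array of shape (S, S, B), one confidence per predicted box
    class_probs: array of shape (S, S, C), P(class | object) per grid cell
    returns:     array of shape (S, S, B, C), a class-specific score per box
    """
    box_conf = np.asarray(box_conf, dtype=float)
    class_probs = np.asarray(class_probs, dtype=float)
    # Add a trailing axis to the confidences and a box axis to the class
    # probabilities so that every box is multiplied by every class probability.
    return box_conf[..., :, None] * class_probs[..., None, :]


# A single grid cell with B=2 boxes and C=3 classes.
conf = [[[0.9, 0.1]]]
probs = [[[0.7, 0.2, 0.1]]]
scores = class_scores(conf, probs)
print(scores.shape)  # (1, 1, 2, 3)
```

In this toy example the confident box gets a score of about 0.63 for the first class, while the low-confidence box in the same cell gets about 0.07 for that class.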

To get a single best detection for each object, we perform Non-Max Suppression, which, as the name says, suppresses the non-maximum values: among the overlapping detections of the same object, the lower-confidence ones are discarded and the best one is left as is.
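A common greedy form of Non-Max Suppression can be sketched as follows. This is a minimal illustration, not necessarily the exact procedure used in the YOLO code base: the `(x1, y1, x2, y2)` box format, the 0.5 overlap threshold, and the function names are all assumptions made for this example. Boxes are assumed to have positive area.

```python
import numpy as np


def iou(box, boxes):
    """Intersection over union of one box with an array of boxes.

    All boxes use the (x1, y1, x2, y2) corner format.
    """
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    # Width and height of the overlap are clipped at zero for disjoint boxes.
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        # Drop every remaining box that overlaps the best one too much.
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep


boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

Here the second box overlaps the first heavily (IoU of about 0.68) and has a lower score, so it is suppressed, while the third box does not overlap anything and survives. In practice this is usually run separately for each class.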

This parameterization fixes the size of the output for each cell. For each bounding box, the cell predicts 4 coordinates and 1 confidence value, and on top of that each cell predicts some number of class probabilities. This leaves a manageable number of values to predict, so a single neural network can be trained to act as the whole detection pipeline.
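The fixed output size can be worked out with a few lines of arithmetic. The sketch below assumes the settings reported in the original YOLO paper for PASCAL VOC (a 7×7 grid, 2 boxes per cell, 20 classes); the function name `output_shape` is hypothetical.

```python
def output_shape(S=7, B=2, C=20):
    """Shape of the prediction tensor for an S x S grid.

    Each cell predicts B boxes with 5 values each (4 coordinates and
    1 confidence) plus C class probabilities.
    """
    return (S, S, B * 5 + C)


shape = output_shape()
print(shape)                           # (7, 7, 30)
print(shape[0] * shape[1] * shape[2])  # 1470 values in total
```

With these assumed settings, the whole detection for an image is a single 7×7×30 tensor, which is what lets one network produce it in one forward pass.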

Such a seamless single network takes about as much time as a typical classification network, which makes YOLO really fast and achieves the “You Only Look Once” part of the goal.