Original article published by Vijay Shanker Dubey on Deep Learning on Medium

# Evaluation Metrics for Object detection algorithms

## Table of contents

- Different competitions, different metrics
- Important definitions
- Metrics
- VOC AP and mAP
- COCO mAP
- References

## Different competitions, different metrics

PASCAL VOC Challenge : The official documentation explaining their criteria for object detection metrics can be accessed here. The metrics used by the current PASCAL VOC object detection challenge are the **Precision x Recall curve** and **Average Precision**, which we will discuss ahead.

COCO Detection Challenge : This challenge uses a different set of metrics to evaluate the accuracy of object detection algorithms. Here you can find the documentation explaining the 12 metrics used for characterizing the performance of an object detector on COCO. These metrics will be discussed in the *coming sections*.

Google Open Images Dataset V4 Competition also uses mean Average Precision (mAP) over the 500 classes to evaluate the object detection task.

## Important definitions

**Intersection Over Union (IOU)**: Intersection Over Union (IOU) is a measure based on the Jaccard index that evaluates the overlap between two bounding boxes. It requires a ground-truth bounding box and a predicted bounding box. By applying an IOU threshold we can tell whether a detection is valid (True Positive) or not (False Positive).

IOU is given by the overlapping area between the predicted bounding box and the ground truth bounding box divided by the area of union between them:

IOU = area(B_pred ∩ B_gt) / area(B_pred ∪ B_gt)

- **True Positive (TP)**: A correct detection. Detection with IOU ≥ *threshold*.
- **False Positive (FP)**: A wrong detection. Detection with IOU < *threshold*.
- **False Negative (FN)**: A ground truth not detected.
- **True Negative (TN)**: It would represent a correct misdetection. In the object detection task there are many possible bounding boxes that should not be detected within an image. Thus, TN would be all the possible bounding boxes that were correctly not detected (*too many possible boxes within an image to enumerate*). That is why it is not used by these metrics.
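The IOU formula above can be sketched in a few lines of Python. This is an illustrative implementation, not from the article; boxes are assumed to be `(x1, y1, x2, y2)` corner coordinates.

```python
def iou(box_a, box_b):
    """IOU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap)
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Comparing the returned value against a threshold (e.g. 0.5) is what labels a detection as TP or FP.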

## Metrics

**Precision x Recall curve**: The Precision x Recall curve is a good way to evaluate the performance of an object detector as the confidence threshold is changed. There is one curve for *each object class*. Precision and recall are given by:

Precision = TP / (TP + FP), Recall = TP / (TP + FN)

An object detector of a particular class is considered good if its precision stays high as recall increases, which means that if you vary the confidence threshold, precision and recall will both remain high. This can be understood more intuitively from the equations of P and R above, keeping in mind that **TP + FN = all ground truths = constant**: if recall increases, TP has increased, hence FN has decreased. With TP increased, precision remains high only if FP does not grow, i.e. the *model is making fewer mistakes* and hence is good. Usually, a Precision x Recall curve starts with high precision values, *decreasing* as recall increases. You can see an example of the Precision x Recall curve in the next topic (Average Precision).

**Average Precision (AP)**: It is calculated as the area under the curve (**AUC**) of the Precision x Recall curve. As P-R curves are often zigzag-shaped, comparing different curves (different detectors) in the same plot is usually not an easy task. In practice, AP is the precision averaged across all recall values between 0 and 1.

**Mean Average Precision (mAP)**: The *mean of AP over all classes* **and/or** over all IoU thresholds, depending on the competition.
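The two formulas can be written directly as small helper functions. This is a minimal sketch; the function names are my own, not from the article.

```python
def precision(tp, fp):
    # Fraction of all detections that are correct
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    # Fraction of all ground-truth objects that were detected
    # (tp + fn = total ground truths, a constant for a fixed dataset)
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0
```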

## VOC AP and mAP

PASCAL VOC 2012 challenge uses the **interpolated average precision**. It tries to summarize the shape of the Precision x Recall curve by averaging the precision at a set of eleven equally spaced recall levels [0, 0.1, 0.2, … , 1]:

AP = (1/11) × Σ p_interp(r), for r ∈ {0, 0.1, … , 1}

with

p_interp(r) = max p(r̃), over all r̃ ≥ r

Instead of using the precision observed at each point, the AP is obtained by interpolating the precision at each recall level *r*, taking the **maximum precision** whose recall value is greater than or equal to *r*.
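The 11-point interpolated AP described above can be sketched as follows (an illustrative implementation; the function name is my own):

```python
def voc_ap_11point(recalls, precisions):
    """11-point interpolated AP from raw P-R curve points."""
    ap = 0.0
    for t in [i / 10 for i in range(11)]:  # recall levels 0.0, 0.1, ..., 1.0
        # Interpolated precision at level t: the maximum precision among
        # all points whose recall is >= t (0 if no such point exists).
        candidates = [p for r, p in zip(recalls, precisions) if r >= t]
        ap += max(candidates) if candidates else 0.0
    return ap / 11
```

For example, a detector whose curve reaches precision 1.0 up to recall 0.5 and precision 0.5 at recall 1.0 scores (6 × 1.0 + 5 × 0.5) / 11.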

**An illustrative example:**

Please observe the following table. The last column identifies each detection as TP or FP. In this example a detection is considered a TP if **IOU ≥ 30%**, otherwise it is a FP. **Note that** in some images there is more than one detection overlapping a ground truth (Images 2, 3, 4, 5, 6 and 7). For those cases, only the detection with the *highest IOU* is taken as the TP, discarding the other detections, as described in the PASCAL VOC paper:

> A prediction is positive if IoU > 0.5. If there are 5 detections of a single object, only the 1 with the highest IoU is counted as a correct detection (TP) and the rest 4 are false detections (FP).

The P-R curve is plotted by calculating the precision and recall values of the accumulated TP and FP detections. To do this, we first order the detections by their confidences, then calculate the precision and recall for each accumulated detection, as shown in the table below. **Note**: total ground-truth boxes = 15, so recall is always calculated as **(Acc TP)/15** in this case.
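The sort-then-accumulate procedure just described can be sketched in Python. This is an illustrative helper, not from the article; detections are assumed to arrive as `(confidence, is_tp)` pairs.

```python
def pr_points(detections, n_gt):
    """Accumulated (recall, precision) points for the P-R curve.

    detections: list of (confidence, is_tp) pairs
    n_gt: total number of ground-truth boxes (15 in the article's example)
    """
    points = []
    acc_tp, acc_fp = 0, 0
    # Walk the detections in decreasing order of confidence
    for _, is_tp in sorted(detections, key=lambda d: d[0], reverse=True):
        if is_tp:
            acc_tp += 1
        else:
            acc_fp += 1
        # recall = Acc TP / all ground truths; precision = Acc TP / all detections so far
        points.append((acc_tp / n_gt, acc_tp / (acc_tp + acc_fp)))
    return points
```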

For obtaining the interpolated average precision, the interpolated precision values are obtained by taking the **maximum precision** whose **recall value is greater than or equal to the current recall value**. The following curve is obtained:

For getting the *AP for a given class*, we just need to calculate the **AUC** (Area Under Curve) of the *interpolated precision* curve.

For *PASCAL VOC challenge*, only 1 IoU threshold of 0.5 is considered. So the *mAP* is the average of AP of all **20 object classes.**

## COCO mAP

For the COCO 2017 challenge, the mAP was calculated by averaging the AP over all *80 object categories* **AND** all *10 IoU thresholds* from 0.5 to 0.95 with a step size of 0.05. The authors hypothesize that **averaging over IoUs rewards detectors with better localization**.

To make this clearer: first the AP is calculated at an IoU threshold of 0.5 for each class, i.e. we calculate the precision at every recall value (0 to 1 with a step size of 0.01). This is then repeated for IoU thresholds of 0.55, 0.60, …, 0.95, and finally the average is taken over all 80 classes and all 10 thresholds to get the primary metric used in the challenge.
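The final averaging step can be sketched as follows (an illustrative helper; the table layout is an assumption, not from the COCO API):

```python
def coco_primary_map(ap_table):
    """Average per-class, per-threshold APs over both axes.

    ap_table: dict mapping class name -> {iou_threshold: AP value},
    e.g. 80 classes x 10 thresholds (0.5, 0.55, ..., 0.95) for COCO.
    """
    values = [ap for per_class in ap_table.values() for ap in per_class.values()]
    return sum(values) / len(values)
```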

Some more metrics have been defined in the challenge, like AP across scales, which evaluates detections based on the **object size** inside the image and shows whether the model does well only on large objects, only on small objects, or on objects of varying sizes. The authors define **small objects** as those with **area (h × w)** less than 32² (on pixel scale), **medium objects** as those with area between 32² and 96², and **large objects** as those with **area greater than 96²**.
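The size buckets above reduce to a simple area comparison. A minimal sketch, using the article's h × w area definition (thresholds in squared pixels):

```python
def size_category(width, height):
    """Classify an object as small/medium/large by the COCO area thresholds."""
    area = width * height
    if area < 32 ** 2:        # below 1024 px^2
        return "small"
    if area <= 96 ** 2:       # between 1024 and 9216 px^2
        return "medium"
    return "large"
```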

**NOTE**: All the metrics are computed allowing for **at most 100 top-scoring** detections per image (across all categories). More details can be found here.