Evaluation Metrics for Object Detection Algorithms

Original article was published by Vijay Shanker Dubey on Deep Learning on Medium

Table of contents

  • Different competitions, different metrics
  • Important definitions
  • Metrics
  • VOC AP and mAP
  • COCO mAP
  • References

Different competitions, different metrics

PASCAL VOC Challenge : The official documentation explaining their criteria for object detection metrics can be accessed here. The metrics used by the current PASCAL VOC object detection challenge are the Precision x Recall curve and Average Precision, which we will be discussing ahead.

COCO Detection Challenge : This challenge uses a different set of metrics to evaluate the accuracy of object detection algorithms. Here you can find the documentation explaining the 12 metrics used to characterize the performance of an object detector on COCO. These metrics are discussed in the coming sections.

The Google Open Images Dataset V4 Competition also uses mean Average Precision (mAP) over its 500 classes to evaluate the object detection task.

Important definitions

  • Intersection Over Union (IOU) : Intersection Over Union (IOU) is a measure based on the Jaccard index that evaluates the overlap between two bounding boxes. It requires a ground truth bounding box and a predicted bounding box. By applying an IOU threshold we can tell whether a detection is valid (True Positive) or not (False Positive).
    IOU is given by the overlapping area between the predicted bounding box and the ground truth bounding box, divided by the area of the union between them:

    IOU = area(Bp ∩ Bgt) / area(Bp ∪ Bgt)
  • True Positive (TP) : A correct detection. Detection with IOU ≥ threshold
  • False Positive (FP) : A wrong detection. Detection with IOU < threshold
  • False Negative (FN) : A ground truth not detected
  • True Negative (TN) : It would represent a correct non-detection. In the object detection task there are many possible bounding boxes that should not be detected within an image. Thus, TN would be all the possible bounding boxes that were correctly not detected (far too many boxes within an image to enumerate). That is why it is not used by the metrics.
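To make the IOU definition concrete, here is a minimal Python sketch for axis-aligned boxes; the (x1, y1, x2, y2) corner format is an assumption for illustration, not something mandated by either challenge:

```python
def iou(box_a, box_b):
    """Intersection Over Union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) tuples with x1 < x2 and y1 < y2.
    """
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

With an IOU threshold of 0.5, two identical boxes (IOU = 1.0) would be a TP, while boxes that barely touch would be an FP.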
The confusion matrix gives us the two basic metrics:

  • Precision = TP / (TP + FP) : the percentage of correct positive predictions
  • Recall = TP / (TP + FN) : the percentage of true positives detected among all relevant ground truths
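The precision and recall formulas (P = TP/(TP+FP), R = TP/(TP+FN)) can be checked with a trivial Python helper; the zero-division guards are an implementation choice, not part of the definitions:

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Fraction of ground truth objects that were detected."""
    return tp / (tp + fn) if tp + fn else 0.0
```

For example, with 7 TPs, 3 FPs, and 8 FNs, precision is 0.7 while recall is 7/15.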


  • Precision x Recall curve : The Precision x Recall curve is a good way to evaluate the performance of an object detector as the confidence threshold is changed. There is one curve per object class. An object detector for a particular class is considered good if its precision stays high as recall increases, which means that if you vary the confidence threshold, precision and recall will both remain high. This can be understood more intuitively from the equations of P and R above, keeping in mind that TP + FN = all ground truths = constant: when recall increases, TP has increased, hence FN has decreased. Since TP has increased, precision remains high only if FP decreases, i.e. the model is making fewer mistakes and hence is good. Usually, Precision x Recall curves start with high precision values that decrease as recall increases. You can see an example of the Precision x Recall curve in the next topic (Average Precision).
  • Average Precision (AP) : It is calculated as the area under the curve (AUC) of the Precision x Recall curve. As P-R curves are often zigzag curves, comparing different curves (different detectors) in the same plot is usually not an easy task. In practice, AP is the precision averaged across all recall values between 0 and 1.
  • Mean Average Precision (mAP) : The mAP score is calculated by taking the mean AP over all classes and/or over all IoU thresholds, depending on the competition.

VOC AP and mAP

PASCAL VOC 2012 challenge uses the interpolated average precision. It summarizes the shape of the Precision x Recall curve by averaging the precision at a set of eleven equally spaced recall levels [0, 0.1, 0.2, …, 1]:

AP = (1/11) · Σ p_interp(r), summed over r ∈ {0, 0.1, 0.2, …, 1}

Instead of using the precision observed at each point, the AP is obtained by interpolating the precision at each recall level r, taking the maximum precision measured at any recall greater than or equal to r:

p_interp(r) = max p(r̃) over all r̃ ≥ r

where p(r̃) is the measured precision at recall r̃.

An illustrative example:

7 images with 15 ground truth objects (green boxes) and 24 detected objects (red boxes)

Please observe the following table. The last column identifies each detection as TP or FP. In this example a detection is considered a TP if IOU ≥ 30%, otherwise it is a FP. Note that in some images there is more than one detection overlapping a ground truth (Images 2, 3, 4, 5, 6 and 7). For those cases the detection with the highest IOU is taken and the other detections are discarded, as described in the PASCAL VOC paper:

A prediction is positive if IoU > 0.5. If there are 5 detections of a single object, only the 1 with the highest IoU is counted as a correct detection (TP) and the remaining 4 are false detections (FP).

The P-R curve is plotted by calculating the precision and recall values of the accumulated TP or FP detections. For this, first we need to order the detections by their confidences, then we calculate the precision and recall for each accumulated detection as shown in the table below:
Note: Total ground truth boxes = 15, so recall will always be calculated as (Acc TP)/15 in this case.
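The accumulation step described above can be sketched as follows; the (confidence, is_tp) input format is an assumption for illustration:

```python
def pr_points(detections, n_ground_truth):
    """Accumulated precision/recall points for one class.

    detections: list of (confidence, is_tp) tuples, one per detection,
    where is_tp was decided beforehand via the IOU threshold.
    """
    # Order detections by descending confidence, then accumulate.
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []
    for _, is_tp in detections:
        if is_tp:
            tp += 1
        else:
            fp += 1
        prec = tp / (tp + fp)          # accumulated precision
        rec = tp / n_ground_truth      # accumulated recall
        points.append((prec, rec))
    return points
```

Each row of the table in the example corresponds to one (precision, recall) point returned here.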

For obtaining the interpolated average precision, the interpolated precision values are obtained by taking the maximum precision whose recall value is greater than or equal to the current recall value. The following curve is obtained:

To get the AP for a given class, we just need to calculate the AUC (Area Under Curve) of the interpolated precision.
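Putting the interpolation and the AUC together, here is a sketch of the "every-point" variant of this computation; padding the curve out to recall 0 and 1 is an implementation choice, not mandated by the challenge:

```python
def voc_ap_auc(recalls, precisions):
    """AP as the area under the interpolated precision-recall curve.

    At every recall level the precision is replaced by the maximum
    precision at any recall >= it, then the AUC of the resulting
    step function is summed.
    """
    # Pad the curve so it spans recall 0 -> 1.
    rec = [0.0] + list(recalls) + [1.0]
    prec = [0.0] + list(precisions) + [0.0]

    # Interpolation: make precision monotonically non-increasing
    # by sweeping from right to left.
    for i in range(len(prec) - 2, -1, -1):
        prec[i] = max(prec[i], prec[i + 1])

    # Sum the rectangle areas under the step function.
    ap = 0.0
    for i in range(1, len(rec)):
        ap += (rec[i] - rec[i - 1]) * prec[i]
    return ap
```

For the same example curve as before (precision 1.0 up to recall 0.5, then 0.5 up to recall 1.0) this yields 0.5 × 1.0 + 0.5 × 0.5 = 0.75.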

For the PASCAL VOC challenge, only 1 IoU threshold of 0.5 is considered, so the mAP is the average of the APs of all 20 object classes.


COCO mAP

For the COCO 2017 challenge, the mAP was calculated by averaging the AP over all 80 object categories AND all 10 IoU thresholds from 0.5 to 0.95 with a step size of 0.05. The authors hypothesize that averaging over IoUs rewards detectors with better localization.

To make it clearer: first the AP is calculated at an IoU threshold of 0.5 for each class, i.e. we calculate the precision at every recall value (0 to 1 with a step size of 0.01); this is then repeated for IoU thresholds of 0.55, 0.60, …, 0.95; finally, the average is taken over all 80 classes and all 10 thresholds to get the primary metric used in the challenge.
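The nested averaging can be sketched as follows; `average_precision` stands in for a hypothetical per-class, per-threshold AP function and is not part of any real API:

```python
def coco_map(average_precision, classes):
    """COCO primary metric: mean AP over all classes and the 10 IoU
    thresholds 0.50, 0.55, ..., 0.95.

    average_precision(cls, iou_thr) is a hypothetical function that
    returns the AP for one class at one IoU threshold.
    """
    thresholds = [0.5 + 0.05 * i for i in range(10)]  # 0.50 ... 0.95
    aps = [average_precision(c, t) for c in classes for t in thresholds]
    return sum(aps) / len(aps)
```

Averaging over both axes at once is equivalent to first averaging over thresholds per class and then over classes.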

Some more metrics are defined in the challenge, such as AP across scales, which evaluates detections based on object size inside the image and shows whether the model performs well only for large objects, only for small objects, or for objects of varying sizes. The authors define small objects as those with area (h × w) less than 32² (on pixel scale), medium objects as those with area between 32² and 96², and large objects as those with area greater than 96².
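The size buckets can be sketched as a small helper; placing the boundary areas 32² and 96² into the medium and large buckets respectively follows the commonly used COCO convention, but is an assumption here:

```python
def coco_size_category(width, height):
    """COCO object-size bucket based on pixel area (h * w)."""
    area = width * height
    if area < 32 ** 2:       # area < 1024 px
        return "small"
    if area < 96 ** 2:       # 1024 px <= area < 9216 px
        return "medium"
    return "large"           # area >= 9216 px
```

APsmall, APmedium, and APlarge are then just the AP computed over ground truths falling in each bucket.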

All the metrics used for COCO evaluation

NOTE: All the metrics are computed allowing at most 100 top-scoring detections per image (across all categories). More details can be found here.