Evaluating performance of an object detection model

Source: Deep Learning on Medium

What is mAP? How do you evaluate the performance of an object detection model?

In this article you will learn how to use mean Average Precision (mAP) to evaluate the performance of an object detection model: what mAP is, and how to calculate it using 11-point interpolation.

Object detection and instance segmentation

We use machine learning and deep learning to solve regression or classification problems.

We use Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), etc. to evaluate the performance of a regression model.

Classification models are evaluated using accuracy, precision, recall, or the F1 score.

Is object detection a classification or a regression problem?

Multiple deep learning algorithms exist for object detection, such as the R-CNN family (Fast R-CNN, Faster R-CNN, Mask R-CNN) and YOLO.

The objective of an object detection model is twofold:

  • Classification: identify whether an object is present in the image, and the class of that object.
  • Localization: predict the coordinates of the bounding box around the object when an object is present in the image. Here we compare the coordinates of the ground-truth and predicted bounding boxes.

We need to evaluate the performance of both the classification and the localization of the bounding boxes in the image.

How do we measure the performance of an object detection model?

For object detection we use the concept of Intersection over Union (IoU). IoU is the area of the intersection of two bounding boxes divided by the area of their union; here the two boxes are the ground-truth bounding box and the predicted bounding box.

Red is the ground-truth bounding box and green is the predicted bounding box.

An IoU of 1 implies that the predicted and ground-truth bounding boxes overlap perfectly; an IoU of 0 means they do not overlap at all.
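As a minimal sketch of the IoU idea described above, the snippet below computes IoU for boxes given as (x_min, y_min, x_max, y_max). It assumes continuous coordinates (no +1 pixel correction), and the function name `iou` and the sample boxes are hypothetical, chosen only for illustration.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max).

    Assumption: continuous coordinates, so width = x_max - x_min.
    """
    # Corners of the overlap rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero so disjoint boxes give zero overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)                # hypothetical ground-truth box
print(iou(ground_truth, (0, 0, 10, 10)))     # identical boxes
print(iou(ground_truth, (5, 0, 15, 10)))     # box shifted by half its width
print(iou(ground_truth, (20, 20, 30, 30)))   # disjoint boxes
```

Identical boxes give 1.0, disjoint boxes give 0.0, and the half-shifted box gives 50 / 150, roughly 0.33, because the intersection is counted only once in the union.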

You can set a threshold value for the IoU to determine whether an object detection is valid or not.

Let’s say you set the IoU threshold to 0.5. In that case:

  • If IoU ≥ 0.5, classify the object detection as a True Positive (TP).
  • If IoU < 0.5, it is a wrong detection; classify it as a False Positive (FP).
  • When a ground truth is present in the image and the model fails to detect the object, classify it as a False Negative (FN).
  • True Negative (TN): every part of the image where we did not predict an object. This metric is not useful for object detection, hence we ignore TN.

Set the IoU threshold to 0.5 or greater; common choices are 0.5, 0.75, 0.9, or 0.95.
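To illustrate how the choice of IoU threshold changes the outcome, here is a small sketch. The helper `label_detection` and the IoU value of 0.8 are hypothetical; the sketch only covers the TP/FP decision for a detection that has already been paired with a ground-truth box.

```python
def label_detection(iou_value, iou_threshold=0.5):
    """Label one detection from its IoU with the paired ground-truth box."""
    return "TP" if iou_value >= iou_threshold else "FP"


iou_value = 0.8  # hypothetical IoU of one predicted box
for threshold in (0.5, 0.75, 0.9, 0.95):
    print(threshold, label_detection(iou_value, threshold))
```

The same detection counts as a true positive at thresholds 0.5 and 0.75 but becomes a false positive at 0.9 and 0.95, which is why a stricter threshold generally lowers the measured precision.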

Use precision and recall as the metrics to evaluate the performance. Both are calculated from the true positives (TP), false positives (FP), and false negatives (FN): Precision = TP / (TP + FP) and Recall = TP / (TP + FN).

Calculate precision and recall for all objects present in the image.
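A short sketch of the precision and recall calculation, using hypothetical counts (2 true positives, 1 false positive, 2 false negatives). The function name `precision_recall` is an assumption made for this example; returning 0.0 when a denominator is zero is a convention, not a requirement.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from detection counts."""
    # Precision: what fraction of the predicted boxes were correct
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Recall: what fraction of the ground-truth objects were found
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


precision, recall = precision_recall(tp=2, fp=1, fn=2)
print(round(precision, 2), round(recall, 2))
```

With these counts, two of the three predictions are correct (precision of about 0.67) and two of the four objects are found (recall of 0.5).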

You also need to consider the confidence score of each object the model detects in the image. Pick a confidence threshold: predicted bounding boxes with a score above the threshold are treated as positive detections, and predicted bounding boxes with a score below it are treated as negatives.
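The confidence filtering step can be sketched as below. The boxes, scores, and the helper name `filter_by_confidence` are hypothetical and serve only to show the idea of keeping predictions above a score threshold.

```python
def filter_by_confidence(boxes, scores, score_threshold):
    """Keep only the boxes whose confidence score is above score_threshold."""
    kept_boxes, kept_scores = [], []
    for box, score in zip(boxes, scores):
        if score > score_threshold:
            kept_boxes.append(box)
            kept_scores.append(score)
    return kept_boxes, kept_scores


# Hypothetical predictions for one image: [x_min, y_min, x_max, y_max] plus a score each
boxes = [[10, 10, 50, 50], [12, 8, 48, 52], [200, 200, 240, 260]]
scores = [0.92, 0.30, 0.75]

kept_boxes, kept_scores = filter_by_confidence(boxes, scores, score_threshold=0.5)
print(kept_boxes)
print(kept_scores)
```

Sweeping the threshold from low to high and recomputing precision and recall at each step is what produces the points of the precision-recall curve.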

Use the 11-point interpolated average precision to calculate the mean Average Precision (mAP).

How to calculate mAP using 11-point interpolation?

Step 1: Plot Precision and Recall

Plot the precision and recall values on a Precision-Recall (PR) graph. There is generally a trade-off between precision and recall: increasing one tends to decrease the other, so the PR curve usually slopes downward. In practice the curve is not always monotonically decreasing, for example when there is little data.

Source: https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html

Step 2: Calculate the mean Average Precision (mAP) using the 11-point interpolation technique.

The 11-point interpolated average precision is the average of the interpolated precision measured at 11 equally spaced recall levels: 0.0, 0.1, 0.2, 0.3, ..., 0.9, 1.0, as shown in the figure above.

Because the PR graph may not be monotonically decreasing, we interpolate. At each recall level, we replace the precision value with the maximum precision found at that recall level or anywhere to its right, i.e. we take the maximum over all future points.

The rationale is that we are willing to move to a higher recall level if precision gets better there as well.

Finally, calculate the arithmetic mean of the interpolated precision over the 11 recall levels. This gives the Average Precision (AP) for one class; mAP is the mean of the AP values over all classes.
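The interpolation and averaging steps can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `eleven_point_ap` and the recall/precision pairs are hypothetical. The recall levels are built as `i / 10` so that each level compares exactly against recall values such as 0.3.

```python
import numpy as np


def eleven_point_ap(recalls, precisions):
    """Return (AP, interpolated precisions) using 11-point interpolation."""
    recalls = np.asarray(recalls, dtype=float)
    precisions = np.asarray(precisions, dtype=float)
    interpolated = []
    for i in range(11):
        level = i / 10  # recall levels 0.0, 0.1, ..., 1.0
        # All points at this recall level or to its right
        mask = recalls >= level
        # Interpolated precision: the maximum precision among those points
        interpolated.append(float(precisions[mask].max()) if mask.any() else 0.0)
    return float(np.mean(interpolated)), interpolated


# Hypothetical PR points; note the dip from 1.0 to 0.5 before precision recovers to 0.75
recalls = [0.25, 0.5, 0.5, 0.75, 1.0]
precisions = [1.0, 0.5, 0.75, 0.6, 0.5]

ap, interpolated = eleven_point_ap(recalls, precisions)
print(interpolated)
print(round(ap, 3))
```

For these points the interpolated precision is 1.0 at recall levels 0.0 to 0.2, 0.75 at 0.3 to 0.5, 0.6 at 0.6 and 0.7, and 0.5 at 0.8 to 1.0, so the dip to 0.5 at recall 0.5 never shows up in the interpolated curve.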

mAP is always calculated over the entire dataset.

Let’s understand this with the example shown below. The recall values are sorted so that we can plot the PR graph.

Sample Precision and Recall values

11-point interpolation uses the highest precision value for each recall value.

We create 11 equally spaced recall levels: 0.0, 0.1, 0.2, 0.3, ..., 0.9, 1.0.

A recall of 0.2 has the highest precision value, 1.00. A recall of 0.4 has several precision values: 0.4, 0.67, and 0.5. In this scenario, we use the highest one, 0.67. At a recall of 0.6 the precision is 0.5, but at a recall of 0.8 we see a higher precision of 0.57. Following the rationale of 11-point interpolation, we take the maximum over all future points, so the precision we need to consider is 0.57 instead of 0.5. Finally, for a recall of 1.0, we take the maximum precision, which is 0.5.

Now we plot the precision-recall curve along with the interpolated precision.

Finally, we apply the average precision formula:

AP = 1/11 * (4 * 1.0 + 2 * 0.67 + 4 * 0.57 + 1 * 0.5) = 0.74

This gives us the average precision using 11-point interpolation. With a single class, the mAP equals this AP.
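The arithmetic above can be checked in a couple of lines. The list below simply expands the counts used in the formula (four levels at 1.0, two at 0.67, four at 0.57, one at 0.5); the variable names are chosen for this example only.

```python
# Interpolated precision at the 11 recall levels, expanded from the formula's counts
interpolated = [1.0] * 4 + [0.67] * 2 + [0.57] * 4 + [0.5] * 1
assert len(interpolated) == 11

ap = sum(interpolated) / len(interpolated)
print(round(ap, 2))  # 0.74
```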

Python code for calculating mAP for Pascal VOC data format

A Pascal VOC bounding box is defined by (x-top left, y-top left, x-bottom right, y-bottom right).

# Ground truth boxes, keyed by image id
gt_boxes = {"img_00285.png": [[480, 457, 515, 529], [637, 435, 676, 536]]}
# Predicted boxes and their confidence scores, keyed by image id
pred_boxes = {"img_00285.png": {
    "boxes": [[330, 463, 387, 505], [356, 456, 391, 521], [420, 433, 451, 498], [328, 465, 403, 540], [480, 477, 508, 522], [357, 460, 417, 537], [344, 459, 389, 493], [485, 459, 503, 511], [336, 463, 362, 496], [468, 435, 520, 521], [357, 458, 382, 485], [649, 479, 670, 531], [484, 455, 514, 519], [641, 439, 670, 532]],
    "scores": [0.0739, 0.0843, 0.091, 0.1008, 0.1012, 0.1058, 0.1243, 0.1266, 0.1342, 0.1618, 0.2452, 0.8505, 0.9113, 0.972]}}

Importing required libraries

import numpy as np
from copy import deepcopy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter notebook magic; remove this line when running as a plain Python script
%matplotlib inline

Create a dictionary that maps each confidence score to the image ids in which it occurs

def get_model_scores(pred_boxes):
    """Creates a dictionary from model scores to image ids.

    Args:
        pred_boxes (dict): dict of dicts of 'boxes' and 'scores'
    Returns:
        dict: keys are model scores and values are lists of image ids (usually filenames)
    """
    model_score = {}
    for img_id, val in pred_boxes.items():
        for score in val['scores']:
            if score not in model_score:
                model_score[score] = [img_id]
            else:
                model_score[score].append(img_id)
    return model_score

Calculate the IoU for bounding boxes with Pascal VOC format

Pascal VOC bounding boxes
def calc_iou(gt_bbox, pred_bbox):
    '''
    Takes the ground truth bounding box and the predicted bounding box, both as
    [x_topleft, y_topleft, x_bottomright, y_bottomright], and returns the IoU ratio.
    '''
    x_topleft_gt, y_topleft_gt, x_bottomright_gt, y_bottomright_gt = gt_bbox
    x_topleft_p, y_topleft_p, x_bottomright_p, y_bottomright_p = pred_bbox

    if (x_topleft_gt > x_bottomright_gt) or (y_topleft_gt > y_bottomright_gt):
        raise AssertionError("Ground Truth Bounding Box is not correct")
    if (x_topleft_p > x_bottomright_p) or (y_topleft_p > y_bottomright_p):
        raise AssertionError("Predicted Bounding Box is not correct", x_topleft_p, x_bottomright_p, y_topleft_p, y_bottomright_p)

    # If the GT bbox and the predicted bbox do not overlap, then IoU = 0
    if x_bottomright_gt < x_topleft_p:
        # The GT bbox lies entirely to the left of the predicted bbox
        return 0.0
    if y_bottomright_gt < y_topleft_p:
        # The GT bbox lies entirely above the predicted bbox
        return 0.0
    if x_topleft_gt > x_bottomright_p:
        # The GT bbox lies entirely to the right of the predicted bbox
        return 0.0
    if y_topleft_gt > y_bottomright_p:
        # The GT bbox lies entirely below the predicted bbox
        return 0.0

    # Areas use inclusive pixel coordinates, hence the + 1
    GT_bbox_area = (x_bottomright_gt - x_topleft_gt + 1) * (y_bottomright_gt - y_topleft_gt + 1)
    Pred_bbox_area = (x_bottomright_p - x_topleft_p + 1) * (y_bottomright_p - y_topleft_p + 1)

    # Corners of the intersection rectangle
    x_top_left = np.max([x_topleft_gt, x_topleft_p])
    y_top_left = np.max([y_topleft_gt, y_topleft_p])
    x_bottom_right = np.min([x_bottomright_gt, x_bottomright_p])
    y_bottom_right = np.min([y_bottomright_gt, y_bottomright_p])

    intersection_area = (x_bottom_right - x_top_left + 1) * (y_bottom_right - y_top_left + 1)

    union_area = GT_bbox_area + Pred_bbox_area - intersection_area

    return intersection_area / union_area

Calculate precision and recall

def calc_precision_recall(image_results):
    """Calculates precision and recall from the set of images.

    Args:
        image_results (dict): dictionary formatted like:
            {
                'img_id1': {'true_positive': int, 'false_positive': int, 'false_negative': int},
                'img_id2': ...
                ...
            }
    Returns:
        tuple: of floats of (precision, recall)
    """
    true_positive = 0
    false_positive = 0
    false_negative = 0
    # Accumulate the counts over all images
    for img_id, res in image_results.items():
        true_positive += res['true_positive']
        false_positive += res['false_positive']
        false_negative += res['false_negative']

    try:
        precision = true_positive / (true_positive + false_positive)
    except ZeroDivisionError:
        precision = 0.0
    try:
        recall = true_positive / (true_positive + false_negative)
    except ZeroDivisionError:
        recall = 0.0
    return (precision, recall)

Compute the true positives, false positives, and false negatives for the bounding boxes of a single image.

def get_single_image_results(gt_boxes, pred_boxes, iou_thr):
    """Calculates number of true_pos, false_pos, false_neg for a single image.

    Args:
        gt_boxes (list of list of floats): list of locations of ground truth
            objects as [xmin, ymin, xmax, ymax]
        pred_boxes (list of list of floats): list of predicted boxes,
            formatted like `gt_boxes`
        iou_thr (float): value of IoU to consider as threshold for a
            true prediction.
    Returns:
        dict: true positives (int), false positives (int), false negatives (int)
    """
    if len(pred_boxes) == 0:
        # No predictions: every ground truth box is missed
        return {'true_positive': 0, 'false_positive': 0, 'false_negative': len(gt_boxes)}
    if len(gt_boxes) == 0:
        # No ground truth: every prediction is a false positive
        return {'true_positive': 0, 'false_positive': len(pred_boxes), 'false_negative': 0}

    # Collect every (prediction, ground truth) pair that reaches the IoU threshold
    gt_idx_thr = []
    pred_idx_thr = []
    ious = []
    for ipb, pred_box in enumerate(pred_boxes):
        for igb, gt_box in enumerate(gt_boxes):
            iou = calc_iou(gt_box, pred_box)
            if iou >= iou_thr:
                gt_idx_thr.append(igb)
                pred_idx_thr.append(ipb)
                ious.append(iou)

    # Sort candidate pairs by IoU, highest first, so the best overlaps are matched first
    iou_sort = np.argsort(ious)[::-1]
    if len(iou_sort) == 0:
        # No pair reaches the threshold: all predictions are wrong, all ground truths missed
        return {'true_positive': 0, 'false_positive': len(pred_boxes), 'false_negative': len(gt_boxes)}

    gt_match_idx = []
    pred_match_idx = []
    for idx in iou_sort:
        gt_idx = gt_idx_thr[idx]
        pr_idx = pred_idx_thr[idx]
        # If both boxes are still unmatched, add them to the matches
        if (gt_idx not in gt_match_idx) and (pr_idx not in pred_match_idx):
            gt_match_idx.append(gt_idx)
            pred_match_idx.append(pr_idx)

    tp = len(gt_match_idx)
    fp = len(pred_boxes) - len(pred_match_idx)
    fn = len(gt_boxes) - len(gt_match_idx)

    return {'true_positive': tp, 'false_positive': fp, 'false_negative': fn}

Finally, calculate the mAP using the 11-point interpolation technique. You can specify your IoU threshold here; otherwise a default value of 0.5 is used.

def get_avg_precision_at_iou(gt_boxes, pred_bb, iou_thr=0.5):
    """Calculates the 11-point interpolated average precision at a given IoU threshold.

    Note: the 'scores' and 'boxes' lists in `pred_bb` are sorted in place.
    """
    model_scores = get_model_scores(pred_bb)
    sorted_model_scores = sorted(model_scores.keys())

    # Sort the predicted boxes in ascending order of score (lowest scoring boxes first)
    for img_id in pred_bb.keys():
        arg_sort = np.argsort(pred_bb[img_id]['scores'])
        pred_bb[img_id]['scores'] = np.array(pred_bb[img_id]['scores'])[arg_sort].tolist()
        pred_bb[img_id]['boxes'] = np.array(pred_bb[img_id]['boxes'])[arg_sort].tolist()

    pred_boxes_pruned = deepcopy(pred_bb)

    precisions = []
    recalls = []
    model_thrs = []
    img_results = {}
    # Loop over model score thresholds and calculate precision, recall
    for ithr, model_score_thr in enumerate(sorted_model_scores[:-1]):
        # On the first iteration, define img_results for every image;
        # afterwards only the images that contain this score can change
        img_ids = gt_boxes.keys() if ithr == 0 else model_scores[model_score_thr]
        for img_id in img_ids:
            gt_boxes_img = gt_boxes[img_id]
            box_scores = pred_boxes_pruned[img_id]['scores']
            # Count the boxes whose score does not exceed the current threshold
            start_idx = 0
            for score in box_scores:
                if score <= model_score_thr:
                    start_idx += 1
                else:
                    break

            # Remove the boxes and scores at or below the threshold
            pred_boxes_pruned[img_id]['scores'] = pred_boxes_pruned[img_id]['scores'][start_idx:]
            pred_boxes_pruned[img_id]['boxes'] = pred_boxes_pruned[img_id]['boxes'][start_idx:]

            # Recalculate the results for this image
            img_results[img_id] = get_single_image_results(
                gt_boxes_img, pred_boxes_pruned[img_id]['boxes'], iou_thr=iou_thr)

        # Calculate precision and recall over all images at this threshold
        prec, rec = calc_precision_recall(img_results)
        precisions.append(prec)
        recalls.append(rec)
        model_thrs.append(model_score_thr)

    precisions = np.array(precisions)
    recalls = np.array(recalls)

    # 11-point interpolation: maximum precision at or to the right of each recall level
    prec_at_rec = []
    for recall_level in np.linspace(0.0, 1.0, 11):
        try:
            args = np.argwhere(recalls >= recall_level).flatten()
            prec = max(precisions[args])
        except ValueError:
            # No point reaches this recall level
            prec = 0.0
        prec_at_rec.append(prec)
    avg_prec = np.mean(prec_at_rec)

    return {
        'avg_prec': avg_prec,
        'precisions': precisions,
        'recalls': recalls,
        'model_thrs': model_thrs}
