Original article was published on Artificial Intelligence on Medium
Object Detection: Inspecting your dataset first
When we work on a public dataset, we take things for granted because the dataset is probably well constructed. And if it is not, someone would have spotted the problem and suggested a fix. Take the Pascal VOC dataset as an example: you would do what most people do, which is to combine the 2007 and 2012 datasets, use Train+Val as the training set and Test as the validation set during training, and jump right into training your latest Object Detection model.
But what if you curated your own dataset, or your customer provided one? You should take a look at it and identify any issues early, even before training your model. I used to do that with Bash scripts and sometimes even Perl (does anyone still remember Perl?). Or you could just use Python, like everyone else. Better still, use a Python library such as Pandas to visualize the data.
The very first customer I worked with on a custom model provided the dataset. I requested that they deliver it in the Pascal VOC format using VoTT, a great tool by Microsoft. I'll write another post on using VoTT and how we can split the labeling workload across multiple engineers and still easily assemble the dataset. In the beginning, there was a lot of back and forth with the customer on the quality of the dataset, and eventually I wrote some simple scripts to create a report on the dataset to see if it was good enough for the project.
In this article, we will use the good old Pascal VOC dataset as an example.
The whole script is available as a Jupyter notebook you can download.
VOC2007 and VOC2012
You will need to download the VOC2007 and VOC2012 datasets from here. Both datasets will then be extracted into a directory called VOCdevkit.
Creating a CSV file of all the detections
A little disclaimer here: I adapted this script from the author of YOLO, and you can find a copy here.
In this script, I extract all the bounding boxes and create a CSV file with the following headers:
year, imageset, imagename, imagewidth, imageheight, class, difficulty, pose, truncated, xmin, xmax, ymin, ymax
Each line of the CSV file contains one bounding box.
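The extraction step can be sketched as follows. This is a minimal, self-contained version of the idea rather than the exact script from the notebook: a small VOC-style annotation is inlined as a string (with hypothetical values), where normally you would parse each `Annotations/<imagename>.xml` file and write the rows out with the `csv` module.

```python
import xml.etree.ElementTree as ET

# A minimal Pascal VOC annotation, inlined so the sketch is self-contained.
# Normally you would read Annotations/<imagename>.xml for each image.
xml_text = """
<annotation>
  <size><width>500</width><height>375</height></size>
  <object>
    <name>person</name>
    <difficult>0</difficult>
    <pose>Left</pose>
    <truncated>1</truncated>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
</annotation>
"""

def extract_boxes(root, year, imageset, imagename):
    """Yield one CSV row per bounding box in a VOC annotation."""
    size = root.find("size")
    width = int(size.find("width").text)
    height = int(size.find("height").text)
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        yield [year, imageset, imagename, width, height,
               obj.find("name").text,
               int(obj.find("difficult").text),
               obj.find("pose").text,
               int(obj.find("truncated").text),
               int(box.find("xmin").text), int(box.find("xmax").text),
               int(box.find("ymin").text), int(box.find("ymax").text)]

rows = list(extract_boxes(ET.fromstring(xml_text), "2007", "train", "000005"))
print(rows[0])
```

Each yielded row matches the header order above, so the rows can be written straight into the CSV file.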
Import CSV into Pandas
Once the CSV file is created, we can import it into Pandas and do some visual analysis. But first, a good practice is to print out head() and info() to get a quick understanding of the CSV file we created earlier. You will see that each line is a detection of one object, with the coordinates of the bounding box, the class, difficulty level, pose, and a truncated flag. You will also see the year of the image set and the image name. Since each image can contain more than one object, the same image may appear on multiple lines.
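A quick sketch of this step, using a few inlined rows (hypothetical values) in place of the real CSV file so it runs standalone:

```python
import io
import pandas as pd

# Stand-in for the CSV file produced by the extraction script.
csv_text = """year,imageset,imagename,imagewidth,imageheight,class,difficulty,pose,truncated,xmin,xmax,ymin,ymax
2007,train,000005,500,375,chair,0,Rear,0,263,381,211,340
2007,train,000005,500,375,chair,0,Unspecified,0,165,253,264,372
2012,val,2008_000002,500,375,person,0,Left,1,48,195,240,371
"""

df = pd.read_csv(io.StringIO(csv_text))  # pd.read_csv("detections.csv") with the real file
print(df.head())   # first rows: one bounding box per line
df.info()          # column dtypes and non-null counts
```

Note that image 000005 appears twice because it contains two objects.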
We can do a quick count of how many images are in each image set. Notice we use drop_duplicates('imagename') so that we count images rather than bounding boxes.
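A minimal sketch of that count, on a tiny synthetic DataFrame (image "000005" appears twice because it holds two objects):

```python
import pandas as pd

df = pd.DataFrame({
    "imageset":  ["train", "train", "val", "test"],
    "imagename": ["000005", "000005", "000007", "000009"],
})

# Without drop_duplicates we would count bounding boxes, not images.
images_per_set = df.drop_duplicates("imagename").groupby("imageset").size()
print(images_per_set)
```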
Next, we look at the split of images across the training, validation, and test sets for all years. Most people who use the Pascal VOC dataset lump train and val together as the training set, while test is used as the validation set during training of the model. This increases the number of training samples.
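One way to tabulate that split, again on a small synthetic DataFrame standing in for the real detections:

```python
import pandas as pd

df = pd.DataFrame({
    "year":      ["2007", "2007", "2007", "2012", "2012"],
    "imageset":  ["train", "val", "test", "train", "val"],
    "imagename": ["000005", "000007", "000009", "2008_000002", "2008_000003"],
})

unique_images = df.drop_duplicates("imagename")
# Images per (year, imageset), as a year-by-set table.
split = unique_images.groupby(["year", "imageset"]).size().unstack(fill_value=0)
print(split)

# The common convention: train+val for training, test for validation.
n_train = len(unique_images[unique_images["imageset"].isin(["train", "val"])])
n_val = len(unique_images[unique_images["imageset"] == "test"])
print(n_train, n_val)
```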
We can also look across all detections and check the number of objects per class. Here, we can see that the Pascal VOC dataset has an overwhelming number of person instances compared with every other class. You would probably want to use image augmentation to increase the training samples for the other classes. And if this were a custom dataset, you might want to collect more data to make the classes more balanced.
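The per-class count is a one-liner with value_counts(). A toy sample to illustrate (the real VOC counts will of course differ):

```python
import pandas as pd

df = pd.DataFrame({
    "class": ["person", "person", "person", "dog", "chair"],
})

counts = df["class"].value_counts()  # sorted, largest class first
print(counts)
# counts.plot.barh() gives a quick visual of the imbalance
```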
As a further check on individual classes, we can look at the train/val/test split for each class. We want a good split so that the validation set is a good representation of the training set across all classes.
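A sketch of that per-class breakdown using pd.crosstab, on synthetic rows; normalizing by row shows each class's proportions across the sets:

```python
import pandas as pd

df = pd.DataFrame({
    "class":    ["person", "person", "dog", "dog", "chair"],
    "imageset": ["train", "test", "train", "test", "train"],
})

# Per-class counts split by image set, as row-wise proportions.
split = pd.crosstab(df["class"], df["imageset"], normalize="index")
print(split)
```

In this toy sample, chair appears only in train, which is exactly the kind of skew this check is meant to surface.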
And finally, for any new dataset, you will want a sanity check on the labels and bounding boxes. In the script, you can randomly select an image to display by changing objToInspect.
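The drawing step can be sketched with Matplotlib. This uses a blank synthetic image and one hypothetical box so it runs standalone; with the real dataset you would load `JPEGImages/<imagename>.jpg` for the row selected by objToInspect and draw every box on it:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np

# Blank stand-in for JPEGImages/<imagename>.jpg (hypothetical values).
image = np.zeros((375, 500, 3), dtype=np.uint8)
box = {"class": "person", "xmin": 48, "xmax": 195, "ymin": 240, "ymax": 371}

fig, ax = plt.subplots()
ax.imshow(image)
# VOC boxes are corner coordinates; Rectangle wants origin + width/height.
rect = patches.Rectangle(
    (box["xmin"], box["ymin"]),
    box["xmax"] - box["xmin"],
    box["ymax"] - box["ymin"],
    fill=False, edgecolor="red", linewidth=2,
)
ax.add_patch(rect)
ax.text(box["xmin"], box["ymin"] - 5, box["class"], color="red")
fig.savefig("inspect.png")
plt.close(fig)
```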
This script is by no means an exhaustive check of your dataset, but I hope it gives you a base for building your own checks.