Original article was published on Deep Learning on Medium
Object Detection using GluonCV
In this article, we will walk through how to use a pre-trained model for object detection with GluonCV.
1. Import Libraries
We will start by importing the required libraries: MXNet, GluonCV, and Matplotlib's pyplot.
2. Test Image
We will use the following image for object detection. The image has a few obvious objects: in the foreground, a dog just in front of a bike, and in the background, a tree and a car. We'd like a model to detect these objects.
3. Load the Image
So let’s load the image using imread().
4. Transform the Image
As seen above, the image has a data layout of HWC. Our image has a height of 576 pixels and a width of 768 pixels, and it's a colored image with three channels. So let's transform our image into the required format.
GluonCV provides a function that applies all of the necessary preprocessing steps for the YOLO network. We call yolo.transform_test with our image and also provide the short edge length of the output image with the short parameter. Our input image is landscape, so the height is smaller than the width. Using this function, the height will be resized to 512 pixels while maintaining the aspect ratio of the image.
The transform_test function returns two objects. The first object is the transformed image that is ready to be given to the network: a batch of a single image in NCHW format instead of HWC, an array of 32-bit floats instead of 8-bit integers, and normalized. The second object is just a resized version of the image, and we use this image for plotting our results.
We can plot the resized image.
We can see the effect of the resize: our short edge is now 512 pixels instead of 576, while the width remains one-third longer than the height.
5. Load Pretrained Model
We can use the get_model() function to load our pretrained model from the GluonCV model zoo. We'll use the YOLOv3 network with a DarkNet-53 backbone that has been trained on the COCO dataset.
Don't forget to set the pretrained argument to True.
6. Make Prediction
We can call the network just like a function: given an image, a prediction will be returned. When using detection models, we can expect three MXNet NDArrays to be returned. We can loop through the tuple and print out the shape of these arrays.
- The first array contains the object class indexes.
- The second array contains the object class probabilities.
- The last array contains the object bounding box coordinates.
Notice how the shape of each of these arrays starts with (1, 100). This is because our model can predict up to 100 objects in a single image. So for the first array, with a shape of (1, 100, 1), we have 1 image, 100 potential objects, and 1 class index per object.
And for the last array, with shape (1, 100, 4), we have 1 image, 100 potential objects, and 4 values for each object to define its bounding box.
Since we’re only performing object detection on one image, we can remove the additional batch dimension for all of the arrays and then unpack the tuple. We will give each array its own variable.
7. Object Class
Let’s take a closer look at the object class_indexes. Although our model can potentially detect 100 objects per image, let’s just take a look at the class indexes for the first ten objects.
Our first detected object has a predicted class of 16, and we see more objects with classes 1, 7, 2, 13 and 0. After this, we have a number of objects with a class index of -1.
-1 is a special class index that is used to indicate there is no detected object.
Therefore, we have six detected objects in total, with the remaining 94 potential slots being padded with -1 values. We can use the classes property of the network to look up the class labels. Our top object was class 16, and looking up that index in the class labels, we can see that it corresponds to dog.
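The padding convention can be illustrated with the values from the walkthrough (hypothetical values, not live network output):

```python
import numpy as np

# Six detections followed by -1 padding, matching the walkthrough above.
class_indexes = np.array([16, 1, 7, 2, 13, 0] + [-1] * 94, dtype=np.float32)

# -1 is the padding value, so counting entries != -1 gives the detections.
num_detected = int((class_indexes != -1).sum())
print(num_detected)  # 6

# With the pretrained model, network.classes maps indexes to labels; the
# first 17 labels of the standard COCO list are reproduced here to show
# that index 16 is 'dog'.
coco_classes = ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
                'train', 'truck', 'boat', 'traffic light', 'fire hydrant',
                'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog']
print(coco_classes[16])  # dog
```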
8. Object Probabilities
Similar to the object class_indexes, we can get the associated object class probability. We can interpret this as our confidence that the class index is correct.
If we use a confidence threshold of 50%, we can see that three objects have been detected. Our model is very confident in two of its detections, with probability scores in the high 90s. These could be the two foreground objects. We see -1 again: padded objects don't have a confidence score.
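Thresholding the probabilities can be sketched like this (hypothetical values consistent with the walkthrough, not live network output):

```python
import numpy as np

# Two very confident detections, one around 55%, a few low-confidence
# ones, and -1 for the padded slots.
probabilities = np.array([0.994, 0.925, 0.55, 0.22, 0.14, 0.09] + [-1] * 94)

# Keep only detections above a 50% confidence threshold.
threshold = 0.5
confident = probabilities > threshold
print(int(confident.sum()))  # 3 objects pass the threshold
```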
9. Bounding Box Coordinate
Four values are used to define the bounding box of each object: the coordinates of the top-left corner and of the bottom-right corner, which is four values in total.
Instead of interpreting the bounding boxes from a table, let's visualize them. GluonCV comes with a bounding box plot function. We can provide the resized image from earlier, and also each of the network outputs. Optionally, we can provide the class labels to add annotations to our plot.
The pre-trained network has done a good job of detecting objects in the image. It successfully detected a dog, a bike and a truck (the truck with only around 50% confidence). It missed the tree in the background because it was pre-trained on COCO, and COCO doesn't have an object class for trees.
In conclusion, we started by preprocessing the input image. We then loaded an object detection model from the model zoo and used it to generate a prediction. And finally, we interpreted the network outputs and visualized the detected objects.