Source: Deep Learning on Medium
Advancements in computer vision hold many promising applications such as self-driving cars or medical diagnosis. In these tasks, we rely on the machine’s ability to recognize objects.
There are four tasks related to object recognition we often see: classification and localization, object detection, semantic segmentation, and instance segmentation.
In classification and localization, we assign a class label to the object in the image and draw a bounding box around it. In this task, the number of objects to be detected is assumed to be fixed, often a single object.
Object detection differs from classification and localization because here, we make no assumption about the number of objects in the image beforehand. We start with a fixed set of object categories and aim to assign a class label and draw a bounding box for every instance of these categories that appears in the image.
In semantic segmentation, we assign a class label to each image pixel: all pixels belonging to the grass are labeled “grass”, those belonging to sheep are labeled “sheep”. Notably, this task does not distinguish between two individual sheep, for example.
Our task in this assignment is instance segmentation, which builds on both object detection and semantic segmentation. As in object detection, we aim to label and localize all instances of objects in predefined categories. However, instead of generating bounding boxes for detected objects, we go further by identifying which pixels belong to each object, as in semantic segmentation. The difference with semantic segmentation is that instance segmentation draws a separate mask for each object instance, while semantic segmentation uses a single mask for all instances of the same class.
In this article, we will train an instance segmentation model on a tiny Pascal VOC dataset with only 1,349 training images and 100 test images. The main challenge will be to prevent the model from overfitting without using external data.
You can find the datasets used and the full training and inference pipeline on GitHub.
The annotations are in the COCO format so we can use functions from pycocotools to retrieve class labels and masks. In this dataset, there are 20 categories in total.
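To make the COCO annotation format concrete, here is a minimal pure-Python sketch of its structure; the image file name, category IDs, and coordinates below are made up for illustration, and the real dataset has 20 categories rather than two.

```python
# A toy COCO-style annotation dictionary (hypothetical values).
coco = {
    "images": [
        {"id": 1, "file_name": "2007_000032.jpg", "height": 281, "width": 500},
    ],
    "categories": [
        {"id": 17, "name": "sheep"},
        {"id": 8, "name": "cat"},
    ],
    "annotations": [
        {"id": 101, "image_id": 1, "category_id": 17,
         "bbox": [12, 40, 110, 85],  # [x, y, width, height]
         # Segmentation is a list of polygons (flattened x, y pairs).
         "segmentation": [[12, 40, 122, 40, 122, 125, 12, 125]]},
    ],
}

# Map each annotation to its image and human-readable category name.
cat_names = {c["id"]: c["name"] for c in coco["categories"]}
for ann in coco["annotations"]:
    print(ann["image_id"], cat_names[ann["category_id"]], ann["bbox"])
```

In practice, pycocotools handles this for us: loading the file with `COCO(annotation_file)` gives index structures over these three lists, and `annToMask(ann)` converts an annotation’s polygon segmentation into a binary mask.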
Below are some visualizations of the training images and the associated masks. Different shades distinguish the separate masks of several instances of the same object category.
The images come in varying sizes and aspect ratios, so before feeding them into the model, we resize each image to 500×500. We scale each image so that its longer side is 500 pixels (upscaling smaller images as needed), then add zero padding to obtain a square image.
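The resize-and-pad step can be sketched as follows. This is a simplified illustration using nearest-neighbor sampling in NumPy; a real pipeline would typically use a library’s bilinear resize instead.

```python
import numpy as np

def resize_and_pad(img: np.ndarray, target: int = 500) -> np.ndarray:
    """Scale the image so its longer side equals `target`, then
    zero-pad the shorter side to get a square target x target image."""
    h, w = img.shape[:2]
    scale = target / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    # Nearest-neighbor resize via index sampling (for illustration only).
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]
    # Place the resized image on a zero canvas (zero padding).
    padded = np.zeros((target, target) + img.shape[2:], dtype=img.dtype)
    padded[:new_h, :new_w] = resized
    return padded

out = resize_and_pad(np.ones((300, 400, 3), dtype=np.uint8))
print(out.shape)  # (500, 500, 3)
```

Note that the same resizing and padding must be applied to the masks so they stay aligned with the image.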
For the model to generalize well, especially on a limited dataset such as this one, data augmentation is key to overcoming overfitting. For each image, we apply:
- a horizontal flip with probability 0.5;
- a random crop to between 0.9 and 1.0 times the original dimensions;
- a Gaussian blur with random standard deviation, with probability 0.5;
- a contrast adjustment by a factor between 0.75 and 1.5;
- a brightness adjustment by a factor between 0.8 and 1.2;
- a series of random affine transformations: scaling, translation, rotation, and shearing.
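A few of these transforms (horizontal flip, random crop, contrast and brightness jitter) can be sketched in plain NumPy, as below. This is a simplification with the parameters above; the blur and affine transformations are omitted, and in the actual pipeline an augmentation library applies all of them (to the masks as well, where geometry changes).

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img: np.ndarray) -> np.ndarray:
    """Simplified sketch of part of the augmentation pipeline."""
    h, w = img.shape[:2]
    # Horizontal flip with probability 0.5 (masks must be flipped too).
    if rng.random() < 0.5:
        img = img[:, ::-1]
    # Random crop to a scale between 0.9 and 1.0 of the original size.
    s = rng.uniform(0.9, 1.0)
    ch, cw = int(h * s), int(w * s)
    y0 = rng.integers(0, h - ch + 1)
    x0 = rng.integers(0, w - cw + 1)
    img = img[y0:y0 + ch, x0:x0 + cw]
    # Contrast: scale deviations from the mean by a factor in [0.75, 1.5].
    x = img.astype(np.float32)
    x = (x - x.mean()) * rng.uniform(0.75, 1.5) + x.mean()
    # Brightness: scale all values by a factor in [0.8, 1.2].
    x = x * rng.uniform(0.8, 1.2)
    return np.clip(x, 0, 255).astype(np.uint8)

out = augment(np.full((100, 100, 3), 128, dtype=np.uint8))
print(out.shape)
```

Because the augmentations are random, each training epoch effectively sees a slightly different version of every image, which is what helps a small dataset go further.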
We will use matterport’s implementation of Mask-RCNN for training. Though tempting, we will not use their MS COCO pre-trained weights, in order to show that we can obtain good results using only 1,349 training images.
Mask-RCNN was proposed in 2017 as an extension of Faster-RCNN by the same authors. Faster-RCNN is widely used for object detection: the model generates bounding boxes around detected objects. Mask-RCNN takes this a step further by also generating the object masks.
I will provide a quick overview of the model architecture below; for the details of the implementation, matterport has published a great article.