Source: Deep Learning on Medium
As part of R&D for a real estate client, I recently worked on a task that involved detecting objects in an image of a specific room type, say a kitchen. One of the main uses is automatic tagging, which could in turn be instrumental in generating better descriptions, optimizing search, and so on. It could also be put to use in estimating the price of a home: for example, the presence of an island countertop indicates that the kitchen is spacious, and if granite is the material used, it definitely indicates a costlier kitchen.
As a prerequisite, some knowledge of Convolutional Neural Networks and related concepts will help you appreciate this article better.
In subsequent articles, I will write about the pretrained model that was used, the Faster RCNN model, and how I used the TensorFlow Object Detection API to achieve the results.
What is object detection?
As the name suggests, it involves detecting the objects in an image along with their locations, typically using bounding boxes. In my case, I was looking for island countertops, traditional countertops, ovens, etc. in a kitchen image, as shown below:
How does it work?
Given an image, the early approaches to object detection take two steps:
- Dividing the image into multiple smaller pieces
- And then passing each piece into an image classifier, which outputs whether the piece contains an object or not. If yes, the piece is marked in the original image as a detected object.
The Sliding Window algorithm is one way of achieving the first step: a rectangular window is slid across the original image, and each window position yields a smaller piece, like the white box shown in the image below.
This involves generating windows of multiple images, taking into consideration various aspect ratios (sizes), angles, and shapes, and then feeding them into a ConvNet for the classification part. The windows form the bounding boxes. Since this has to be repeated many times, the process is computationally expensive.
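The window-enumeration step above can be sketched in a few lines. This is a minimal illustration, not a full detector: it only generates the candidate boxes for one window size; in practice each crop would then be resized and passed to the classifier, and the loop repeated for several window sizes and aspect ratios.

```python
def sliding_windows(img_h, img_w, win_h, win_w, stride):
    """Enumerate (y, x, h, w) boxes by sliding a fixed-size window
    over an img_h x img_w image with the given stride."""
    boxes = []
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            boxes.append((y, x, win_h, win_w))
    return boxes

# Example: a 256x256 image, a 64x64 window, stride 32
boxes = sliding_windows(256, 256, 64, 64, 32)
print(len(boxes))  # 7 positions per axis -> 49 windows
```

Even this single scale already produces 49 classifier calls for one small image, which is why the approach scales so poorly once multiple window sizes and finer strides are added.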
The next set of algorithms took a different approach to localizing objects. Rather than sliding a window across the picture, these methods group similar pixels in the image to form a ‘region’. This grouping is done using image segmentation. The regions are then fed to a classifier to identify the class present.
A further improvement over plain image segmentation was claimed by the authors of the Selective Search algorithm.
The Selective Search algorithm emphasizes a hierarchical grouping-based segmentation, using multiple strategies in place of one (as opposed to just a single run of image segmentation) to cover as many cases as possible.
The hierarchical grouping starts with a set of initial regions; the most similar regions are then merged to form new regions, and this process continues until the whole image is represented by a single region. The regions produced at each step are added to the region proposals. Similarity is calculated in terms of alikeness of texture, color, and so on.
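The greedy merge loop described above can be sketched as follows. This is a toy illustration only: regions are represented as sets of pixel indices, and `sim` is a placeholder similarity function; the real algorithm works on a superpixel segmentation and combines colour, texture, size, and fill similarities.

```python
def hierarchical_grouping(regions, similarity):
    """Greedy hierarchical grouping: repeatedly merge the most similar
    pair of regions, recording every region produced as a proposal,
    until one region covers the whole image."""
    regions = list(regions)
    proposals = list(regions)  # initial regions are proposals too
    while len(regions) > 1:
        # find the most similar pair of remaining regions
        i, j = max(
            ((a, b) for a in range(len(regions)) for b in range(a + 1, len(regions))),
            key=lambda p: similarity(regions[p[0]], regions[p[1]]),
        )
        merged = regions[i] | regions[j]  # union of the two pixel sets
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged)
    return proposals

# Placeholder similarity: prefer merging similar-sized regions
def sim(a, b):
    return -abs(len(a) - len(b))

props = hierarchical_grouping([{1}, {2}, {3, 4}], sim)
print(len(props))  # 3 initial regions + 2 merges = 5 proposals
```

Note how proposals at every level of the hierarchy are kept, which is what lets Selective Search propose both small parts and whole objects.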
Although region proposal methods like Selective Search do well on ground-truth recall and repeatability, they still lag when speed is important. The authors of the Faster RCNN algorithm call this initial step a ‘bottleneck’ that takes as much time as the actual object detection, and they introduced the Region Proposal Network to remove it.
Faster RCNN Algorithm and the Region Proposal Network (RPN)
The algorithms thus far treated object detection as separate from region proposal. The outputs of the region proposal method were fed to a classifier, and the performance of the classifier depended on how well the region proposal method worked.
The authors of Faster RCNN proposed a unified approach for both the tasks in hand — which meant that both the region proposal and classifier shared the same convolutional features.
The RPN outputs an objectness score, which indicates whether the piece of the image in question (an anchor) contains a foreground object or just background, i.e. a binary classification task that outputs ‘object’ or ‘no object’.
The classifier network outputs the score for each of the classes that we want to classify the objects into.
From the original paper:
“An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection.”
In simple words, the RPN works in the following way:
1. The feature map generated by the last conv layer of the shared convolutional network is taken, and a small network is slid over the feature map.
2. The output of this is fed into two fully connected layers (as shown below):
a. A classification (cls) layer
b. A regression (reg) layer
3. The red box in the above figure is called an anchor. Every anchor is associated with a scale and an aspect ratio. In the original paper, each sliding-window location uses 3 aspect ratios and 3 scales, yielding 9 anchors at every location.
4. The reg layer outputs the coordinates for each of the maximum ‘k’ boxes (4k outputs), whereas the cls layer outputs the probability that each of the ‘k’ boxes contains an object or not (2k outputs).
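The anchor scheme and the 4k/2k output counts can be made concrete with a small sketch. The scales and ratios below mirror the 3 × 3 setup described above, but the box arithmetic (fixing the anchor area while varying its aspect ratio) is a simplification of the paper's exact generation code.

```python
def generate_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Build k = len(scales) * len(ratios) anchors, as (w, h) pairs,
    centred at a single sliding-window position."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # keep the anchor's area fixed while varying its aspect ratio
            area = (base_size * scale) ** 2
            w = round((area / ratio) ** 0.5)
            h = round(w * ratio)
            anchors.append((w, h))
    return anchors

anchors = generate_anchors()
k = len(anchors)
print(k, 4 * k, 2 * k)  # 9 anchors -> 36 reg outputs, 18 cls outputs per location
```

With k = 9 anchors per location, the reg layer emits 36 numbers (four box coordinates per anchor) and the cls layer emits 18 (object/no-object scores per anchor) at every sliding-window position.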
This network was trained end-to-end using backpropagation and SGD.
I will explain more about how this works with the Fast RCNN network, and the loss functions used, in the next post. Thanks for reading; your feedback is much appreciated.