Detection and Segmentation through ConvNets


Computer vision — Object detection and segmentation

There is a wide variety of applications of neural networks in the realm of computer vision, and with a bit of a twist the same tools and techniques can be applied effectively across a wide range of tasks. In this article we’ll walk through a few of those applications and ways to approach them. The four most common ones are:

  • Semantic segmentation
  • Classification and localization
  • Object detection
  • Instance segmentation

Semantic segmentation

We input an image and output a category decision for each individual pixel. In other words, we wish to classify each and every pixel into one of several possible categories. This means all pixels belonging to sheep would be classified into a single category, as would pixels belonging to grass or road. Importantly, the output doesn’t distinguish between two different sheep.

One possible way to approach this problem is to treat it as a classification problem with a sliding window. We take an input image and break it into several crops of the same size. Each crop is then fed to some CNN, which outputs the classified category for that crop. Crops taken at the pixel level would classify each and every pixel. That’s super easy, isn’t it?

Semantic segmentation using sliding window

Well, it doesn’t even require a graduate degree to see how computationally inefficient this method would be in practice: overlapping crops repeat the same convolutions thousands of times. What we need is a way to reduce the number of passes over an image, ideally to a single pass. Fortunately, there are techniques for building a network out of convolutional layers alone that makes predictions for all pixels at once.

Fully convolutional network for semantic segmentation

As you can see, such a network would be a mix of downsampling and upsampling layers, so as to preserve the spatial size of the input image (to make predictions at the pixel level). Downsampling is achieved with strided convolutions or max/average pooling. Upsampling, on the other hand, requires some clever techniques, two of which are nearest-neighbor interpolation and transposed convolution.

Upsampling techniques

In short, nearest neighbor simply duplicates each element across its receptive field (2×2 in the example above). Transposed convolution, on the other hand, learns the filter weights needed to perform upsampling. Here we start with the top-left value of the input, which is a scalar, multiply it by the filter, and copy the resulting values into the output cells. We then move the filter some specific number of pixels in the output for every one-pixel movement in the input; this ratio between movement in the output and movement in the input gives us the stride. Where outputs overlap, we simply sum the values. These filters thus constitute learnable parameters of the network, rather than a fixed scheme as in nearest neighbor. Finally, we can apply a cross-entropy loss at the pixel level and train the whole network through backpropagation.
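To make this concrete, here is a minimal PyTorch sketch of such a network; the layer sizes and the 21-class output are illustrative assumptions, not values from the figures above. It downsamples with strided convolutions, upsamples with transposed convolutions back to the input resolution, and trains with a per-pixel cross-entropy loss.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Minimal fully convolutional net: downsample, then upsample
    back to the input resolution so every pixel gets a class score."""
    def __init__(self, num_classes=21):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),    # H/2
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # H/4
            nn.ReLU(inplace=True),
        )
        self.up = nn.Sequential(
            # Transposed convolutions learn their upsampling filters.
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),           # H/2
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, num_classes, kernel_size=4, stride=2, padding=1),   # H
        )

    def forward(self, x):
        return self.up(self.down(x))

model = TinyFCN(num_classes=21)
images = torch.randn(2, 3, 64, 64)           # batch of RGB images
labels = torch.randint(0, 21, (2, 64, 64))   # ground-truth class per pixel
logits = model(images)                       # (2, 21, 64, 64)
# Cross-entropy applied at the pixel level, exactly as described above.
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()
```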

Classification and localization

Image classification deals with assigning a category label to an image. But sometimes, in addition to predicting the category, we are also interested in the location of that object in the image. Concretely, we might want to draw a bounding box around the object, as shown in the picture at the top. Fortunately, we can reuse all the tools and techniques we learned for image classification.

Convolutional network for Classification + Localization

We first feed our input image to some giant ConvNet, which gives us scores for each category. But now we add another fully connected layer that predicts the coordinates of the bounding box for the object (the x, y coordinates of its center along with its height and width) from the feature map produced by the earlier layers. So our network produces two outputs, one for the image class and the other for the bounding box. To train this network we therefore have to account for two losses: a cross-entropy loss for classification and an L1/L2 loss (some kind of regression loss) for the bounding box predictions.
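Here is a sketch of such a two-headed network in PyTorch; the tiny backbone and the 20 categories are placeholders standing in for the “giant ConvNet” above. The key point is two heads sharing one feature vector and the two losses simply being summed.

```python
import torch
import torch.nn as nn

class ClassifyAndLocalize(nn.Module):
    def __init__(self, num_classes=20):
        super().__init__()
        # Stand-in for a "giant ConvNet" backbone producing a feature vector.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.class_head = nn.Linear(32, num_classes)  # category scores
        self.box_head = nn.Linear(32, 4)              # (x, y, w, h)

    def forward(self, x):
        feats = self.backbone(x)
        return self.class_head(feats), self.box_head(feats)

model = ClassifyAndLocalize()
images = torch.randn(4, 3, 128, 128)
class_targets = torch.randint(0, 20, (4,))
box_targets = torch.rand(4, 4)

scores, boxes = model(images)
# Two losses: cross-entropy for the class, a regression loss for the box.
loss = nn.CrossEntropyLoss()(scores, class_targets) \
     + nn.SmoothL1Loss()(boxes, box_targets)
loss.backward()
```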

Broadly, this idea of predicting a fixed set of numbers can be applied to a wide variety of computer vision tasks beyond localization, such as human pose estimation.

Human pose estimation

Here, we can define a person’s pose by a fixed set of points on the body, for example the joints. We then input our image to a ConvNet and output the (x, y) coordinates of that same fixed set of points. We can then apply some kind of regression loss to each of those points and train the network through backpropagation.
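For instance, assuming a skeleton of 14 joints (a common but arbitrary choice here), the regression loss is just a few lines:

```python
import torch
import torch.nn as nn

NUM_JOINTS = 14  # assumed; depends on the pose dataset

# Stand-in for the ConvNet's final regression output: one (x, y) per joint.
pred = torch.randn(8, NUM_JOINTS, 2, requires_grad=True)  # network output
target = torch.rand(8, NUM_JOINTS, 2)                     # ground-truth joints

# Regression (L2) loss applied to every predicted joint location.
loss = nn.MSELoss()(pred, target)
loss.backward()
```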

Object Detection

The idea of object detection is that we start with some fixed set of categories we are interested in, and any time one of these categories appears in the input image, we draw a bounding box around that object and predict its class label. This differs from classification and localization in that there we classify and draw a bounding box around only a single object, whereas in detection we do not know ahead of time how many objects to expect in the image. Again, we could apply a brute-force sliding-window approach to this problem as well, but that would again be computationally inefficient. Instead, a few algorithms have been developed to solve this problem efficiently: region-proposal-based algorithms and the YOLO object detection algorithm.

Region-proposal-based algorithms

Given an input image, a region proposal algorithm outputs thousands of boxes where an object might be present. There will surely be some noise in the output, i.e., boxes that contain no object. However, if there is any object in the image, it is very likely to be covered by at least one of the candidate boxes.

Selective search for region proposals
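Selective search is one such algorithm. If you have the opencv-contrib-python package installed, a sketch like the following produces candidate boxes; the image path is a placeholder:

```python
import cv2

img = cv2.imread("sheep.jpg")  # placeholder path

# Selective search from opencv-contrib: proposes ~2000 candidate boxes.
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()   # trade proposal quality for speed
rects = ss.process()               # array of (x, y, w, h) proposals
print(len(rects), "candidate boxes")
```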

To make all candidate boxes the same size, we warp them to some fixed square size that we can feed to the network. We can then apply a giant ConvNet to each of the candidate boxes output by the region proposal algorithm to get a final category for each. This ends up being much more computationally efficient than the brute-force sliding-window approach. That is the whole idea behind R-CNN. To further reduce the cost, Fast R-CNN is used. The idea behind Fast R-CNN is to first get a high-resolution feature map by passing the input image through a ConvNet, and then project the region proposals onto this feature map instead of the actual image. This lets us reuse a lot of the expensive convolutional computation across the entire image when we have many crops.
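The crucial Fast R-CNN step, pooling each proposal out of the shared feature map to a fixed size, is available off the shelf, for example as torchvision’s roi_align; the sizes below are illustrative:

```python
import torch
from torchvision.ops import roi_align

# One shared, expensive forward pass produces the feature map.
feature_map = torch.randn(1, 256, 50, 50)  # e.g. input image downsampled 16x

# Region proposals in input-image coordinates: (batch_idx, x1, y1, x2, y2).
proposals = torch.tensor([[0,  10.,  10., 200., 200.],
                          [0, 300.,  40., 520., 310.]])

# Project the proposals onto the feature map and pool each to a fixed 7x7,
# so every crop can go through the same classification head.
crops = roi_align(feature_map, proposals, output_size=(7, 7),
                  spatial_scale=1.0 / 16)
print(crops.shape)  # torch.Size([2, 256, 7, 7])
```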

YOLO (You only look once)

YOLO object detection

The idea behind YOLO is, instead of doing independent processing for each proposed region, to make all the predictions at once by reframing detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities.

We first divide the whole input image into an S×S grid. Each grid cell predicts C conditional class probabilities, Pr(Class | Object), along with B bounding boxes (x, y, w, h), each with a confidence score. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell, whereas the width and height are predicted relative to the whole image. The class probabilities are conditioned on the grid cell containing an object, and we predict only one set of them per grid cell, regardless of the number of boxes B. The confidence scores reflect how confident the model is that a box contains an object: if no object is present, the confidence score should be zero; otherwise it should equal the intersection over union (IOU) between the predicted box and the ground-truth box.

                 Confidence score = Pr(Object) * IOU

At test time we multiply the conditional class probabilities and the individual box confidence predictions, which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.

      Pr(Class | Object) ∗ (Pr(Object) ∗ IOU) = Pr(Class) ∗ IOU
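Putting the pieces together, here is a sketch of how the raw YOLO output tensor is laid out and how the test-time scores fall out of it; S = 7, B = 2 and C = 20 follow the original paper’s defaults and are assumed here:

```python
import torch

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (YOLO v1 defaults)

# The network's final layer emits S*S*(B*5 + C) numbers per image.
raw = torch.randn(S, S, B * 5 + C)

boxes = raw[..., :B * 5].reshape(S, S, B, 5)  # (x, y, w, h, confidence)
box_conf = boxes[..., 4]                      # Pr(Object) * IOU, per box
class_probs = raw[..., B * 5:]                # Pr(Class | Object), per cell

# Test time: class-specific score = Pr(Class | Object) * Pr(Object) * IOU
#                                 = Pr(Class) * IOU
scores = class_probs.unsqueeze(2) * box_conf.unsqueeze(-1)  # (S, S, B, C)
print(scores.shape)  # torch.Size([7, 7, 2, 20])
```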

Instance Segmentation

Instance segmentation employs techniques from both semantic segmentation and object detection. Given an image, we want to predict the location and identity of the objects in that image (similar to object detection); however, rather than predicting a bounding box for each object, we want to predict its whole segmentation mask, i.e., which pixels in the input image correspond to which object instance. Here we get a separate segmentation mask for each of the sheep in the image, in contrast to semantic segmentation, where all the sheep got the same segmentation mask.

Mask R-CNN for instance segmentation

Mask R-CNN is the preferred network for this kind of task. In this multi-stage architecture, we pass an input image through a ConvNet and a learned region proposal network. Once we have the region proposals, we project them onto the convolutional feature map, just as we did in the case of Fast R-CNN. Now, however, in addition to making classification and bounding box predictions, we also predict a segmentation mask for each of those region proposals.
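In practice you rarely wire this up from scratch; torchvision, for example, ships a pretrained Mask R-CNN. A minimal inference sketch (the call downloads weights, and the random image merely shows the interface):

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Pretrained Mask R-CNN: region proposal network + box/class/mask heads.
# torchvision >= 0.13; older versions use pretrained=True instead.
model = maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)  # RGB image with values scaled to [0, 1]
with torch.no_grad():
    out = model([image])[0]      # one dict per input image

# Per detected instance: a box, a label, a score, and a segmentation mask.
print(out["boxes"].shape)        # (N, 4)
print(out["labels"].shape, out["scores"].shape)
print(out["masks"].shape)        # (N, 1, 480, 640), one mask per instance
```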


Please let me know in the comments about any improvements or modifications this article could accommodate.
