Source: Deep Learning on Medium
2. Fast R-CNN
💡 So the next idea from the same authors: why not compute a convolutional feature map of the input image once and then just select the regions from that map? Do we really need to run so many convnets? Instead, we can run a single convnet, apply the region proposal crops to the features it computes, and use a simple classifier (an SVM, for example) to classify those crops.
From Paper: Fig. illustrates the Fast R-CNN architecture. A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.
So the basic idea is to run the convolutional network only once per image, rather than once per region proposal as in R-CNN. We then map the ROI proposals onto the output of the last convolutional layer, crop the features for each proposal, and run a final classifier on each crop.
This idea depends somewhat on the architecture of the underlying model.
Here is how the authors describe the architecture:
We experiment with three pre-trained ImageNet networks, each with five max pooling layers and between five and thirteen conv layers (see Section 4.1 for network details). When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations. First, the last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16). Second, the network’s last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K + 1 categories and category-specific bounding-box regressors). Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.
Don’t worry if you don’t understand the above. It is a little confusing, so let us break it down. But for that, we need to look at the VGG16 architecture first.
The output of VGG16's last pooling layer is 7x7x512. This is the layer the authors replace with the ROI pooling layer. The ROI pooling layer takes two inputs: the location of the region proposal (xmin_roi, ymin_roi, h_roi, w_roi) and the previous feature map (14x14x512).
The ROI coordinates are given in the units of the input image, i.e. pixels of the 224×224 image. But the layer on which we have to apply the ROI pooling operation is 14x14x512.
Because VGG transforms the image (224 x 224 x 3) into a (14 x 14 x 512) feature map, the height and width are both divided by 16. We can therefore map ROI coordinates onto the feature map just by dividing them by 16.
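The divide-by-16 mapping can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `roi_to_feature_map`, the (xmin, ymin, w, h) box format, and the choice to round the start down and the end up are all assumptions made for the example.

```python
def roi_to_feature_map(roi, stride=16):
    """Map an (xmin, ymin, w, h) box in image pixels to feature-map cells.

    stride=16 is the total downsampling of VGG16 before the last pooling
    layer (224 / 14). The rounding convention is an assumption.
    """
    xmin, ymin, w, h = roi
    # Round the start down and the end up so the mapped box covers the ROI.
    x0 = xmin // stride
    y0 = ymin // stride
    x1 = -(-(xmin + w) // stride)  # ceiling division
    y1 = -(-(ymin + h) // stride)
    # Keep at least one cell in each direction.
    return x0, y0, max(x1 - x0, 1), max(y1 - y0, 1)


print(roi_to_feature_map((32, 48, 96, 112)))  # (2, 3, 6, 7)
print(roi_to_feature_map((0, 0, 224, 224)))   # (0, 0, 14, 14)
```

The second call shows that a box covering the whole 224×224 image maps to the whole 14×14 feature map.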
In its depth, the convolutional feature map has encoded all the information for the image while maintaining the location of the “things” it has encoded relative to the original image. For example, if there was a red square on the top left of the image and the convolutional layers activate for it, then the information for that red square would still be on the top left of the convolutional feature map.
What is ROI pooling?
Remember that the final classifier runs on each crop, so every crop needs to be the same size. That is what ROI pooling does: it turns a region of any size into a fixed-size output.
In the above image, our region proposal is (0,3,5,7) in x,y,w,h format.
We divide that area into 4 regions since we want a 2×2 ROI pooling output. The bucket boundaries come from rounding 5/2 and 7/2, so the 5-wide side splits into widths of 2 and 3 and the 7-tall side into heights of 3 and 4. We then just take the max within each bucket.
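Here is a small sketch of that bucketed max-pool in plain Python, using the same (0,3,5,7) proposal. The function name `roi_max_pool`, the toy feature map, and the use of integer division for the bucket edges are assumptions for illustration. Real implementations work on all 512 channels at once and may place the edges slightly differently.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one ROI of a 2-D feature map (a list of rows) to out_h x out_w.

    roi is (x, y, w, h) in feature-map cells; assumes w >= out_w and h >= out_h.
    """
    x, y, w, h = roi
    pooled = []
    for i in range(out_h):
        # Integer bucket edges: a height of 7 split in two gives 3 and 4 rows.
        r0 = y + (i * h) // out_h
        r1 = y + ((i + 1) * h) // out_h
        row = []
        for j in range(out_w):
            c0 = x + (j * w) // out_w
            c1 = x + ((j + 1) * w) // out_w
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Hypothetical single-channel feature map, 10 rows by 8 columns.
fmap = [[r * 8 + c for c in range(8)] for r in range(10)]
print(roi_max_pool(fmap, (0, 3, 5, 7)))  # [[41, 44], [73, 76]]
```

Whatever the size of the proposal, the output is always 2×2 here, which is what lets the same classifier run on every crop.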
How do you do ROI pooling on areas smaller than the target size? Suppose the region proposal is 5×5 and the ROI pooling layer is 7×7. In that case, we resize the region to 35×35 by copying each cell 7 times along each axis, and then max-pool back down to 7×7 using 5×5 windows.
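The copy-then-pool trick for small regions can be written out literally as follows. This is only a sketch of the procedure described in the text, with a made-up function name `upsample_then_pool` and a toy 5×5 region. It is not how any particular library implements ROI pooling.

```python
def upsample_then_pool(region, out=7):
    """Pool an n x n region (n < out) to out x out by repeating cells first."""
    n = len(region)
    # Repeat every cell `out` times along both axes: n x n -> (n*out) x (n*out).
    big = [[region[r // out][c // out] for c in range(n * out)]
           for r in range(n * out)]
    # Each output cell is the max over one n x n window of the enlarged grid.
    return [[max(big[i * n + a][j * n + b]
                 for a in range(n) for b in range(n))
             for j in range(out)]
            for i in range(out)]


# Hypothetical 5x5 region with distinct values 0..24.
region = [[r * 5 + c for c in range(5)] for r in range(5)]
pooled = upsample_then_pool(region)
print(len(pooled), len(pooled[0]))  # 7 7
```

With a 5×5 input, the enlarged grid is 35×35 and each of the 49 output cells covers a 5×5 window, matching the numbers in the text.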
After replacing the pooling layer, the authors also replaced the 1000-way ImageNet classification layer with a fully connected layer and softmax over K + 1 categories (+1 for background) and category-specific bounding-box regressors.
What is the input to Fast R-CNN?
Pretty much the same as for R-CNN: we have an image, region proposals from an external proposal method (Selective Search, as in R-CNN), and the ground truth (class labels and ground-truth boxes).
Next, we treat all region proposals with ≥ 0.5 IoU (Intersection over Union) overlap with a ground-truth box as positive training examples for that box’s class and the rest as negative (background). This time we have a dense layer on top, and we use a multi-task loss.
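To make the labelling rule concrete, here is a minimal sketch of IoU and the ≥ 0.5 assignment. The helper names `iou` and `label_proposals`, the corner-style (xmin, ymin, xmax, ymax) box format, and the example boxes are assumptions for illustration. Class 0 stands for background.

```python
def iou(a, b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_proposals(proposals, gt_boxes, gt_classes, threshold=0.5):
    """Give each proposal the class of its best-overlapping ground-truth box,
    or 0 (background) if the best IoU is below the threshold."""
    labels = []
    for p in proposals:
        best_iou, best_cls = 0.0, 0
        for g, cls in zip(gt_boxes, gt_classes):
            overlap = iou(p, g)
            if overlap > best_iou:
                best_iou, best_cls = overlap, cls
        labels.append(best_cls if best_iou >= threshold else 0)
    return labels


gt_boxes, gt_classes = [(0, 0, 10, 10)], [3]       # one object of class 3
proposals = [(0, 0, 10, 8), (5, 5, 15, 15)]        # IoU 0.8 and about 0.14
print(label_proposals(proposals, gt_boxes, gt_classes))  # [3, 0]
```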
So every ROI becomes a training example. The main difference from R-CNN is the multi-task loss:
A Fast R-CNN network has two sibling output layers.
The first outputs a discrete probability distribution (per RoI), p = (p0, . . . , pK), over K + 1 categories. As usual, p is computed by a softmax over the K+1 outputs of a fully connected layer.
The second sibling layer outputs bounding-box regression offsets, t = (tx, ty, tw, th), for each of the K object classes. Each training RoI is labelled with a ground-truth class u and a ground-truth bounding-box regression target v. We use a multi-task loss L on each labelled RoI to jointly train for classification and bounding-box regression:
L(p, u, t^u, v) = Lcls(p, u) + λ [u ≥ 1] Lloc(t^u, v)
Here Lcls is the softmax classification (log) loss, Lloc is the regression loss, and λ balances the two terms. The indicator [u ≥ 1] is 1 when u ≥ 1 and 0 otherwise: u = 0 is the background class, so we add the regression term to the loss only when the RoI has a bounding box for one of the other classes.
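The multi-task loss for a single RoI can be sketched as below. The classification term is the log loss on the true class. For the regression term this sketch uses the smooth L1 loss from the Fast R-CNN paper, which this article does not spell out. The function names and the example numbers are illustrative assumptions.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    x = abs(x)
    return 0.5 * x * x if x < 1 else x - 0.5


def multi_task_loss(p, u, t_u, v, lam=1.0):
    """Loss for one RoI: Lcls(p, u) + lam * [u >= 1] * Lloc(t_u, v).

    p   : softmax probabilities over K + 1 classes (index 0 = background)
    u   : ground-truth class index
    t_u : predicted box offsets (tx, ty, tw, th) for class u
    v   : ground-truth regression target
    """
    l_cls = -math.log(p[u])
    # Background RoIs (u == 0) have no box, so the regression term is skipped.
    l_loc = sum(smooth_l1(ti - vi) for ti, vi in zip(t_u, v)) if u >= 1 else 0.0
    return l_cls + lam * l_loc


# Foreground RoI: both terms contribute.
print(multi_task_loss([0.2, 0.8], 1, (0.5, 0.0, 0.0, 2.0), (0.0, 0.0, 0.0, 0.0)))
# Background RoI: only the classification term contributes.
print(multi_task_loss([0.5, 0.5], 0, (1.0, 1.0, 1.0, 1.0), (0.0, 0.0, 0.0, 0.0)))
```

In the second call the box offsets are far from the target, but because u = 0 they are ignored and the loss is just -log(0.5).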
Region proposals are still taking up most of the time. Can we reduce the time taken for Region proposals?