Deep Learning for Object Detection and Localization using Fast R-CNN

Source: Deep Learning on Medium

Deep Learning for Object Detection Part II

A deep dive into the improved R-CNN approach


Deep Learning for Object Detection Part II: A Deep Dive Into Fast R-CNN is the second article in our Deep Learning for Object Detection series, which explores state-of-the-art, region-based object detection methods and their evolution over time. In this piece, we look at the innovations that improve training and testing speed and overcome many of the drawbacks of the R-CNN approach.

In Part I, we saw how R-CNN, pioneered by Ross Girshick and his team, sparked new research interest in region-based object detection. The following year, they released a new paper detailing how they souped up R-CNN.

We will look at how they improved upon R-CNN and which techniques they used, so that we can understand why this method is called Fast R-CNN. We have also provided sample TensorFlow code to aid understanding of how the tensors flow between layers and how spatial pyramid pooling works in practice.

R-CNN vs Fast R-CNN

The landscape of computer vision, and more specifically object detection, has changed dramatically since Convolutional Neural Networks (CNNs) were first applied to the task around 2013. As you can see in the graph below, precision had plateaued at around 40% before deep learning for object detection took off. Advances in research, coupled with access to faster and more powerful hardware, then drove precision up rapidly, and the current state of the art is around 80%.

There was a meteoric rise in precision after CNNs were used for object detection. Image Source
R-CNN modules. Image source

As we saw last time, R-CNN relies on region proposals that are pre-computed with a selective search (SS) approach. Features are not shared between these regions, so we have to extract features individually for each region of interest (RoI) that SS generates. This is not computationally efficient and makes real-time object detection impractical.

Fast R-CNN modules Image source

Fast R-CNN, by contrast, trains a deep `VGG-16` network 9x faster than R-CNN, is 213x faster at test time, and achieves a higher mAP on PASCAL VOC 2012. How does it achieve this huge gain in speed? Fast R-CNN runs the network and extracts features for the whole image once, instead of extracting features separately from each RoI, which R-CNN crops and rescales every time. It then uses RoI pooling, a special case of the spatial pyramid pooling used in SPPNet, to produce a feature vector of the desired length for each RoI. This feature vector is then used for classification and localization. This method is more efficient than R-CNN because the computation for overlapping regions is shared. Let us unpack the details of the process to gain a deeper understanding.

Architectural Design of Fast R-CNN

Building blocks of Fast R-CNN:

  1. Region proposal module (selective search)
  2. Feature extraction using CNN
  3. RoI pooling layer — This is where the real magic of Fast R-CNN happens
  4. Classification and Localization
Building blocks of Fast R-CNN Image Source

The region proposal and feature extraction modules are very similar to what we have seen in R-CNN. But instead of passing in each cropped and rescaled RoI, the entire input image is passed through a feature extractor like `VGG-16` to produce a convolutional feature map. In the RoI pooling layer, this feature map is combined with the region proposals (generated with the SS approach, not by a learned network) to form a fixed-length feature vector for each RoI. Each of these feature vectors is then passed along to the classification and localization modules. The classification module assigns each RoI to one of K+1 classes (K object classes plus 1 for background) using a softmax probability. The localization module outputs four real-valued numbers for each of the K object classes.
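To make the output shapes of the two modules concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `detection_heads`, the random weights, the 4096-dimensional RoI feature vector, and K = 20 are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x):
    # Subtract the row-wise max for numerical stability.
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def detection_heads(roi_features, num_classes, rng):
    """Hypothetical sibling heads on top of per-RoI feature vectors.

    roi_features: (num_rois, feature_dim) array, one row per RoI.
    num_classes:  K, the number of object classes (background excluded).
    """
    num_rois, feature_dim = roi_features.shape
    # Randomly initialised weights stand in for trained ones (assumption).
    w_cls = rng.normal(scale=0.01, size=(feature_dim, num_classes + 1))
    w_box = rng.normal(scale=0.001, size=(feature_dim, 4 * num_classes))
    # Classification: softmax over K + 1 classes (1 for background).
    class_probs = softmax(roi_features @ w_cls)
    # Localization: four real-valued numbers for each of the K classes.
    box_outputs = (roi_features @ w_box).reshape(num_rois, num_classes, 4)
    return class_probs, box_outputs

rng = np.random.default_rng(0)
roi_features = rng.normal(size=(8, 4096))  # 8 RoIs, illustrative feature size
class_probs, box_outputs = detection_heads(roi_features, num_classes=20, rng=rng)
print(class_probs.shape)  # (8, 21)
print(box_outputs.shape)  # (8, 20, 4)
```

Each row of `class_probs` sums to 1, and `box_outputs[i, k]` holds the four box values predicted for RoI `i` under class `k`.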

Spatial Pyramid Pooling

Let’s see what Fast R-CNN does differently that makes it much faster than its predecessor. To understand RoI pooling, it is vital to understand how SPPNet uses spatial pyramid pooling (SPP) to extract fixed-length output vectors. SPP is heavily inspired by widely used concepts in image processing and computer vision, chief among them the image pyramid.

Image pyramids are traditionally used in computer vision to generate filter-based decomposed representations of the original image, so that useful features can be extracted at multiple scales. They are also used for storing compressed image representations. Based on this concept, let us take a look at the following figure, which uses a three-level SPP.

A network structure with a spatial pyramid pooling layer. Image Source

Imagine we have a seven-layer-deep CNN. Let’s say the fifth convolutional layer `conv_5` has a depth (number of filters) of 256, which means its output is 256 individual feature maps stacked along the depth (channel) axis. Normally, this output would have to be flattened and connected to a fully connected layer, as some previous CNN architectures did. What SPP does differently is to convert this `m x n x 256` output (where m and n are the height and width of the `conv_5` feature map) into a fixed-length one-dimensional vector by max-pooling it at several scales and concatenating the results.

The coarsest pyramid level (at the very bottom, indicated in white) uses a global pooling operation that covers the entire image. This creates a `256-d` flattened vector. The middle level (indicated in green) pools 4 values from each feature map, giving a `4 x 256-d` flattened vector. The topmost level (blue) pools 16 values from each feature map, giving a `16 x 256-d` vector. So, in this example, we have a `256 + 4 x 256 + 16 x 256 = 5376`-dimensional flattened vector that will be fed to our fully connected layer. For a more in-depth understanding of the math, please refer to this GitHub repo.
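The three-level pooling described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the SPPNet reference code; the function name `spatial_pyramid_pool` and the floor/ceil rule for choosing bin boundaries are assumptions.

```python
import math
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (height, width, depth) feature map into a fixed-length vector.

    Each level n splits the map into an n x n grid of bins, so the output
    length is depth * sum(n * n for n in levels), whatever the input size.
    """
    height, width, _ = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(n):
            # Bin edges scale with the feature map size (assumed floor/ceil rule).
            row_start = math.floor(i * height / n)
            row_end = math.ceil((i + 1) * height / n)
            for j in range(n):
                col_start = math.floor(j * width / n)
                col_end = math.ceil((j + 1) * width / n)
                window = feature_map[row_start:row_end, col_start:col_end, :]
                # One max value per channel for this bin.
                pooled.append(window.max(axis=(0, 1)))
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
small = rng.normal(size=(13, 13, 256))  # e.g. a 13 x 13 conv_5 output
wide = rng.normal(size=(10, 17, 256))   # a different spatial size
print(spatial_pyramid_pool(small).shape)  # (5376,)
print(spatial_pyramid_pool(wide).shape)   # (5376,)
```

The first 256 entries come from the global (1 x 1) level, the next 1024 from the 2 x 2 level, and the last 4096 from the 4 x 4 level.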

Need for Spatial Pyramid Pooling

But why do we have to do spatial pyramid pooling in the first place? CNNs, which have densely connected layers at their deeper levels, traditionally require a fixed input size determined by the architecture used. For example, it is common practice to use `227 x 227 x 3` inputs for AlexNet. So why do CNNs require a fixed image size? Let’s analyze this quote from the SPPNet paper:

In fact, convolutional layers do not require a fixed image size and can generate feature maps of any sizes. On the other hand, the fully-connected layers need to have fixed size/length input by their definition. Hence, the fixed size constraint comes only from the fully-connected layers, which exist at a deeper stage of the network.

The authors make a very important point: it is the fully connected layers that require a fixed input size. Why? As we know, backpropagation is always done with respect to weights and biases. In convolutional layers, these weights are nothing but the filters. The number and size of the filters are decided a priori and are always independent of the image height and width. A fully connected layer, on the other hand, has a weight matrix sized for an input vector of one fixed length. If we train on images of varying spatial dimensions, the flattened vector changes length from image to image and no longer matches that weight matrix, which leads to dimension mismatch errors. To avoid this, we have to use fixed image dimensions. Resizing and cropping images is not always ideal, because recognition/classification accuracy can suffer as a result of content loss or distortion.

The spatial bins that we use here (`16, 4, 1`) have sizes proportional to the image size, so the number of bins is fixed regardless of image size. This removes the size constraint, i.e., no cropping or resizing of the original images is needed. This significantly increases efficiency, because we no longer have to preprocess images to a fixed height and width, and we always get a fixed-length flattened vector regardless of the input image size.
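A small sketch makes the dimension mismatch tangible. The total stride of 16, the 256 channels, and the 10 output units are hypothetical numbers chosen for illustration, and `flattened_length` and `fc_forward` are hypothetical helper names.

```python
import numpy as np

def flattened_length(height, width, total_stride=16, channels=256):
    # The convolutional part works on any input size: its output is roughly
    # (height / stride) x (width / stride) x channels.
    return (height // total_stride) * (width // total_stride) * channels

# A fully connected layer built for 224 x 224 inputs has a fixed weight matrix.
fc_input_length = flattened_length(224, 224)
fc_weights = np.zeros((fc_input_length, 10), dtype=np.float32)

def fc_forward(height, width):
    # Stand-in for the flattened conv output of an image of this size.
    flat = np.zeros((1, flattened_length(height, width)), dtype=np.float32)
    return flat @ fc_weights

print(fc_input_length)               # 50176
print(fc_forward(224, 224).shape)    # (1, 10)
print(flattened_length(320, 240))    # 76800

try:
    fc_forward(320, 240)
except ValueError as err:
    # 76800 inputs cannot be multiplied with a 50176-row weight matrix.
    print("dimension mismatch:", type(err).__name__)
```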

In the above image, the input to RoI pooling is a `2000×5` tensor of RoIs selected with selective search, together with a feature map of depth 512. The 512 is the depth (number of filters) of the previous convolutional layer. Why 5 values per RoI? Each row is [`batch_index, x_min, y_min, x_max, y_max`], where `batch_index` identifies the image in the batch that the RoI belongs to. These RoIs are combined with the CNN feature maps (shown with a depth of 512 in the above image), which means the RoI coordinates are scaled by a factor of 1/k, where k is the total spatial downsampling that has occurred between layer_1 and the present layer (say the width goes from 1000 to 100, so the downsampling ratio is 10). Each scaled RoI of size h x w is then divided into an H x W grid of sub-windows of approximately h/H x w/W each, and max pooling is applied within each sub-window. This is a single-level spatial pyramid pooling, performed on all 2000 RoIs. This part was a bit confusing to me at first, but this snippet from Ross Girshick’s original Caffe implementation made things clearer.

Original GitHub repo. The spatial scale used in the original code is 1/16.
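The RoI pooling step described above can be sketched in NumPy as follows. This is a simplified illustration, not the original Caffe layer: the function name `roi_pool`, the rounding of scaled coordinates, and the channels-last layout are assumptions, while the 1/16 spatial scale and the `[batch_index, x_min, y_min, x_max, y_max]` row format follow the text.

```python
import math
import numpy as np

def roi_pool(feature_map, rois, pooled_h=7, pooled_w=7, spatial_scale=1.0 / 16):
    """feature_map: (batch, feat_h, feat_w, depth).
    rois: (num_rois, 5) rows of [batch_index, x_min, y_min, x_max, y_max]
    in input-image pixels. Returns (num_rois, pooled_h, pooled_w, depth)."""
    _, feat_h, feat_w, depth = feature_map.shape
    out = np.zeros((len(rois), pooled_h, pooled_w, depth), dtype=feature_map.dtype)
    for n, (batch_index, x_min, y_min, x_max, y_max) in enumerate(rois):
        b = int(batch_index)
        # Project the RoI from image coordinates onto the feature map.
        x0 = math.floor(x_min * spatial_scale + 0.5)
        y0 = math.floor(y_min * spatial_scale + 0.5)
        x1 = math.floor(x_max * spatial_scale + 0.5)
        y1 = math.floor(y_max * spatial_scale + 0.5)
        roi_h = max(y1 - y0 + 1, 1)
        roi_w = max(x1 - x0 + 1, 1)
        for ph in range(pooled_h):
            # Sub-window rows of roughly roi_h / pooled_h, clipped to the map.
            h_start = min(max(y0 + math.floor(ph * roi_h / pooled_h), 0), feat_h)
            h_end = min(max(y0 + math.ceil((ph + 1) * roi_h / pooled_h), 0), feat_h)
            for pw in range(pooled_w):
                w_start = min(max(x0 + math.floor(pw * roi_w / pooled_w), 0), feat_w)
                w_end = min(max(x0 + math.ceil((pw + 1) * roi_w / pooled_w), 0), feat_w)
                if h_end > h_start and w_end > w_start:
                    window = feature_map[b, h_start:h_end, w_start:w_end]
                    out[n, ph, pw] = window.max(axis=(0, 1))
                # Empty sub-windows are left as zeros.
    return out

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(1, 38, 50, 512)).astype(np.float32)
rois = np.array([[0, 64, 32, 320, 240],
                 [0, 0, 0, 799, 599]], dtype=np.float32)
print(roi_pool(feature_map, rois).shape)  # (2, 7, 7, 512)
```

Whatever the size of each RoI, the output has the same `pooled_h x pooled_w x depth` shape, which is what lets the fully connected layers that follow use a fixed input length.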

Once we have a tensor of dimensions [`batch_size, width, height, depth`] from RoI pooling, we flatten it and add a couple of fully connected layers. The resulting feature vector is then used to compute both the classification loss and the bounding box regression loss.


Fast R-CNN achieves a giant leap in accuracy and a large reduction in computation time for training and inference. This comes from its smart use of the RoI pooling layer, which lets it train on the entire image instead of cropping multiple RoIs and processing them separately. Here, all parameters, including those of the CNN feature extractor, are trained together, with a cross-entropy loss for classification and a smooth L1 loss for bounding box prediction. Besides this change, Fast R-CNN is very similar in its architecture to its predecessor.
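As a rough sketch of how the two losses combine, the NumPy code below adds a cross-entropy term for classification to a smooth L1 term for the box outputs. The function names, the weighting parameter `lam`, the convention that label 0 means background (so only foreground RoIs contribute to the box term), and the averaging over RoIs are assumptions made for illustration.

```python
import numpy as np

def smooth_l1(x):
    # Quadratic near zero, linear for |x| >= 1, so outliers are penalised less.
    abs_x = np.abs(x)
    return np.where(abs_x < 1.0, 0.5 * x ** 2, abs_x - 0.5)

def multi_task_loss(class_probs, box_outputs, labels, box_targets, lam=1.0):
    """class_probs: (N, K + 1) softmax outputs, column 0 = background (assumed).
    box_outputs:  (N, K, 4) predicted box values per object class.
    labels:       (N,) integer class labels in [0, K].
    box_targets:  (N, 4) regression targets for each RoI's true class."""
    rows = np.arange(len(labels))
    # Cross-entropy on the probability assigned to the true class.
    cls_loss = -np.log(class_probs[rows, labels] + 1e-12)
    # Pick the box prediction for the true class; background rows are masked out.
    foreground = labels > 0
    picked = box_outputs[rows, np.maximum(labels - 1, 0)]
    loc_loss = smooth_l1(picked - box_targets).sum(axis=1) * foreground
    return float((cls_loss + lam * loc_loss).mean())

# Two RoIs, K = 1 object class: the first is background, the second is an object.
class_probs = np.array([[0.5, 0.5], [0.25, 0.75]])
box_outputs = np.zeros((2, 1, 4))
labels = np.array([0, 1])
box_targets = np.array([[0.0, 0.0, 0.0, 0.0], [0.5, 0.0, 0.0, 2.0]])
print(multi_task_loss(class_probs, box_outputs, labels, box_targets))
```

Because both terms are computed from the same RoI feature vector, gradients from classification and localization flow back through the shared network together.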

Like every architecture, Fast R-CNN is not perfect and has its own limitations: it still requires region proposals, which are not trainable. So, essentially, it is not an end-to-end trainable architecture. In our next post, we will look at how researchers made the proposal step trainable in the next version, which they aptly called Faster R-CNN.