Source: Deep Learning on Medium
Understanding Object Detection
From the history of object detection down to the inner workings of the famous Faster-RCNN
In this article, I want to give you an overview of the history of object detectors and explain how the architectures evolved into the current state-of-the-art detectors. Furthermore, I will go into detail about the inner workings of Faster R-CNN, since it is very widely used and also part of the TensorFlow Object Detection API.
Brief History of object detectors
Object detection combines the tasks of object classification and localization. Current object detectors can be divided into two categories: Networks separating the tasks of determining the location of objects and their classification, where Faster R-CNN is one of the most famous ones, and networks which predict bounding boxes and class scores at once, with the YOLO and SSD networks being famous architectures.
The first deep neural network for object detection was OverFeat. Its authors introduced a multi-scale sliding-window approach using CNNs and showed that training for object detection also improved image classification. It was shortly followed by R-CNN: Regions with CNN features. The authors proposed a model that used selective search to generate region proposals by merging similar pixels into regions. Each region was fed into a CNN, which produced a high-dimensional feature vector. This vector was then used for the final classification and bounding box regression, as shown in Figure 1.
It outperformed the OverFeat network by a large margin but was also very slow: proposal generation with selective search was time-intensive, and every single proposal had to be fed through the CNN. A more sophisticated approach, Fast R-CNN, also generated region proposals with selective search but fed the whole image through a CNN only once. The region proposals were pooled directly on the resulting feature map by ROI pooling, and the pooled feature vectors were fed into a fully connected network for classification and regression, as depicted in Figure 2. Like R-CNN, however, Fast R-CNN still relied on selective search for the region proposals.
Faster R-CNN addressed this issue by proposing a novel region proposal network that was fused with the Fast R-CNN architecture to drastically speed up the process; it will be explained in greater detail in the next section. Another approach to detecting objects in images was R-FCN, the region-based fully convolutional network, which used position-sensitive score maps instead of a per-region subnetwork.
The design of object detection networks was revolutionized by the YOLO network. It follows a completely different approach to the aforementioned models and predicts class scores and bounding boxes at once. The proposed model divided the image into a grid, where each cell predicted a confidence score for an object being present along with the corresponding bounding box coordinates. This allowed YOLO to make real-time predictions. The authors also released the follow-ups YOLO9000 and YOLOv2, where the former was capable of predicting over 9,000 categories and the latter improved accuracy by, among other things, training on images of multiple scales. Another network that predicts classes and bounding boxes at once is the single shot detector, SSD. It is comparable to YOLO but uses multiple aspect ratios per grid cell and additional convolutional feature maps to improve prediction.
Faster R-CNN
The major problem of R-CNN and Fast R-CNN was the time- and resource-intensive generation of region proposals. Faster R-CNN solved this by fusing Fast R-CNN with a region proposal network (RPN), as depicted in Figure 3. The RPN takes the output of the CNN as input and generates region proposals, where each proposal consists of an objectness score as well as the object's location. The RPN can be trained jointly with the detection network, speeding up both training and inference. Faster R-CNN is up to 34 times faster than Fast R-CNN. In the following paragraphs, each step of Faster R-CNN is explained in greater detail.
Anchors
The objective of Faster R-CNN is to detect objects as rectangular bounding boxes. These rectangles can be of varying size and scale. When previous works tried to consider objects of varying size and scale, they either created image pyramids, where multiple sizes of the image were considered, or pyramids of filters, where multiple differently sized convolutional filters were applied [1, 9, 10]. These approaches worked on the input image, whereas Faster R-CNN works on the feature map from the output of a CNN and thus creates a pyramid of anchors. An anchor is a fixed bounding box that consists of a center point, a specific width and height, and references a bounding box on the original image. A set of anchors, consisting of multiple anchors with different combinations of sizes and scales, is generated for each position of a sliding window on the feature map. Figure 4 shows an example, where a window of size 3×3 generates k anchors, each sharing the same center point in the original image. This is possible because the convolution operation is translation-equivariant, so a position on the feature map can be mapped back to a region in the image.
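The anchor construction above can be sketched in a few lines. This is an illustrative implementation, not the one from the paper or the TensorFlow API: the scales (128, 256, 512), ratios (0.5, 1, 2), and stride of 16 are common choices for a VGG-16 backbone, and the function names are my own.

```python
def generate_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors centered on (cx, cy).

    Each anchor is an (x1, y1, x2, y2) box in image coordinates.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the anchor area roughly s*s while varying the aspect ratio:
            # w/h == r and w*h == s*s.
            w = s * (r ** 0.5)
            h = s / (r ** 0.5)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def feature_map_anchors(fm_w, fm_h, stride=16):
    """Slide over every feature-map cell and build its anchor set,
    mapping the cell back to image coordinates via the stride."""
    all_anchors = []
    for y in range(fm_h):
        for x in range(fm_w):
            all_anchors.extend(
                generate_anchors(x * stride + stride // 2,
                                 y * stride + stride // 2))
    return all_anchors
```

With 3 scales and 3 ratios this yields k = 9 anchors per position, so a 4×3 feature map produces 108 anchors in total.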
Region Proposal Network
The RPN is a fully convolutional network that works directly on the feature map in order to generate region proposals, as depicted in Figure 5. It takes the anchors as input, predicts an objectness score, and performs box regression. The former is the likelihood of an anchor being an object rather than background, and the latter corresponds to the offset from the anchor to the actual box. Therefore, for k anchors, the RPN predicts 2k scores and 4k box regression values. The initial number of anchors can be large, larger even than that of other region proposal methods, since the RPN reduces their number drastically by only considering regions with a high objectness score.

Training the RPN is not trivial. Since this is a supervised learning approach, each anchor has to be labeled as either foreground or background. Therefore, every anchor is compared to every ground-truth object by calculating their intersection over union (IoU). An anchor is considered foreground (positive) if its IoU with some ground-truth object is greater than 0.7. It is considered background (negative) if its IoU with every ground-truth box is lower than 0.3. All anchors with an IoU between 0.3 and 0.7 are ignored. The distribution of positive and negative proposals is very imbalanced, because there are far more negative than positive proposals per ground-truth box. Therefore, a minibatch with a fixed number of positives and negatives is sampled for the training process. If there are not enough positive proposals, the batch is filled with the proposals that have the highest IoU with the respective ground-truth boxes; if there are still not enough positives, the batch is padded with negatives.

The RPN employs a multitask loss to optimize both objectives simultaneously: a binary cross-entropy loss for classification and a smooth L1 loss for bounding box regression.

Figure 5: Architecture of the region proposal network, which is a fully convolutional network.
The regression loss is only calculated for positive anchors. Non-maximum suppression (NMS) is applied after prediction to remove region proposals that overlap another proposal with an IoU above a certain threshold but have a lower objectness score. After NMS, the top N proposals are selected as the final region proposals.
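A minimal sketch of this greedy NMS step, assuming proposals come as (score, box) pairs; the `iou` helper is a standard box-IoU computation, and the threshold of 0.7 is illustrative:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(proposals, iou_thr=0.7, top_n=None):
    """proposals: list of (score, (x1, y1, x2, y2)).

    Greedily keep proposals in order of decreasing score; drop any
    proposal whose IoU with an already-kept one exceeds iou_thr.
    """
    kept = []
    for score, box in sorted(proposals, key=lambda p: p[0], reverse=True):
        if all(iou(box, kept_box) <= iou_thr for _, kept_box in kept):
            kept.append((score, box))
    return kept[:top_n] if top_n is not None else kept
```

So of two heavily overlapping proposals, only the one with the higher objectness score survives, while distant proposals are untouched.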
Region of interest pooling
The next step after the RPN is to use the region proposals to predict object classes and locations. Instead of the approach of R-CNN, where each proposal is fed through a classification network, Faster R-CNN reuses the feature map to extract the features. Since the classifier expects a fixed-size input, fixed-size features are extracted for each region proposal from the feature map. In modern implementations, the feature map is cropped to the region proposal and resized to a fixed size; max pooling then extracts the most salient features, leading to a fixed-size representation for each region proposal, which is fed into the final stage.
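The crop-and-pool idea can be sketched for a single feature-map channel. This is a simplified, NumPy-only illustration assuming the ROI is already given in integer feature-map coordinates; real implementations handle sub-pixel coordinates, multiple channels, and batching.

```python
import numpy as np


def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Crop one channel of a feature map to an ROI and max-pool it
    into a fixed output_size grid.

    feature_map: 2-D array.
    roi: (x1, y1, x2, y2) in integer feature-map coordinates.
    """
    x1, y1, x2, y2 = roi
    crop = feature_map[y1:y2, x1:x2]
    h, w = crop.shape
    out_h, out_w = output_size
    pooled = np.zeros(output_size)
    for i in range(out_h):
        for j in range(out_w):
            # Split the crop into a roughly even out_h x out_w grid of bins
            # and take the maximum inside each bin.
            ys, ye = i * h // out_h, max((i + 1) * h // out_h, i * h // out_h + 1)
            xs, xe = j * w // out_w, max((j + 1) * w // out_w, j * w // out_w + 1)
            pooled[i, j] = crop[ys:ye, xs:xe].max()
    return pooled
```

Whatever the ROI's size, the output is always out_h × out_w, which is what lets a fully connected classifier consume proposals of arbitrary shape.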
Classification and Regression
The last step in Faster R-CNN is the classification of the extracted features. The pooled features are flattened and fed into two fully connected layers, which handle classification and regression. The classification layer outputs N+1 predictions, one for each of the N classes plus one background class. The regression layer outputs 4N predictions, where each group of four represents the regressed bounding box for one of the N classes. Training the R-CNN module is comparable to training the RPN: a proposal is assigned to a ground-truth box if their IoU is greater than 0.5, proposals with an IoU between 0.1 and 0.5 are treated as negatives (background), and proposals with an IoU lower than 0.1 are ignored. Random sampling during training creates a mini-batch containing 25 percent foreground and 75 percent background proposals. The classification loss is a multiclass cross-entropy loss over all proposals in the mini-batch, whereas the localization loss only uses positive proposals. To remove duplicates, class-wise NMS is applied. The final output is a list of all predicted objects with a probability higher than a specific threshold, typically 0.5.
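The 4N regression values are usually interpreted with the standard R-CNN box parameterization: offsets for the proposal's center, relative to its width and height, plus log-scale factors for width and height. A sketch of decoding one such group of four values (the (tx, ty, tw, th) naming is the conventional one, not something defined in this post):

```python
import math


def apply_deltas(box, deltas):
    """Decode regression outputs into a refined box.

    box: (x1, y1, x2, y2) proposal; deltas: (tx, ty, tw, th).
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2
    tx, ty, tw, th = deltas
    # Shift the center by a fraction of the box size, and rescale
    # width/height exponentially so sizes stay positive.
    new_cx, new_cy = cx + tx * w, cy + ty * h
    new_w, new_h = w * math.exp(tw), h * math.exp(th)
    return (new_cx - new_w / 2, new_cy - new_h / 2,
            new_cx + new_w / 2, new_cy + new_h / 2)
```

Zero deltas leave the proposal unchanged, while e.g. tx = 0.5 shifts the box right by half its width; the network learns to predict these deltas per class.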
There are many different ways to detect objects in images. This blog post traced the history from slow networks using selective search for generating region proposals to more sophisticated networks such as Faster R-CNN. If you want to get started with object detection, I recommend the TensorFlow Object Detection API, which mainly features the Faster R-CNN and SSD architectures. Thanks for reading!