FCOS: Fully Convolutional One-Stage Object Detection

Source: Deep Learning on Medium


Object detection is one of the areas of deep learning where people have long been trying to come up with a generalized detection algorithm, so far without success. Still, the state-of-the-art algorithms proposed in recent times are far superior to their predecessors: the quest for a precision-recall balance and higher IoU scores has yielded better results every time. We have been following the popular Feature Pyramid Networks and their variants for about two years, and the region proposal network with a backbone CNN has shown tremendous progress. Though the architecture is good, it is expensive… very expensive. Why not try another approach? Do everything in one stage, instead of a two-stage process like Faster R-CNN (which is state-of-the-art, by the way).

FCOS is a method proposed by Tian et al. in their ICCV 2019 paper which, analogous to semantic segmentation, predicts objects in images in a per-pixel fashion. FCOS does not use anchor boxes, thereby reducing the number of computations. The method revolves around the concept of center-ness, a per-location score that indicates how close the location is to the center of its object. Unlike anchor-based detectors with k anchors per location, FCOS does not output "4k" box coordinates; the number of detections depends on the center-ness threshold and the number of objects present.

FCOS works by predicting a 4D vector (l, t, r, b) encoding the location of a bounding box at each foreground pixel, supervised by ground-truth bounding box information during training. When a location resides in multiple bounding boxes, however, it can be ambiguous which bounding box that location should regress to.
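The (l, t, r, b) target is simply the distance from a location to the four sides of its ground-truth box. A minimal sketch (function and variable names are illustrative, not from the paper's code):

```python
def ltrb_targets(location, box):
    """Regression target (l, t, r, b) for a location (x, y) inside a
    ground-truth box (x1, y1, x2, y2): distances to the left, top,
    right and bottom edges."""
    x, y = location
    x1, y1, x2, y2 = box
    return (x - x1, y - y1, x2 - x, y2 - y)

print(ltrb_targets((50, 40), (10, 20, 110, 90)))  # (40, 20, 60, 50)
```

All four components are positive exactly when the location lies strictly inside the box, which is what makes it a foreground sample.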

It makes use of a Feature Pyramid Network with feature levels {P3, P4, P5, P6, P7} and per-location class labels c*. If a location falls into multiple bounding boxes, it is considered an ambiguous sample; FCOS simply chooses the bounding box with minimal area as its regression target.
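That tie-breaking rule is a one-liner. A sketch of the minimal-area choice (names are illustrative):

```python
def box_area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)

def resolve_ambiguity(boxes):
    """Among the ground-truth boxes containing a location, pick the
    one with minimal area as the regression target."""
    return min(boxes, key=box_area)

boxes = [(0, 0, 100, 100), (20, 20, 60, 60)]
print(resolve_ambiguity(boxes))  # (20, 20, 60, 60): the smaller box wins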

Network Outputs

Instead of a multi-class classifier, C binary classifiers are used, where C is the number of classes. Four convolutional layers are added after the backbone network's feature maps, for the classification and regression branches respectively. exp(x) is used on top of the regression branch to map any real number to a positive value, since regression targets are always positive. It is worth noting that FCOS has 9× fewer network output variables than the popular anchor-based detectors with k anchor boxes per location.
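As a back-of-the-envelope check of that 9× claim, assuming k = 9 anchors and counting only the classification and box outputs (whether the center-ness output is included in the bookkeeping is my assumption):

```python
def fcos_outputs_per_location(num_classes):
    """C binary classification scores plus 4 box distances (l, t, r, b)."""
    return num_classes + 4

def anchor_outputs_per_location(num_classes, k=9):
    """k anchors per location, each with C scores and 4 box offsets."""
    return k * (num_classes + 4)

C = 80  # e.g. the COCO class count
print(anchor_outputs_per_location(C) / fcos_outputs_per_location(C))  # 9.0
```

The ratio is exactly k for any C, which is where the 9× figure comes from when k = 9.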

Loss Function

Lcls is the focal loss and Lreg is the IoU loss. Npos denotes the number of positive samples and the balance weight λ is 1. The summation is calculated over all locations on the feature maps Fi.
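Written out with the notation above (as reconstructed from the surrounding definitions; 1{·} is the indicator that a location is a positive sample, i.e. c* > 0), the total loss is:

```latex
L\bigl(\{p_{x,y}\},\{t_{x,y}\}\bigr)
  = \frac{1}{N_{\mathrm{pos}}}\sum_{x,y} L_{\mathrm{cls}}\bigl(p_{x,y},\,c^{*}_{x,y}\bigr)
  + \frac{\lambda}{N_{\mathrm{pos}}}\sum_{x,y}
    \mathbb{1}_{\{c^{*}_{x,y}>0\}}\;L_{\mathrm{reg}}\bigl(t_{x,y},\,t^{*}_{x,y}\bigr)
```

Only positive locations contribute to the regression term, while every location contributes to the classification term.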

In the inference phase, an image is given to the network, and classification scores px,y and regression predictions tx,y are obtained for each location on the feature maps Fi.
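Turning a predicted (l, t, r, b) vector back into a box is just the inverse of the target computation, assuming the location has already been mapped back to image coordinates (a sketch with illustrative names):

```python
def decode_box(location, ltrb):
    """Invert the (l, t, r, b) encoding at an image-plane location
    (x, y) to recover a box (x1, y1, x2, y2)."""
    x, y = location
    l, t, r, b = ltrb
    return (x - l, y - t, x + r, y + b)

print(decode_box((50, 40), (40, 20, 60, 50)))  # (10, 20, 110, 90)
```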

Usually, the large stride (e.g., 16×) of the final feature map of a CNN can result in a lower Best Possible Recall (BPR), but in the FPN-based FCOS it really doesn't matter at all. In anchor-based detectors, these issues are compensated for by lowering the IoU score required for positive anchor boxes.

The detection of objects of different sizes is done at different levels of the feature maps. Specifically, five levels are used: {P3, P4, P5, P6, P7}. P3, P4 and P5 are produced from the backbone CNN's feature maps C3, C4 and C5, each followed by a 1 × 1 convolutional layer with top-down connections. P6 and P7 are produced by applying one convolutional layer with stride 2 on P5 and P6, respectively. As a result, the feature levels P3, P4, P5, P6 and P7 have strides 8, 16, 32, 64 and 128, respectively.
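Given a level's stride s, each feature-map cell (x, y) corresponds to a point near the center of its receptive field on the image plane. A sketch of that mapping, following the formula in the FCOS paper:

```python
def to_image_coords(fx, fy, stride):
    """Map a feature-map cell (fx, fy) at a given stride s back to the
    image plane: (floor(s/2) + fx * s, floor(s/2) + fy * s)."""
    return (stride // 2 + fx * stride, stride // 2 + fy * stride)

print(to_image_coords(3, 5, 8))   # (28, 44) on P3
print(to_image_coords(0, 0, 16))  # (8, 8) on P4
```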

The range of bounding box regression at each level is directly limited: a location is set as a negative sample if max(l∗, t∗, r∗, b∗) > mi or max(l∗, t∗, r∗, b∗) < mi−1. Here mi is the maximum distance that feature level i needs to regress. In this work, m2, m3, m4, m5, m6 and m7 are set as 0, 64, 128, 256, 512 and ∞, respectively. Since objects of different sizes are assigned to different feature levels, and most overlap happens between objects of considerably different sizes, multi-level prediction resolves most of the ambiguity. If a location, even with multi-level prediction used, is still assigned to more than one ground-truth box, we simply choose the ground-truth box with minimal area as its target.
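Using the mi values above, the level a target lands on is determined by max(l∗, t∗, r∗, b∗). A sketch of that assignment (function name is illustrative):

```python
import math

def assign_level(ltrb, bounds=(0, 64, 128, 256, 512, math.inf)):
    """Assign a regression target to the FPN level i in 3..7 whose
    range (m_{i-1}, m_i] contains max(l, t, r, b); None means the
    location is a negative sample at every level."""
    m = max(ltrb)
    for level, (lo, hi) in enumerate(zip(bounds[:-1], bounds[1:]), start=3):
        if lo < m <= hi:
            return level
    return None

print(assign_level((40, 20, 60, 50)))      # max 60  -> level 3, range (0, 64]
print(assign_level((100, 300, 200, 150)))  # max 300 -> level 6, range (256, 512]
```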

There is still a performance gap between FCOS and anchor-based detectors due to a lot of low-quality predicted bounding boxes produced by locations far away from the center of an object.

To resolve this issue, a single-layer branch is added in parallel with the classification branch to predict the center-ness of a location. The center-ness depicts the normalized distance from the location to the center of the object it is responsible for.

The center-ness target ranges from 0 to 1 and is thus trained with a binary cross-entropy (BCE) loss, which is added to the total loss function. A square root is used to slow down the decay of center-ness.
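Concretely, the center-ness target is built from the same (l∗, t∗, r∗, b∗) distances, with the square root applied on top:

```python
import math

def centerness(ltrb):
    """Center-ness target as defined in the FCOS paper:
    sqrt( min(l, r)/max(l, r) * min(t, b)/max(t, b) ).
    Equals 1 at the exact center and decays toward 0 near the edges."""
    l, t, r, b = ltrb
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

print(centerness((40, 20, 40, 20)))  # 1.0: the location is dead-center
print(centerness((10, 20, 60, 50)))  # well below 1: an off-center location
```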

When testing, the final score (used for ranking the detected bounding boxes) is computed by multiplying the predicted center-ness with the corresponding classification score. The center-ness can thus down-weight the scores of bounding boxes far from the center of an object, so these low-quality boxes are very likely to be filtered out by the final non-maximum suppression (NMS) process, improving detection performance remarkably.
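The interaction between the two steps can be sketched end-to-end: down-weight scores by center-ness, then run a plain greedy NMS (a simplified illustration; the IoU threshold of 0.6 and all names are my own choices, not taken from the paper's code):

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.6):
    """Greedy NMS: keep the best-scoring box, drop heavy overlaps, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= thresh]
    return keep

boxes = [(10, 20, 110, 90), (12, 22, 108, 88)]
cls_scores = [0.9, 0.9]          # identical classification confidence
ctr = [0.95, 0.30]               # but very different center-ness
scores = [c * k for c, k in zip(cls_scores, ctr)]
keep = nms(boxes, scores)
print(keep)  # [0]: the off-center duplicate ranks lower and is suppressed
```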