The End of Anchors — Improving Object Detection Models and Annotations



The Problem

If you have ever needed to tinker with anchor boxes, you were probably frustrated, confused and saying to yourself, “There must be another way!” Well, it now appears that there is.

When I first wrote about anchor boxes, I had the idea of bypassing anchor boxes altogether and turning the object detection problem into a semantic segmentation problem. I tried predicting, for every pixel, how many object bounding boxes of each class contained it (a sketch of that target appears after the list below). This gave rather poor results due to several problems:

1. I was unable to turn the segmentation masks back into bounding boxes. This meant that I could not isolate or count objects.

2. The neural network struggled to learn which pixels surrounding the object belonged to the bounding box. This was visible during training, where the model would first segment the object and only then begin to form a rectangle around it.
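To make the per-pixel idea concrete, here is a minimal sketch of how such a count target could be built from box annotations. This is not the code I used at the time; the function name and the (class_id, x1, y1, x2, y2) box format are just illustrative assumptions.

```python
import numpy as np

def boxes_to_count_map(boxes, num_classes, height, width):
    """Build a per-pixel target: for each class, how many boxes cover each pixel.

    boxes: iterable of (class_id, x1, y1, x2, y2) in pixel coordinates.
    Returns an array of shape (num_classes, height, width).
    """
    target = np.zeros((num_classes, height, width), dtype=np.float32)
    for class_id, x1, y1, x2, y2 in boxes:
        # Clip to image bounds and increment every pixel inside the box.
        x1, x2 = max(0, int(x1)), min(width, int(x2))
        y1, y2 = max(0, int(y1)), min(height, int(y2))
        target[class_id, y1:y2, x1:x2] += 1.0
    return target

# Example: two overlapping boxes of class 0 in a 100x100 image.
counts = boxes_to_count_map([(0, 10, 10, 60, 60), (0, 40, 40, 90, 90)],
                            num_classes=3, height=100, width=100)
print(counts[0, 50, 50])  # 2.0 where the two boxes overlap
```

Turning a map like this back into individual boxes is exactly the part I never managed to solve, which is problem 1 above.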

Examples from the Common Objects in Context (COCO) dataset¹

Fast forward a few months and there are already several models that have done away with anchor boxes in a much more innovative way.

Anchorless Object Detection

CornerNet² predicts, for every pixel, how likely it is to be the top-left or bottom-right corner of a bounding box, along with an embedding for each corner. Corners whose embeddings are close are grouped as belonging to the same object, and once the corners are matched it is trivial to recover the bounding box. This solves my first problem: anchor boxes are gone, yet bounding boxes can still be recovered from the output.
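As a rough sketch of the grouping step, assume each detected corner comes with a score and a one-dimensional embedding; pairs whose embeddings nearly match form a box. The data layout and threshold below are my own simplifications; the real implementation works on dense heatmaps and applies non-maximum suppression.

```python
import itertools

def group_corners(top_left, bottom_right, max_embed_dist=0.5):
    """Pair top-left and bottom-right corner candidates with nearly matching embeddings.

    Each candidate is a dict like {"x": ..., "y": ..., "score": ..., "embed": ...}.
    Returns a list of (box, score) tuples, where box = (x1, y1, x2, y2).
    """
    detections = []
    for tl, br in itertools.product(top_left, bottom_right):
        # A valid box needs the bottom-right corner below and to the right
        # of the top-left corner, and the two embeddings must nearly agree.
        if br["x"] <= tl["x"] or br["y"] <= tl["y"]:
            continue
        if abs(tl["embed"] - br["embed"]) > max_embed_dist:
            continue
        box = (tl["x"], tl["y"], br["x"], br["y"])
        detections.append((box, (tl["score"] + br["score"]) / 2))
    return detections
```

A scalar embedding is enough for this purpose because only the relative distance between embeddings matters, not their absolute values.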

CornerNet, however, did not solve my second problem: the network still has to locate corner pixels that do not even lie on the object. This is where ExtremeNet³ comes in. The approach builds on CornerNet, but instead of predicting corners it predicts the center of each object along with its left-most, right-most, top-most and bottom-most points. These “extreme points” are then matched based purely on their geometry, giving excellent results:
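In simplified form, the geometric grouping boils down to a brute-force check: for every combination of candidate extreme points, the point halfway between the left/right and top/bottom extremes must land on a strong center response. The sketch below is my own condensed take on that idea, not the released code, and the 0.1 threshold is an arbitrary assumption.

```python
import itertools

def group_extreme_points(tops, lefts, bottoms, rights, center_heatmap, center_thresh=0.1):
    """Brute-force grouping of extreme-point candidates, in the spirit of ExtremeNet.

    Each candidate is an (x, y, score) tuple; center_heatmap is a 2-D array
    indexed as [y, x]. A combination is kept only if the point halfway between
    the left/right and top/bottom extremes lands on a strong center response.
    """
    detections = []
    for t, l, b, r in itertools.product(tops, lefts, bottoms, rights):
        # Geometric sanity check: the extremes must lie on the correct sides.
        if t[1] > b[1] or l[0] > r[0]:
            continue
        cx, cy = int((l[0] + r[0]) / 2), int((t[1] + b[1]) / 2)
        if center_heatmap[cy, cx] < center_thresh:
            continue
        box = (l[0], t[1], r[0], b[1])
        score = (t[2] + l[2] + b[2] + r[2] + center_heatmap[cy, cx]) / 5
        detections.append((box, score))
    return detections
```

Only a small number of heatmap peaks per class needs to be enumerated, which keeps the brute-force combination count manageable.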

Annotations

But how do I find extreme point annotations? Unfortunately, the authors extracted the extreme points from segmentation masks, an option that is not available if you only have access to bounding boxes. However, extreme point annotations have several advantages:

1. It is much easier to label the extreme points of an object than its bounding box. To label a bounding box, you must eyeball where each corner falls so that it lines up with two extreme points the corner itself does not touch, and the annotator usually has to adjust the box afterwards anyway. The ExtremeNet paper estimates that labeling extreme points can take almost a fifth of the time it takes to label bounding boxes.

2. For the same reason that it is easier for a human annotator to label the extreme points, it should also be easier for a neural network to point out exactly where they are located. Currently, ExtremeNet appears to outperform other single-shot detectors.

3. Given the extreme points of an object, it is trivial to generate the bounding box (see the sketch after this list). If the image is also available, the DEXTR⁴ algorithm can be used to generate a segmentation mask. This makes extreme points much more versatile than bounding boxes.
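For instance, the conversion in point 3 really is a one-liner; the coordinates in this sketch are made up.

```python
def extremes_to_box(top, left, bottom, right):
    """Convert four extreme points, each an (x, y) pair, into an axis-aligned box.

    The box is simply (left.x, top.y, right.x, bottom.y).
    """
    return (left[0], top[1], right[0], bottom[1])

# Example: four clicked extreme points of one object.
print(extremes_to_box(top=(55, 10), left=(20, 40), bottom=(60, 95), right=(88, 50)))
# -> (20, 10, 88, 95)
```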

Not so Fast

Before you migrate all your object detection systems to ExtremeNet, keep in mind that the current implementation is big and slow. The pretrained model that achieved state-of-the-art results has close to 200 million parameters and takes up about 800 MB. When I test single images on a Tesla K80 GPU (a p2.xlarge EC2 instance), it takes about 8 seconds per image using the demo.py script from the GitHub page. Using DEXTR to create a segmentation mask adds another 2–12 seconds depending on the number of detections, and grouping detections from multiple image scales pushes the time up further still. In total it took up to a minute to run multi-scale segmentation on a single image, although I imagine these numbers could be improved by optimizing the code. CornerNet claims to run at around 4 fps on a Titan X (Pascal) GPU, which is also much slower than existing single-shot detectors. Before I would consider ExtremeNet a viable option for most applications, there would need to be a version that is both fast and accurate.

Conclusion

The ExtremeNet paper claims superior performance to all other single-shot detectors while doing away with carefully tuned anchor boxes and the decoding step they require. Even if there were no such model, the annotation method alone offers significant benefits. The model is still not learned completely end to end as a neural network; it relies on a variety of algorithms to group points and to generate bounding boxes or segmentation masks. Nonetheless, I have great hope that extreme point annotations and models will become the norm in object detection.

References

[1] Lin, Tsung-Yi et al. “Microsoft COCO: Common Objects in Context.” Lecture Notes in Computer Science (2014): 740–755. Crossref. Web.

[2] Law, Hei, and Jia Deng. “CornerNet: Detecting Objects as Paired Keypoints.” Lecture Notes in Computer Science (2018): 765–781. Crossref. Web.

[3] Zhou, Xingyi, Jiacheng Zhuo, and Philipp Krähenbühl. “Bottom-up Object Detection by Grouping Extreme and Center Points.” arXiv preprint arXiv:1901.08043 (2019).

[4] Maninis, K.-K. et al. “Deep Extreme Cut: From Extreme Points to Object Segmentation.” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018): n. pag. Crossref. Web.