Facebook AI’s DETR Applies Transformers to CV Tasks

Original article was published on Deep Learning on Medium

Transformers are a deep learning architecture that has gained popularity in recent years, particularly for sequential-data problems in natural language processing (NLP) such as language modelling and machine translation. Transformers have also been extended to tasks such as speech recognition, symbolic mathematics, and reinforcement learning.

To push the ‘Transformer revolution’ into the computer vision field, Facebook this week released Detection Transformers (DETR), a new approach for object detection and panoptic segmentation tasks that uses a completely different architecture than previous object detection systems.

“We present a new method that views object detection as a direct set prediction problem,” explains the Facebook research team. “Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task.”

DETR comprises a set-based global loss that forces unique predictions via bipartite matching and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, it can reason about the relations of the objects and the global image context to directly output the final set of predictions in parallel.
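The bipartite matching step can be illustrated with a small sketch. This is not the authors' implementation: the cost below is a simplified assumption (negative class probability plus L1 box distance), the function names `match_cost` and `bipartite_match` are hypothetical, and a brute-force search over permutations stands in for the Hungarian algorithm that a practical matcher would use. It only shows how each ground-truth object ends up paired with exactly one prediction, which is what forces unique predictions.

```python
from itertools import permutations


def match_cost(pred, target):
    """Simplified pairwise cost (an assumption for illustration):
    negative probability of the target class plus L1 distance between boxes."""
    class_cost = -pred["probs"][target["label"]]
    box_cost = sum(abs(p - t) for p, t in zip(pred["box"], target["box"]))
    return class_cost + box_cost


def bipartite_match(preds, targets):
    """Return {target index: prediction index} with minimal total cost.

    Brute force over all one-to-one assignments; suitable for toy inputs only.
    """
    best_cost, best_assignment = float("inf"), ()
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assignment = cost, perm
    return {t: p for t, p in enumerate(best_assignment)}


# Three predictions (one per object query) and two ground-truth objects.
# Boxes are (cx, cy, w, h) in normalized coordinates; probs are indexed by class id.
preds = [
    {"probs": [0.1, 0.9], "box": (0.50, 0.50, 0.20, 0.20)},
    {"probs": [0.8, 0.2], "box": (0.20, 0.30, 0.10, 0.10)},
    {"probs": [0.5, 0.5], "box": (0.90, 0.90, 0.05, 0.05)},
]
targets = [
    {"label": 0, "box": (0.22, 0.31, 0.10, 0.12)},
    {"label": 1, "box": (0.48, 0.52, 0.20, 0.18)},
]

matches = bipartite_match(preds, targets)
# Predictions left unmatched would be trained to predict "no object".
unmatched = sorted(set(range(len(preds))) - set(matches.values()))
print(matches, unmatched)
```

Because every ground-truth object is assigned to a different prediction, two queries cannot both be rewarded for detecting the same object, which is why a separate duplicate-removal step such as non-maximum suppression is not needed.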

Unlike many other modern detectors, the new model is conceptually simple and does not require a specialized library. When tested on the COCO object detection data set, DETR performs on par with the well-established and highly optimized Faster R-CNN baseline in both accuracy and run time.

It’s been over four years since Faster R-CNN was proposed as a SOTA approach in object detection, and newer methods, including detectors built on the recently introduced ResNeSt backbone, have since achieved far better results. DETR’s novelty therefore lies primarily in achieving results comparable to an optimized Faster R-CNN with a simpler architecture.

Although DETR achieves significantly better performance than Faster R-CNN on large objects, it still struggles with small objects, a shortcoming the researchers plan to address in future work.

DETR’s design is not only straightforward to implement, it can also be easily extended to panoptic segmentation with competitive results, say the researchers. The team hopes to help improve the interpretability of computer vision models by applying Transformers to object detection tasks.

The paper End-to-End Object Detection with Transformers is on arXiv.