DETR, Object detection with Transformer

Original article was published on Deep Learning on Medium

In May 2020, FAIR published DETR (DEtection TRansformer), the first object detection model to adopt a transformer as part of the detection structure. The paper, “End-to-End Object Detection with Transformers”, can be found here.

For those who aren’t familiar with Transformer, please check this article:


The architecture of DETR

The architecture of DETR has three main components: a CNN backbone that extracts a compact feature representation, an encoder-decoder transformer, and feed-forward networks (FFNs) that make the final predictions.

After the CNN extracts features, a 1×1 convolution reduces the channel dimension of its final output. Since the transformer is permutation invariant, fixed positional encodings are added to the features before they enter the transformer encoder.
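The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the DETR implementation: the function names, the tiny shapes, and the constant weights are all made up for the example, and the sine encoding is a simplified version of the usual fixed sinusoidal scheme.

```python
import math

def project_channels(feature_map, weight):
    """A 1x1 convolution is the same linear map applied at every pixel.

    feature_map: nested lists shaped [C_in][H][W]; weight: [C_out][C_in].
    """
    c_in = len(feature_map)
    h, w = len(feature_map[0]), len(feature_map[0][0])
    return [
        [[sum(weight[o][c] * feature_map[c][y][x] for c in range(c_in))
          for x in range(w)] for y in range(h)]
        for o in range(len(weight))
    ]

def flatten_to_sequence(feature_map):
    """Turn [C][H][W] into a sequence of H*W tokens, each of length C."""
    c = len(feature_map)
    h, w = len(feature_map[0]), len(feature_map[0][0])
    return [[feature_map[ch][y][x] for ch in range(c)]
            for y in range(h) for x in range(w)]

def sine_position(pos, dim, temperature=10000.0):
    """Fixed 1-D sinusoidal encoding of length dim (dim must be even)."""
    enc = []
    for i in range(0, dim, 2):
        angle = pos / (temperature ** (i / dim))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc

def positional_encoding_2d(h, w, dim):
    """Half of the channels encode the row, the other half the column.

    dim must be divisible by 4. Positions are listed in the same
    row-major order that flatten_to_sequence uses.
    """
    half = dim // 2
    return [sine_position(y, half) + sine_position(x, half)
            for y in range(h) for x in range(w)]

# Toy example: 3 backbone channels on a 2x3 grid, reduced to 4 channels.
features = [[[float(c + y + x) for x in range(3)] for y in range(2)]
            for c in range(3)]
weight = [[0.1] * 3 for _ in range(4)]  # hypothetical 1x1 conv weights

tokens = flatten_to_sequence(project_channels(features, weight))
pos = positional_encoding_2d(2, 3, 4)
encoder_input = [[t + p for t, p in zip(tok, pe)]
                 for tok, pe in zip(tokens, pos)]
```

Without the positional term, shuffling `tokens` would give the encoder exactly the same set of inputs, which is why the position has to be injected explicitly.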

The transformer decoder differs from the original one. Given N inputs, it decodes N outputs in parallel instead of decoding one element at a time. The final predictions are computed by a feed-forward network (FFN): the FFN predicts the normalized center coordinates, height, and width of each box, and a linear layer predicts the class using a softmax function.
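To make the output format concrete, here is a small sketch of how one prediction could be turned into a class label and a pixel-space box. The helper names, the class list, and the example numbers are assumptions for illustration only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    )

def decode_prediction(logits, box, img_w, img_h, class_names):
    """Pick the most probable class and rescale the box to the image size."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best], cxcywh_to_xyxy(box, img_w, img_h)

# Hypothetical output for one of the N slots; the last class stands for
# "no object", which lets a slot predict that it found nothing.
class_names = ["cat", "dog", "no object"]
label, score, pixel_box = decode_prediction(
    logits=[2.0, 0.5, 0.1],
    box=(0.5, 0.5, 0.2, 0.4),
    img_w=640,
    img_h=480,
    class_names=class_names,
)
```

Because the box is predicted in normalized coordinates, the same output can be rescaled to any image size.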

What’s New

Besides the transformer in its architecture, DETR also adopts two major components from previous research.

  • Bipartite Matching Loss
  • Parallel Decoding

Bipartite Matching Loss

The loss in DETR is summed over the prediction and ground-truth pairs produced by bipartite matching.
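For reference, the loss can be written roughly as follows, using the paper's notation: y_i = (c_i, b_i) is the i-th ground-truth class and box (padded with "no object" ∅ up to N entries), p̂ and b̂ are the predicted class probabilities and boxes, and σ̂ is the optimal one-to-one assignment. Treat this as a summary to be checked against the paper rather than a full derivation.

```latex
% Step 1: find the one-to-one assignment with the lowest total matching cost
\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N}
    \sum_{i=1}^{N} \mathcal{L}_{\text{match}}\bigl(y_i, \hat{y}_{\sigma(i)}\bigr)

% Step 2: sum the class and box losses over the matched pairs
\mathcal{L}_{\text{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \Bigl[
    -\log \hat{p}_{\hat{\sigma}(i)}(c_i)
    + \mathbb{1}_{\{c_i \neq \varnothing\}}\,
      \mathcal{L}_{\text{box}}\bigl(b_i, \hat{b}_{\hat{\sigma}(i)}\bigr)
\Bigr]
```

The indicator term means the box loss only counts for slots matched to a real object; slots matched to "no object" contribute only the classification term.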

Other object detection models assign labels to bounding boxes (or to points, as in Objects as Points) by matching multiple bounding boxes to one ground-truth box. DETR instead uses bipartite matching, which is one-to-one.

With one-to-one matching, DETR is able to significantly reduce low-quality predictions and to eliminate output-reduction steps such as NMS (non-maximum suppression).

The bipartite matching loss is built on the Hungarian algorithm. I won’t go into detail here; please check the paper for more information.
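To give a feel for what the matching step does, here is a toy sketch. Note the assumptions: it searches all permutations by brute force instead of running the Hungarian algorithm (fine for three slots, hopeless for large N), and the cost uses only the class probability and an L1 box distance, whereas the paper's box cost also includes a generalized IoU term. All names and numbers are made up for illustration.

```python
from itertools import permutations

NO_OBJECT = None  # hypothetical marker for padded "no object" targets

def l1(a, b):
    """L1 distance between two boxes given as (cx, cy, w, h)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def match_cost(target, pred):
    """Simplified cost of assigning one prediction to one target."""
    if target is NO_OBJECT:
        return 0.0  # padded slots do not influence the matching
    label, box = target
    return -pred["probs"][label] + l1(box, pred["box"])

def bipartite_match(targets, preds):
    """Return (assignment, cost); assignment[i] is the prediction index
    matched to target i. Assumes len(preds) >= len(targets)."""
    padded = list(targets) + [NO_OBJECT] * (len(preds) - len(targets))
    best_perm, best_cost = None, float("inf")
    # Brute force over every one-to-one assignment (stand-in for Hungarian).
    for perm in permutations(range(len(preds))):
        cost = sum(match_cost(t, preds[j]) for t, j in zip(padded, perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost

# Three prediction slots, two ground-truth objects (class index, box).
preds = [
    {"probs": [0.1, 0.8, 0.1], "box": (0.50, 0.50, 0.20, 0.20)},
    {"probs": [0.7, 0.2, 0.1], "box": (0.20, 0.30, 0.10, 0.10)},
    {"probs": [0.1, 0.1, 0.8], "box": (0.90, 0.90, 0.05, 0.05)},
]
targets = [
    (0, (0.21, 0.31, 0.10, 0.10)),
    (1, (0.52, 0.48, 0.20, 0.22)),
]
assignment, total_cost = bipartite_match(targets, preds)
```

Each ground-truth box ends up with exactly one prediction, and the leftover slot is paired with "no object", so there are no duplicate detections to suppress afterwards.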

Parallel Decoding

As mentioned above, the transformer decoder decodes N outputs in parallel instead of decoding one element at a time.
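A rough way to picture this: every one of the N query vectors attends over the encoder output in the same pass, and no output is fed back in as the next input. The sketch below is a heavily simplified single-head attention with no learned projections, written only to show that independence; the names and values are assumptions, not DETR's decoder.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, memory):
    """Single-head dot-product attention; memory serves as keys and values.

    The loop is only how plain Python walks the list: no iteration reads a
    previous iteration's output, so all N results could be computed at once.
    """
    d = len(memory[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * mi for qi, mi in zip(q, m)) / math.sqrt(d)
                  for m in memory]
        weights = softmax(scores)
        outputs.append([sum(w * m[k] for w, m in zip(weights, memory))
                        for k in range(d)])
    return outputs

# N = 3 hypothetical object queries attending over 4 encoder tokens.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0]]
outputs = cross_attention(queries, memory)
```

An autoregressive decoder would instead need the first output before it could start computing the second one.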


It outperformed the SoTA (in 2015) model, Faster R-CNN!

DETR’s performance was compared with Faster R-CNN on the COCO dataset. To be honest, comparing against a SoTA model published years ago doesn’t seem quite fair.

However, it’s undoubtedly a big step for the object detection field. After the transformer was published, researchers tried hard to bring it into computer vision models in a reasonable way, but the characteristics of the transformer aren’t well suited to two-dimensional (image) input. DETR achieves this by extracting features with a CNN and flattening the CNN’s final output into one-dimensional data. This use of the transformer is not only reasonable, it’s brilliant.