Digging into Detectron 2

Source: Deep Learning on Medium

Part 1: Basic network architecture and repo structure

Figure 1. Inference result of Faster (Base) R-CNN with Feature Pyramid Network.

Hi, I’m Hiroto Honda, a computer vision researcher¹ [LinkedIn profile].

In this article I would like to share what I have learned about Detectron 2: the repo structure, building and training a network, handling a dataset, and so on.

In 2019 I won 6th place in the Open Images competition (ICCV 2019) using maskrcnn-benchmark, the framework Detectron 2 is based on. Understanding the whole framework was not an easy task for me, so I hope this article helps researchers and engineers who want to learn the details of the system and develop their own models.

What’s Detectron 2?

Detectron 2² is a next-generation open-source object detection system from Facebook AI Research. With the repo you can use and train various state-of-the-art models for detection tasks such as bounding-box detection, instance and semantic segmentation, and person keypoint detection.

You can run a demo by following the instructions in the repository ([Installation] and [Getting Started]), but if you want to go further than just running the example commands, you need to know how the repo works.

Faster R-CNN FPN architecture

As an example I choose the Base (Faster) R-CNN with Feature Pyramid Network³ (Base-RCNN-FPN), the basic bounding-box detector that can be extended to Mask R-CNN⁴. A Faster R-CNN⁵ detector with an FPN backbone is a multi-scale detector that achieves high accuracy on objects from tiny to large, which has made it a de-facto standard detector (see Fig. 1).

Let’s look at the structure of the Base R-CNN FPN:

Figure 2. Meta architecture of Base RCNN FPN.

The schematic above shows the meta architecture of the network. You can see that it consists of three blocks:

  1. Backbone Network: extracts feature maps from the input image at different scales. Base-RCNN-FPN’s output features are called P2 (1/4 scale), P3 (1/8), P4 (1/16), P5 (1/32) and P6 (1/64). Note that the non-FPN (‘C4’) architecture outputs a feature map only at the 1/16 scale.
  2. Region Proposal Network: detects object regions from the multi-scale features. By default, 1,000 box proposals with confidence scores are obtained.
  3. Box Head: crops and warps the feature maps into multiple fixed-size features using the proposal boxes, then obtains refined box locations and classification results via fully-connected layers. Finally, at most 100 boxes (by default) are kept after non-maximum suppression (NMS). The box head is one of the sub-classes of ROI Heads. Mask R-CNN, for example, has additional ROI heads such as a mask head.
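To get a feel for the backbone's output scales, here is a minimal sketch in plain Python (not Detectron 2's actual code) that computes the spatial sizes of the P2–P6 feature maps from the strides listed above. The function name is my own, chosen for illustration:

```python
def fpn_feature_sizes(height, width):
    """Return {level: (h, w)} for FPN outputs P2 (1/4) through P6 (1/64)."""
    strides = {"P2": 4, "P3": 8, "P4": 16, "P5": 32, "P6": 64}
    # Each level downsamples the input by its stride (integer division).
    return {name: (height // s, width // s) for name, s in strides.items()}

# For an 800x1280 input: P2 is 200x320, P6 is 12x20.
print(fpn_feature_sizes(800, 1280))
```

This makes it clear why the deeper levels (P5, P6) are suited to large objects: a single cell on a 12×20 map covers a 64×64-pixel region of the input.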
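The NMS step in the box head can be sketched conceptually with greedy non-maximum suppression. The following is an illustrative plain-NumPy version, not Detectron 2's implementation (which uses a batched torchvision op); the `max_boxes` cap mirrors the 100-box default mentioned above:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5, max_boxes=100):
    """Greedy NMS. boxes: (N, 4) [x1, y1, x2, y2]; scores: (N,).
    Returns indices of the kept boxes, highest score first."""
    order = np.argsort(scores)[::-1]  # sort by descending score
    keep = []
    while order.size > 0 and len(keep) < max_boxes:
        i = order[0]
        keep.append(int(i))
        # IoU between the current top box and the remaining candidates
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Drop candidates that overlap the kept box too much
        order = order[1:][iou <= iou_thresh]
    return keep
```

For example, given two heavily overlapping boxes and one distant box, only the higher-scoring of the overlapping pair survives, plus the distant box.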

What’s inside each block? Fig. 3 shows the detailed architecture: