Source: Deep Learning on Medium
Part 1: Basic network architecture and repo structure
Hi, I’m Hiroto Honda, a computer vision researcher¹ [LinkedIn profile].
In this article I would like to share what I have learned about Detectron 2: its repo structure, how to build and train a network, how to handle a data set, and so on.
In 2019 I took 6th place in the Open Images competition (ICCV 2019) using maskrcnn-benchmark, the framework Detectron 2 is based on. Understanding the whole framework was not an easy task for me, so I hope this article helps researchers and engineers who are eager to learn the details of the system and develop their own models.
What’s Detectron 2?
Detectron 2² is a next-generation open-source object detection system from Facebook AI Research. With the repo you can use and train various state-of-the-art models for detection tasks such as bounding-box detection, instance and semantic segmentation, and person keypoint detection.
You can run a demo by following the instructions in the repository ([Installation] and [Getting Started]), but if you want to go further than just running the example commands, you need to understand how the repo works.
Faster R-CNN FPN architecture
As an example I choose the Base (Faster) R-CNN with Feature Pyramid Network³ (Base-RCNN-FPN), the basic bounding-box detector that is extendable to Mask R-CNN⁴. The Faster R-CNN⁵ detector with an FPN backbone is a multi-scale detector that achieves high accuracy on objects from tiny to large, making it the de facto standard detector (see Fig. 1).
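To get a feel for what “multi-scale” means in practice, here is a minimal sketch (plain Python, no detectron2 dependency; the function name is my own) that computes the feature-map sizes an FPN backbone produces for a sample input, using Base-RCNN-FPN’s 1/4 to 1/64 output strides:

```python
# Sketch: feature-map sizes produced by an FPN backbone (P2-P6).
# The strides follow Base-RCNN-FPN's 1/4 ... 1/64 output scales.

def fpn_feature_shapes(height, width):
    """Return {level_name: (h, w)} for FPN levels P2-P6."""
    strides = {"p2": 4, "p3": 8, "p4": 16, "p5": 32, "p6": 64}
    # Each level downsamples the input by its stride (ceil for odd sizes).
    return {
        name: (-(-height // s), -(-width // s))  # ceil division
        for name, s in strides.items()
    }

shapes = fpn_feature_shapes(800, 1280)
print(shapes)  # p2 is 200x320, while p6 is only 13x20
```

Small objects are detected on the fine, high-resolution maps (P2) and large objects on the coarse ones (P5, P6), which is what lets a single detector cover such a wide range of object sizes.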
Let’s look at the structure of the Base R-CNN FPN:
The schematic above shows the meta architecture of the network. Now you can see there are three blocks in it, namely:
- Backbone Network: extracts feature maps from the input image at different scales. Base-RCNN-FPN’s output features are called P2 (1/4 scale), P3 (1/8), P4 (1/16), P5 (1/32) and P6 (1/64). Note that the non-FPN (‘C4’) architecture’s output feature comes only from the 1/16 scale.
- Region Proposal Network: detects object regions from the multi-scale features. By default, 1,000 box proposals with confidence scores are obtained.
- Box Head: crops and warps feature maps using the proposal boxes into multiple fixed-size features, and obtains fine-tuned box locations and classification results via fully connected layers. Finally, at most 100 boxes (by default) are kept after filtering out duplicates with non-maximum suppression (NMS). The box head is one of the sub-classes of ROI Heads; Mask R-CNN, for example, has additional ROI heads such as a mask head.
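Both the RPN’s proposal selection and the box head’s final filtering follow the same greedy pattern: sort boxes by score, keep a box only if it does not overlap an already-kept box too much, and cap the number of survivors. A minimal pure-Python sketch of that idea (not detectron2’s actual batched GPU implementation):

```python
# Minimal greedy NMS sketch: keep high-scoring boxes, drop near-duplicates.
# Boxes are (x1, y1, x2, y2); `max_keep` mirrors the box head's default
# cap of 100 surviving boxes.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5, max_keep=100):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
        if len(keep) == max_keep:
            break
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # → [0, 2]: the near-duplicate second box is suppressed
```

In detectron2 the real work is done by a vectorized `batched_nms` on the GPU, but the kept/suppressed logic is the same as in this sketch.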
What’s inside each block? Fig. 3 shows the detailed architecture: