Digging into Detectron 2 (part 2)

Source: Deep Learning on Medium

Digging into Detectron 2 (part 2)

Feature Pyramid Network

Figure 1. Inference result of Faster (Base) R-CNN with Feature Pyramid Network.

Hi I’m Hiroto Honda, a computer vision researcher¹.

In this article I would like to share my learnings about Detectron 2 — repo structure, building and training a network, handling a data set and so on.

Detectron 2 ² is a next-generation open-source object detection system from Facebook AI Research.

In part 1, I have shown the following detailed schematic diagram (Fig. 1) that depicts the settings of the standard Base-RCNN-FPN network. [config file]
This time, we are going deeper into the Backbone Network — the Feature Pyramid Network³ (FPN).

Figure 2. Detailed architecture of Base-RCNN-FPN. Blue labels represent class names.

The role of the backbone network is to extract feature maps from the input image. Here there are not any bounding boxes, anchors, or loss functions yet!

Input and Output of FPN

Firstly we will clarify input and output of FPN. Figure 2 is the closer look at the FPN schematic.

Figure 3. Detailed architecture of the backbone of Base-RCNN-FPN with ResNet50. Blue labels represent class names. (a), (b) and (c) inside the blocks stand for the bottleneck types detailed in Fig. 5.

input (torch.Tensor): (B, 3, H, W) image

B, H and W stand for batch size, image height and width respectively. Be careful that the order of input color channels is Blue, Green and Red (BGR). If you put an RGB image as input, detection accuracy might drop.

output (dict of torch.Tensor): (B, C, H / S, W / S) feature maps

C and S stand for channel size and stride. By default C=256 for all the scales and S = 4, 8, 16, 32 and 64 for P2, P3, P4, P5 and P6 outputs respectively.

For example, if we put a single image whose size is (H=800, W=1280) into the backbone, the input tensor size is torch.Size([1, 3, 800, 1280]) and the output dict should be:

output["p2"].shape -> torch.Size([1, 256, 200, 320]) # stride = 4 
output["p3"].shape -> torch.Size([1, 256, 100, 160]) # stride = 8
output["p4"].shape -> torch.Size([1, 256, 50, 80]) # stride = 16
output["p5"].shape -> torch.Size([1, 256, 25, 40]) # stride = 32
output["p6"].shape -> torch.Size([1, 256, 13, 20]) # stride = 64

Fig 4. shows what the actual output feature maps look like. One pixel of ‘P6’ feature corresponds to broader area of input image than ‘P2’- in other words ‘P6’ has a larger receptive field than ‘P2’. FPN can extract multi-scale feature maps with different receptive fields.

Figure 4: Example of input and output of FPN. The feature at the 0th channel is visualized from each output.

Code Structure⁴ for Backbone modeling

The related files are under detectron2/modeling/backbone directory:

│ ├─backbone
│ │ ├─backbone.py <- includes abstract base class Backbone
│ │ ├─build.py <- call builder function specified in config
│ │ ├─fpn.py <- includes FPN class and sub-classes
│ │ ├─resnet.py <- includes ResNet class and sub-classes

The following is the class hierarchy.

FPN (backbone/fpn.py)
ResNet (backbone/resnet.py)
│ ├ BasicStem (backbone/resnet.py)
BottleneckBlock (backbone/resnet.py)
LastLevelMaxPool (backbone/fpn.py


ResNet⁵ consists of a stem block and ‘stages’ that contain multiple bottleneck blocks. As for ResNet50, the block structure is :

(res2 stage, 1/4 scale)
BottleneckBlock (b)(stride=1, with shortcut conv)
BottleneckBlock (a)(stride=1, w/0 shortcut conv) × 2
(res3 stage, 1/8 scale)
BottleneckBlock (c)(stride=2, with shortcut conv)
BottleneckBlock (a)(stride=1, w/o shortcut conv) × 3
(res4 stage, 1/16 scale)
BottleneckBlock (c)(stride=2, with shortcut conv)
BottleneckBlock (a)(stride=1, w/o shortcut conv) × 5
(res5 stage, 1/32 scale)
BottleneckBlock (c)(stride=2, with shortcut conv)
BottleneckBlock (a)(stride=1, w/o shortcut conv) × 2

ResNet101 and ResNet152 have larger number of bottleneck blocks (a), defined at: [code link].

(1) BasicStem (stem block) [code link]

The ‘stem’ block of ResNet is quite simple. It down-samples the input image twice by 7×7 convolution with stride=2 and max pooling with stride=2.
The output of the stem block is a feature map tensor whose size is (B, 64, H / 4, W / 4).

- conv1 (kernel size = 7, stride = 2)
- batchnorm layer
- ReLU
- maxpool layer (kernel size = 3, stride = 2)

(2) BottleneckBlock [code link]

The bottleneck block is originally proposed in the ResNet paper⁵. The block has three convolution layers whose kernel sizes are 1×1, 3×3, 1×1 respectively. The input and output channel numbers of 3×3 convolution layer are smaller than the input and output of the block, for efficient computation.

There are three types of bottleneck blocks as shown in Fig.5 :
(a): stride=1, w/o shortcut conv
(b): stride=1, with shortcut conv
(c) : stride=2, with shortcut conv

shortcut convolution (used in (b), (c))

ResNet has identity shortcut that adds the input and the output features. For the first block of a stage (res2-res5), a shortcut convolution layer is used to match the number of channels of input and output.

downsampling convolution with stride=2 (used in (c))

At the first block of the res3, res4 and res5 stages, the feature map is downsampled by a convolution layer with stride=2. A shortcut convolution with stride=2 is also used, because the input channel number is not the same as the output.

Note that a ‘convolution layer’ mentioned above contains convolution torch.nn.Conv2d and normalization (e. g. FrozenBatchNorm⁶). [code link]
ReLU activation is used after convolutions and feature addition (see Fig. 5).

Figure 5. Three types of bottleneck blocks.


FPN contains ResNet, lateral and output convolution layers, up-samplers and a last-level maxpool layer. [code link]

lateral convolution layers [code link]

This layer is called ‘lateral’ convolution because FPN is originally depicted like a pyramid where the stem layer is placed at the bottom (it is rotated in this article). The lateral convolution layers take features from the res2-res5 stages with different channel numbers and returns 256-ch feature maps.

output convolution layers [code link]

An output convolution layer contains 3×3 convolution that does not change number of channels.

forward process [code link]

Figure 6. Zooming into a part of FPN schematic that deals with res4 and res5.

The forward process of FPN starts from the res5 output (see Fig. 6).
After going through the lateral convolution, the 256-channel feature map is named as prev_features and fed to the output convolution, to be registered to the results list.

The prev_features tensor is up-sampled using F.interpolate with nearest neighbor, to be named as top_down_features.

Next, the res4 output goes through the lateral convolution and added to top_down_features. The resulting feature map updates prev_features.
Then the new prev_features tensor is fed to the output convolution and the result tensor is inserted to the results list.

The routine above (from up-sampling to insertion to the results) is carried out three times, and finally the result list contains four tensors — namely P2, P3, P4 and P5.

LastLevelMaxPool [code link]

To make the P6 output, a max pooling layer with kernel size = 1 and stride = 2 is added to the final block of the ResNet. This layer just down-samples P5 features (1/32 scale) to 1/64-scale features to be added to the result list.

To Be Continued…

Now we have obtained multi-scale feature maps. In the next part we will detect object regions using region proposal network. Thank you for reading and please wait for the next part!

part 1 : Introduction — Faster R-CNN FPN architecture and repo structure
part 2 (you are here) : Feature Pyramid Network
part 3 (next story!) : Proposal Generator
part 4 : Box Head