Learning Scale-Permuted Backbone for Recognition and Localization



Current recognition and localization networks [2] use a CNN backbone to extract promising features from the input image. The first few layers of the network learn to detect general features, such as edges and color blobs.

In traditional convolutional neural networks, the middle layers apply successive convolution operations to extract increasingly specific features. However, during this process some feature information is inevitably lost as the spatial resolution shrinks.

VGG-16 structure. Source

We tend to scale down the resolution of the feature maps while increasing their number. Recognizing an object should not depend on its exact pixel location: even after downsampling the image, the network should still detect that the object is present.
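This translation tolerance of downsampling can be seen in a tiny NumPy sketch (purely illustrative; `max_pool_2x2` is a hypothetical helper, not part of any network discussed here):

```python
import numpy as np

def max_pool_2x2(x):
    """Downsample a 2D feature map by taking the max over 2x2 windows."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A bright "object" activation placed at two different locations.
a = np.zeros((8, 8)); a[1, 1] = 1.0
b = np.zeros((8, 8)); b[6, 5] = 1.0

# After two rounds of pooling, resolution drops from 8x8 to 2x2, but the
# peak response survives in both cases: presence is kept, the precise
# pixel location is not.
pa = max_pool_2x2(max_pool_2x2(a))
pb = max_pool_2x2(max_pool_2x2(b))
print(pa.max(), pb.max())  # → 1.0 1.0
```

The peak lands in different cells of the 2×2 output, but it is never lost.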

In the low-level features we find edges and simple shapes; as you go up through the network, features become more abstract but less localized. The shallow layers can extract detailed features of small objects, while the deep layers carry high-level semantics with almost no precise location information.

Visualization of the Feature Maps Extracted From Blocks in the VGG16 Model. Source

The common practice now is to add a decoder that restores the feature resolution, with multi-scale cross-layer connections between the decoder and the encoder to produce better multi-scale feature maps.

The network contains symmetric layers of convolution (encoder) and deconvolution (decoder). Skip shortcuts are connected every few layers from convolutional feature maps to their mirrored deconvolutional feature maps. The response from a convolutional layer is directly propagated to the corresponding mirrored deconvolutional layer, both forward and backward. Source

We force the bottleneck of the network to learn high-level features that help with pixel-wise segmentation, while skip connections between layers of the same size recover the fine spatial detail lost on the way down.
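A minimal NumPy sketch of this encoder-decoder pattern with mirrored skip connections (illustrative only; real networks use learned convolutions, not bare mean pooling and nearest-neighbour upsampling):

```python
import numpy as np

def downsample(x):
    """Encoder step: halve spatial resolution with 2x2 mean pooling."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """Decoder step: nearest-neighbour 2x upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.random.rand(8, 8)
e1 = downsample(x)        # 4x4 encoder feature
e2 = downsample(e1)       # 2x2 bottleneck
d1 = upsample(e2) + e1    # skip connection from the mirrored encoder layer
d0 = upsample(d1) + x     # full resolution again, with fine detail re-added
print(d0.shape)
```

The additions are the skip shortcuts: each decoder stage fuses the upsampled coarse feature with the encoder feature of the same size.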

Google Research proposes SpineNet [1], which merges the encoder and decoder into a single scale-permuted backbone.

An example of scale-decreased network (left) vs. scale-permuted network (right). The width of a block indicates feature resolution and the height indicates feature dimension. Dotted arrows represent connections from/to blocks not plotted. Source [1]

In the typical architecture we start with high resolution, and as we go deeper through the layers the resolution shrinks while the number of feature channels grows.

They build a new backbone network by restricting the search to permutations of layers, where each layer connects only to earlier layers. The network can be viewed as the result of searching over the arrangement and connections of the feature blocks of a standard ResNet-50 network.

Taking ResNet-50 as the baseline, they use its bottleneck blocks as the candidate feature blocks of the search space, then search for the ordering of the feature blocks and the two input connections of each block.

They use Neural Architecture Search (NAS) [6]: a reinforcement learning agent decides on the ordering and the connections within an action space and proposes candidate architectures; each candidate is trained and measured, and the measured performance is fed back to the agent as the reward signal.
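The search space can be sketched as follows: each candidate is a permutation of the blocks plus two input connections per block, drawn only from earlier blocks (or the stem). Here a toy random-search loop with a made-up reward stands in for both the RL controller and the expensive train-and-evaluate step; everything below is a hypothetical illustration, not the paper's actual controller:

```python
import random

# Levels of the candidate blocks (a ResNet-style pool of bottlenecks).
LEVELS = [2, 2, 3, 3, 4, 5, 6, 7]

def sample_architecture(rng):
    """One candidate: a permutation of the blocks plus, for each block,
    two input connections drawn from earlier blocks (or the stem, -1)."""
    order = rng.sample(range(len(LEVELS)), len(LEVELS))
    parents = []
    for i in range(len(order)):
        pool = order[:i] + [-1]          # earlier blocks, or the stem
        parents.append((rng.choice(pool), rng.choice(pool)))
    return order, parents

def reward(arch):
    """Stand-in for 'train the model, measure AP'. Purely illustrative:
    prefers candidates whose two parents differ."""
    order, parents = arch
    return sum(1 for (a, b) in parents if a != b)

rng = random.Random(0)
best = max((sample_architecture(rng) for _ in range(200)), key=reward)
print(reward(best))
```

In the real system the reward is detection AP after training the proposed backbone, and the controller is updated with reinforcement learning rather than replaced by random sampling.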

Just like NAS-FPN [4], five output feature blocks are selected at levels 3 to 7 to produce the final P3–P7 multi-scale feature maps. The remaining blocks serve as intermediate feature blocks.

Table 1: Number of blocks per level for stem and scale-permuted networks. The scale-permuted network is built on top of a scale-decreased stem network as shown in Figure 4. The size of the scale-decreased stem network is gradually decreased to show the effectiveness of the scale-permuted network.

As the restrictions on the search space are progressively lifted, the scale-decreased stem is gradually shrunk and the scale-permuted portion of the backbone is gradually enlarged. Once the number of blocks per level is no longer fixed, the allocation across levels L2–L7 can also change; the final searched model is SpineNet-49.

The biggest barrier to cross-scale fusion is that the resolution and feature dimension of the two layers being fused are usually completely different, so a dedicated resampling step is needed.

Figure 5: Resampling operations. Spatial resampling to upsample (top) and to downsample (bottom) input features, followed by resampling in the feature dimension before feature fusion. Source [1]

Both spatial scaling and channel scaling must be handled when connecting across layers. A scaling factor α (default 0.5) first adjusts the input feature dimension C to α×C; nearest-neighbor up-sampling or down-sampling then matches the target resolution.

Finally, a 1×1 convolution matches the α×C channels of the resampled feature map to the channel count of the target feature map.
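The two resampling paragraphs above can be sketched in NumPy as follows (a hedged illustration: random weight matrices stand in for the learned 1×1 convolutions, and nearest-neighbour indexing handles both directions here, whereas the paper's downsampling path uses strided convolution and pooling):

```python
import numpy as np

rng = np.random.default_rng(0)

def resample(x, target_hw, target_c, alpha=0.5):
    """Resample feature map x of shape (H, W, C) to a target resolution
    and target feature dimension, as sketched in the text above."""
    h, w, c = x.shape
    # 1) 1x1 "conv" (channel projection) shrinking C -> alpha*C.
    c_mid = max(1, int(alpha * c))
    x = x @ rng.standard_normal((c, c_mid))
    # 2) Nearest-neighbour spatial resize to the target resolution.
    th, tw = target_hw
    rows = np.arange(th) * h // th
    cols = np.arange(tw) * w // tw
    x = x[rows][:, cols]
    # 3) 1x1 "conv" matching alpha*C to the target feature dimension.
    return x @ rng.standard_normal((c_mid, target_c))

f = rng.standard_normal((16, 16, 64))
out_up = resample(f, (32, 32), 128)   # upsample and widen
out_dn = resample(f, (8, 8), 32)      # downsample and narrow
print(out_up.shape, out_dn.shape)
```

After resampling, the feature map has exactly the shape of the fusion target, so the two can simply be summed.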

The authors further explored SpineNet-49 and, trading latency against accuracy, derived three more networks of the same family, named SpineNet-49S, SpineNet-96, and SpineNet-143. SpineNet-49S shares the structure of SpineNet-49, but its feature dimensions are only 75% of the original; SpineNet-96 repeats each structural block to deepen the network; and SpineNet-143, the most complex, uses three times the original blocks.

Figure 4: Building a scale-permuted network by permuting ResNet. From (a) to (d), the computation gradually shifts from ResNet-FPN to scale-permuted networks. (a) The R50-FPN model, spending most computation in ResNet-50 followed by a FPN, achieves 37.8% AP; (b) R23-SP30, investing 7 blocks in a ResNet and 10 blocks in a scale-permuted network, achieves 39.6% AP; (c) R0-SP53, investing all blocks in a scale-permuted network, achieves 40.7% AP; (d) The SpineNet-49 architecture achieves 40.8% AP with 10% fewer FLOPs (85.4B vs. 95.2B) by learning additional block adjustments. Rectangle blocks represent bottleneck blocks and diamond blocks represent residual blocks. Output blocks are indicated by red borders. Source [1]


From the experimental results, SpineNet achieves very good results: it can be used for detection, instance segmentation, and classification, and it shows a clear improvement over ResNet-FPN backbones, e.g. in Mask R-CNN.

Learning cross-layer connections for a scale-permuted network works better than learning connections for a fixed-scale arrangement. By letting NAS handle multi-scale fusion and even the selection and adjustment of blocks, the backbone design becomes more principled, without relying on handcrafted priors.


  1. Xianzhi Du et al.: SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization. arXiv:1912.05027
  2. Zhengxia Zou, Zhenwei Shi, Yuhong Guo, Jieping Ye: Object Detection in 20 Years: A Survey.
  3. Zewen Li et al.: A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects.
  4. Golnaz Ghiasi, Tsung-Yi Lin, Ruoming Pang, Quoc V. Le: NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection.
  5. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun: Deep Residual Learning for Image Recognition.
  6. Thomas Elsken, Jan Hendrik Metzen, Frank Hutter: Neural Architecture Search: A Survey.