Original article was published on Deep Learning on Medium
Learning Scale-Permuted Backbone for Recognition and Localization
Current recognition and localization networks use a CNN backbone to extract promising features from the input image. The first few layers of the network learn to detect general features, such as edges and color blobs.
In a traditional convolutional neural network, the middle layers apply a series of convolutions to extract features. During this process, some feature information is inevitably lost as the spatial resolution shrinks after each convolution.
We tend to scale down the resolution of the feature maps while increasing their number. If there is an object in the image, recognizing it should not depend on its exact pixel location: even if we downsample the image, the network should still detect that the object is present.
In the low-level features we find edges and simple shapes; as we go up through the network, features become more abstract but less localized. The early, high-resolution layers can capture the fine details of small objects, while the deep layers are semantically rich but retain almost no precise location information.
The common practice now is to attach a decoder that restores the feature resolution, with multi-scale cross-layer connections between the encoder and decoder to produce better multi-scale feature maps.
We force the bottleneck of the network to learn high-level features that help with pixel-level segmentation, and skip connections from layers of the same spatial size help recover the fine-grained detail that was lost on the way down.
Google Research proposes SpineNet, which merges the encoder and decoder into a single scale-permuted backbone.
In a typical architecture we start with a high resolution, and as we go deeper through the layers the resolution gets smaller while the number of feature channels grows.
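The scale-decreasing pattern just described can be sketched in a few lines of plain Python. The stride and width numbers below follow a ResNet-50-style progression (stem downsamples by 4, each stage halves resolution and doubles channels); they are illustrative, not taken from the article.

```python
# Sketch of the scale-decreasing pattern in a conventional backbone.
# Illustrative ResNet-50-style numbers are assumed (not from the article).
input_size, channels = 224, 64

resolution = input_size // 4  # the stem downsamples the input by 4
stages = []
for level in range(2, 6):  # feature levels L2..L5
    stages.append((level, resolution, channels))
    resolution //= 2  # each stage halves the spatial resolution...
    channels *= 2     # ...and doubles the feature dimension

for level, res, ch in stages:
    print(f"L{level}: {res}x{res} feature map, {ch} channels")
```

Running this prints the familiar pyramid: 56×56 with 64 channels at L2 down to 7×7 with 512 channels at L5, which is exactly the resolution-for-semantics trade-off the article describes.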
They build a new backbone by restricting themselves to permuting the layers and connecting each block only to blocks built before it. The resulting network can be viewed as the outcome of a search over the ordering of, and the connections between, the feature blocks of a standard ResNet-50 network.
Taking ResNet-50 as the baseline, the bottleneck blocks of ResNet-50 serve as candidate feature blocks in the search space; the search then determines the ordering of these blocks and the two input connections of each block.
They use Neural Architecture Search (NAS): a reinforcement learning agent decides on the ordering and the connections within an action space. The agent proposes candidate architectures, each one is trained and measured, and the resulting performance is fed back to the agent as the reward signal.
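To make the search space concrete, here is a toy stand-in: sample a permutation of candidate blocks and pick two parent connections per block from the blocks already built. The paper uses an RL controller to steer this choice; plain random sampling, the block counts, and the function name below are my own simplifications for illustration.

```python
import random

# Toy stand-in for the SpineNet search space: permute candidate blocks
# and pick two parent connections per block. The real method steers this
# sampling with a reinforcement learning controller.
def sample_architecture(num_blocks=7, num_stem_blocks=2, seed=None):
    rng = random.Random(seed)
    order = list(range(num_stem_blocks, num_stem_blocks + num_blocks))
    rng.shuffle(order)  # the scale permutation of the searched blocks
    connections = {}
    built = list(range(num_stem_blocks))  # stem blocks are always available
    for block in order:
        # each block takes exactly two inputs from already-built blocks
        parents = rng.sample(built, 2)
        connections[block] = tuple(parents)
        built.append(block)
    return order, connections

order, conns = sample_architecture(seed=0)
```

By construction, every block's two parents precede it in the build order, which is the "connect only to previously built blocks" constraint the article mentions.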
Just like in NAS-FPN, five output feature blocks, from level 3 to level 7, are selected to produce the final P3-P7 multi-scale feature maps. The remaining feature blocks are treated as intermediate blocks.
Starting from ResNet-50, the fixed scale-decreasing stem of the backbone is gradually shrunk while the scale-permuted portion is gradually enlarged. Once the restriction on the number of blocks per level is lifted as well, so that the proportions of L2, L3, L4, L5, L6, and L7 blocks are free to change, the fully searched model obtained is SpineNet-49.
The biggest obstacle to cross-scale fusion is that the resolutions and channel dimensions of the two layers being fused are likely to be completely different, so a resampling step is needed to reconcile them.
Both spatial scaling and channel scaling must be handled in these cross-layer connections. A scale factor α (0.5 by default) first adjusts the input channel count C to α×C; nearest-neighbor interpolation for up-sampling, or down-sampling, is then used to match the target resolution.
Finally, a 1×1 convolution maps the α×C channels of the resampled feature map to the channel count of the target feature map.
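The resampling steps above can be sketched with NumPy. Here the 1×1 convolutions are modeled as per-pixel channel-mixing matrix multiplies with random (untrained) weights, and nearest-neighbor resizing is done with index arrays; the function name and all shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Sketch of SpineNet-style cross-scale resampling. A 1x1 convolution is
# just a per-pixel matrix multiply over channels, so einsum suffices here.
def resample(feature, target_hw, target_c, alpha=0.5,
             rng=np.random.default_rng(0)):
    c, h, w = feature.shape
    # 1) squeeze channels C -> alpha*C with a 1x1 conv (random weights)
    c_mid = max(1, int(alpha * c))
    w1 = rng.standard_normal((c_mid, c))
    x = np.einsum('oc,chw->ohw', w1, feature)
    # 2) nearest-neighbor resize to the target spatial resolution
    th, tw = target_hw
    rows = np.arange(th) * h // th
    cols = np.arange(tw) * w // tw
    x = x[:, rows][:, :, cols]
    # 3) 1x1 conv to match the target channel count
    w2 = rng.standard_normal((target_c, c_mid))
    return np.einsum('oc,chw->ohw', w2, x)

out = resample(np.ones((256, 16, 16)), target_hw=(32, 32), target_c=128)
print(out.shape)  # (128, 32, 32)
```

With α = 0.5, the 256-channel input is squeezed to 128 channels, up-sampled from 16×16 to 32×32, and then projected to the target's 128 channels, so two feature maps of entirely different shapes can be summed.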
The authors then explored SpineNet-49 further and, trading off latency against performance, derived three more networks in the same family, named SpineNet-49S, SpineNet-96, and SpineNet-143. SpineNet-49S shares the same structure as SpineNet-49 but with feature dimensions at only 75% of the original; SpineNet-96 repeats the blocks of the original structure to deepen the network; and SpineNet-143, the most complex, uses three times the original structure.
The experimental results show that SpineNet achieves very good performance and can be used for detection, instance segmentation, and classification, with a clear improvement over ResNet-FPN and Mask R-CNN.
A learned scale-permuted network with cross-layer connections outperforms a learned arrangement with fixed scales. By letting NAS handle the multi-scale fusion, and even the selection and adjustment of the modules, the backbone becomes more sensible without relying on hand-crafted prior design.
- Xianzhi Du et al.: SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization, arXiv:1912.05027
- Zhengxia Zou, Zhenwei Shi, Yuhong Guo, Jieping Ye: Object Detection in 20 Years: A Survey
- Zewen Li et al.: A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects
- Golnaz Ghiasi, Tsung-Yi Lin, Ruoming Pang, Quoc V. Le: NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun: Deep Residual Learning for Image Recognition
- Thomas Elsken, Jan Hendrik Metzen, Frank Hutter: Neural Architecture Search: A Survey