Is NAS monopolized? We open-sourced a NAS pipeline outperforming those from Google, Facebook, and others


Approach breakdown

Our approach is mainly based on Single Path One-Shot NAS, combined with Squeeze-and-Excitation (SE), ShuffleNet V2+ and MobileNet V3 design elements. As in the original paper, the choice blocks and block channel scales are searched under multiple FLOPs and parameter-count constraints. This section elaborates on some implementation details.

Supernet Structure Design

For each ShuffleNasBlock, 4 choice blocks are explored: ShuffleNetBlock-3x3 (SNB-3), SNB-5, SNB-7 and ShuffleXceptionBlock-3x3 (SXB-3). Within each block, 8 channel choices are available: [0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0] * (BlockOutputChannel / 2). So each ShuffleNasBlock offers 32 possible choices, and with 20 blocks in this implementation the search space contains 32^20 design choices in total.
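
To make the choice space concrete, here is a minimal, framework-agnostic sketch of how one candidate architecture could be sampled from this supernet; the names BLOCK_TYPES, CHANNEL_SCALES and sample_candidate are ours, for illustration only.

import random

BLOCK_TYPES = ['SNB-3', 'SNB-5', 'SNB-7', 'SXB-3']          # 4 choice blocks per ShuffleNasBlock
CHANNEL_SCALES = [0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]   # 8 channel scales per block
NUM_BLOCKS = 20

def sample_candidate():
    """Uniformly sample one block type and one channel scale index for each of the 20 blocks."""
    block_choices = [random.randrange(len(BLOCK_TYPES)) for _ in range(NUM_BLOCKS)]
    channel_choices = [random.randrange(len(CHANNEL_SCALES)) for _ in range(NUM_BLOCKS)]
    return block_choices, channel_choices

# 4 block types * 8 channel scales = 32 choices per block, 32^20 candidates in total
print((len(BLOCK_TYPES) * len(CHANNEL_SCALES)) ** NUM_BLOCKS)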

We also applied SE, the ShuffleNet V2+ SE layout and the MobileNet V3 last convolution block design in the supernet. In total, the supernet contains 15.4 million trainable parameters, and the possible subnet FLOPs range from 168M to 841M.

Supernet Training

Unlike the original Single Path One-Shot NAS, we did not apply a uniform sampling distribution from the very beginning of training. The supernet was trained for 120 epochs in total: during the first 60 epochs only Block Selection was applied, and for the remaining 60 epochs we introduced a new approach, Channel Selection Warm-up, which gradually allows the supernet to be trained with a larger range of channel choices, following the schedule below.

# Supernet sampling schedule during Channel Selection warm-up
Epochs 1 - 60:  block selection (BS) only
Epoch 61:       [1.8, 2.0] + BS
Epoch 62:       [1.6, 1.8, 2.0] + BS
Epoch 63:       [1.4, 1.6, 1.8, 2.0] + BS
Epoch 64:       [1.2, 1.4, 1.6, 1.8, 2.0] + BS
Epochs 65 - 66: [1.0, 1.2, 1.4, 1.6, 1.8, 2.0] + BS
Epochs 67 - 69: [0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0] + BS
Epochs 70 - 73: [0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0] + BS
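
The schedule above can be expressed as a small helper that returns the channel scales allowed at a given epoch. This is a sketch of the idea rather than the exact code from the repository:

CHANNEL_SCALES = [0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]

def allowed_channel_scales(epoch):
    """Channel Selection warm-up: gradually unlock smaller channel scales.
    Returns the channel scales the supernet may sample at this epoch
    (empty list during the block-selection-only phase)."""
    if epoch <= 60:
        return []                        # epochs 1-60: block selection only
    elif epoch <= 64:
        # epoch 61 -> last 2 scales, 62 -> last 3, 63 -> last 4, 64 -> last 5
        return CHANNEL_SCALES[-(epoch - 59):]
    elif epoch <= 66:
        return CHANNEL_SCALES[-6:]       # epochs 65-66
    elif epoch <= 69:
        return CHANNEL_SCALES[-7:]       # epochs 67-69
    else:
        return CHANNEL_SCALES            # epoch 70 onwards: full range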

The reason this is necessary during supernet training is that, in our experiments, we found that for a supernet without SE, doing Block Selection from the beginning works well, whereas doing Channel Selection from the beginning prevents the network from converging at all. The Channel Selection range needs to be enlarged gradually, otherwise accuracy crashes in a free fall. Even then, the range can only cover (0.6 ~ 2.0); smaller channel scales (0.2, 0.4) also make the network crash. For a supernet with SE, Channel Selection with the full range of choices (0.2 ~ 2.0) can be used from the beginning and the network converges, but doing so seems to harm accuracy: compared to the same SE-supernet trained with Channel Selection warm-up, the model doing Channel Selection from scratch stayed about 10% behind in Top-1 accuracy throughout the whole procedure.

Subnet Searching

During the searching stage, the Block choices and Channel choices are jointly searched in the supernet. Each instance in the population of our genetic algorithm therefore contains 20 Block choice genes and 20 Channel choice genes. The aim is to find a combination of the two in which the choices optimize for, and complement, each other.
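
As an illustration, the joint encoding and the basic genetic operators could look like the following sketch; the function and field names are ours, not the repository's.

import random

NUM_BLOCKS, NUM_BLOCK_TYPES, NUM_SCALES = 20, 4, 8

def random_individual():
    """20 block-choice genes plus 20 channel-choice genes, each stored as an index."""
    return {'blocks':   [random.randrange(NUM_BLOCK_TYPES) for _ in range(NUM_BLOCKS)],
            'channels': [random.randrange(NUM_SCALES) for _ in range(NUM_BLOCKS)]}

def crossover(a, b):
    """Per-gene uniform crossover applied to both gene groups jointly."""
    return {key: [random.choice(pair) for pair in zip(a[key], b[key])]
            for key in ('blocks', 'channels')}

def mutate(ind, prob=0.1):
    """Randomly re-draw a few genes so the search keeps exploring."""
    for key, upper in (('blocks', NUM_BLOCK_TYPES), ('channels', NUM_SCALES)):
        ind[key] = [random.randrange(upper) if random.random() < prob else g
                    for g in ind[key]]
    return ind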

For each qualified subnet structure (i.e. one with a lower Σ Normalized Score than the baseline OneShot searched model), as most weight-sharing NAS approaches do, the BN statistics are first updated with 20,000 fixed training-set images, and then the subnet's ImageNet validation accuracy is evaluated as the indicator of its performance.
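
A minimal sketch of that BN recalibration step is shown below, assuming an MXNet/Gluon supernet whose forward call accepts the block and channel choices; the function name and call signature are our assumptions and the real repository code differs in detail.

import mxnet as mx
from mxnet import autograd, nd

def recalibrate_bn(supernet, block_choices, channel_choices, fixed_batches, ctx=mx.cpu()):
    """Reset BN running statistics, then re-estimate them on ~20,000 fixed training images."""
    # 1) Reset the running statistics left over from supernet training.
    for name, param in supernet.collect_params().items():
        if name.endswith('running_mean'):
            param.set_data(nd.zeros(param.shape))
        elif name.endswith('running_var'):
            param.set_data(nd.ones(param.shape))
    # 2) Forward the fixed batches in training mode so BatchNorm updates its
    #    moving statistics; no gradients are computed or applied.
    #    NOTE: passing the choices as forward arguments is an assumption.
    for batch in fixed_batches:
        with autograd.train_mode():
            supernet(batch.as_in_context(ctx), block_choices, channel_choices)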

Subnet Training

The final searched model was built and trained from scratch; no supernet weights are reused in the subnet.

As for the hyperparameters, the official GluonCV ImageNet training script was modified to support both supernet training and subnet training. The subnet was trained for 360 epochs with an initial learning rate of 1.3, weight decay of 0.00003, a cosine learning rate scheduler, 4 GPUs each with a batch size of 256, label smoothing, and no weight decay on the BN beta/gamma parameters.
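
Those hyperparameters translate roughly into the following Gluon setup. This is a sketch, not the modified GluonCV script itself: the 'nag' optimizer, the momentum value and the CosineSchedule helper are our assumptions.

import math
import mxnet as mx
from mxnet import gluon

class CosineSchedule(mx.lr_scheduler.LRScheduler):
    """Cosine decay from base_lr down to 0 over max_update optimizer steps."""
    def __init__(self, base_lr, max_update):
        super().__init__(base_lr=base_lr)
        self.max_update = max_update

    def __call__(self, num_update):
        t = min(num_update, self.max_update) / self.max_update
        return 0.5 * self.base_lr * (1.0 + math.cos(math.pi * t))

def make_trainer(subnet, epochs=360, steps_per_epoch=1251):  # ~1.28M images / (4 GPUs * 256)
    # No weight decay on BatchNorm beta/gamma, as described above.
    for _, param in subnet.collect_params('.*beta|.*gamma').items():
        param.wd_mult = 0.0
    return gluon.Trainer(subnet.collect_params(), 'nag', {
        'learning_rate': 1.3,   # initial LR from the article
        'wd': 3e-5,             # weight decay 0.00003
        'momentum': 0.9,        # assumption, not stated in the article
        'lr_scheduler': CosineSchedule(1.3, epochs * steps_per_epoch),
    })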

Results

Supernet Training

Supernet Searching

We tried both random search (randomly selecting 250 qualified instances and evaluating their performance) and genetic search. The genetic method easily found a better subnet structure than random selection did.

Searched Models Performance

OneShot-S+ is a model whose block choices and channel choices were searched by this implementation, with ShuffleNetV2+ style SE and the MobileNetV3 last convolution block design.

OneShot+ is a customized model whose block choices and channel choices are taken from the original paper, with ShuffleNetV2+ style SE and the MobileNetV3 last convolution block design.

OneShot-S+ Profiling

A detailed op-by-op profiling can be found here. The calculation follows the MicroNet Challenge policy, which is slightly different from how most papers report FLOPs.
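
For reference, the Σ Normalized Score mentioned earlier reduces to a very small computation. The sketch below assumes the 2019 MicroNet ImageNet-track normalization constants (a MobileNetV2-1.4 baseline with roughly 6.9M parameters and 1170M math operations, with multiplications and additions counted separately); check the official rules for the exact values.

# Assumed constants: MicroNet 2019 ImageNet-track baseline (MobileNetV2-1.4).
BASELINE_PARAMS = 6.9e6    # parameter-storage baseline
BASELINE_OPS = 1170e6      # math-operation baseline (mults and adds counted separately)

def micronet_score(num_params, num_ops):
    """Sigma Normalized Score: lower is better; the baseline itself scores 2.0."""
    return num_params / BASELINE_PARAMS + num_ops / BASELINE_OPS

# Example: a model with 3.5M parameters and 600M math operations.
print(round(micronet_score(3.5e6, 600e6), 3))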

Summary

In this work, we provide a state-of-the-art open-sourced weight-sharing Neural Architecture Search (NAS) pipeline, which can be trained and searched on ImageNet entirely within 60 GPU hours (on 4 V100 GPUs) over an exploration space of about 32^20 candidates. The model searched by this implementation outperforms other NAS-searched models, such as Single Path One-Shot, FBNet, MnasNet, DARTS, NASNet and PNASNet, by a good margin across FLOPs, number of parameters and Top-1 accuracy. Considering the MicroNet Challenge Σ score, it also outperforms, without any quantization, base models such as MobileNet V1, V2, V3 and ShuffleNet V1, V2, V2+.

Although OneShot-S+ achieves a better MicroNet Challenge score than MobileNet V3, it consumes more FLOPs. We have been working on designing and searching for a model with both fewer parameters and fewer FLOPs than MobileNet V3; results will be presented in the near future.

If you find this work interesting, don't forget to clone it on GitHub!