Source: Deep Learning on Medium
Our approach is mainly based on Single Path One-Shot NAS, combined with Squeeze-and-Excitation (SE), ShuffleNet V2+ and MobileNet V3. As in the original paper, the choice blocks and block channel scales are searched under multiple FLOPs and parameter-count constraints. In this section, some implementation details are elaborated.
Supernet Structure Design
Within each ShuffleNasBlock, 4 choice blocks were explored: ShuffleNetBlock-3x3 (SNB-3), SNB-5, SNB-7 and ShuffleXceptionBlock-3x3 (SXB-3). Within each block, 8 channel choices are available: [0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0] * (BlockOutputChannel / 2). So each block has 4 × 8 = 32 possible choices, and there are 20 blocks in this implementation, amounting to 32^20 design choices in total.
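The size of this search space follows directly from the numbers above; a quick sketch of the arithmetic (constant names here are illustrative, not the repository's):

```python
# Size of the supernet search space described above:
# 4 candidate blocks x 8 channel scales = 32 choices per layer,
# repeated independently over 20 ShuffleNasBlocks.
NUM_BLOCK_CHOICES = 4
CHANNEL_SCALES = [0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
NUM_LAYERS = 20

choices_per_layer = NUM_BLOCK_CHOICES * len(CHANNEL_SCALES)  # 32
search_space_size = choices_per_layer ** NUM_LAYERS          # 32^20

print(choices_per_layer)  # 32
print(search_space_size)  # 32**20, roughly 1.3e30 candidate subnets
```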
We also applied SE, the ShuffleNet V2+ SE layout and the MobileNet V3 last convolution block design in the supernet. Finally, the supernet contains 15.4 million trainable parameters, and the possible subnets cover a wide range of FLOPs.
Unlike the original Single Path One-Shot NAS, we didn't apply a uniform sampling distribution from the beginning of training. The supernet was trained for 120 epochs in total: in the first 60 epochs only block selection was applied, and for the remaining 60 epochs a new approach, Channel Selection Warm-up, was introduced, which gradually allows the supernet to be trained with a larger range of channel choices.
# Supernet sampling schedule: during channel selection warm-up
Epochs 1 - 60:  block selection (BS) only
Epoch 61:       [1.8, 2.0] + BS
Epoch 62:       [1.6, 1.8, 2.0] + BS
Epoch 63:       [1.4, 1.6, 1.8, 2.0] + BS
Epoch 64:       [1.2, 1.4, 1.6, 1.8, 2.0] + BS
Epochs 65 - 66: [1.0, 1.2, 1.4, 1.6, 1.8, 2.0] + BS
Epochs 67 - 69: [0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0] + BS
Epochs 70 - 73: [0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0] + BS
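The schedule above can be sketched as a per-epoch sampler that draws one block choice per layer and, once the warm-up starts, one channel scale per layer from the currently unlocked range. Function and constant names here are illustrative, not the repository's API:

```python
import random

CHANNEL_SCALES = [0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
NUM_LAYERS = 20
NUM_BLOCK_CHOICES = 4

def unlocked_scales(epoch):
    """Channel scales available at a given epoch, per the warm-up schedule."""
    if epoch <= 60:
        return []              # block selection only, no channel selection yet
    if epoch == 61:
        lo = 6                 # [1.8, 2.0]
    elif epoch == 62:
        lo = 5
    elif epoch == 63:
        lo = 4
    elif epoch == 64:
        lo = 3
    elif epoch <= 66:
        lo = 2
    elif epoch <= 69:
        lo = 1
    else:
        lo = 0                 # full range [0.6 .. 2.0]
    return CHANNEL_SCALES[lo:]

def sample_architecture(epoch, rng=random):
    """Draw one (block choices, channel scales) pair for a training step."""
    blocks = [rng.randrange(NUM_BLOCK_CHOICES) for _ in range(NUM_LAYERS)]
    scales = unlocked_scales(epoch)
    # Before warm-up starts, keep a fixed default scale per layer
    # (using 1.0 here is an assumption of this sketch).
    channels = ([rng.choice(scales) for _ in range(NUM_LAYERS)]
                if scales else [1.0] * NUM_LAYERS)
    return blocks, channels
```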
This needs to be done during supernet training because, in our experiments, we found that for a supernet without SE, doing block selection from the beginning works well, whereas doing channel selection from the beginning causes the network not to converge at all. The channel selection range needs to be enlarged gradually, otherwise training crashes with a free-fall drop in accuracy. And even then the range can only cover (0.6 ~ 2.0): smaller channel scales (0.2, 0.4) make the network crash too. For the supernet with SE, channel selection with the full range of choices (0.2 ~ 2.0) can be used from the beginning and the network converges. However, doing so seems to harm accuracy: compared to the same SE-supernet trained with Channel Selection Warm-up, the model doing channel selection from scratch lagged behind by 10% Top-1 accuracy throughout the whole procedure.
During the searching stage, the block choices and channel choices are searched jointly in the supernet. This means each instance in the population of our genetic algorithm contains 20 block choice genes and 20 channel choice genes. We aimed to find a combination of the two in which the choices complement and optimize for each other.
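Concretely, each candidate in the genetic search can be encoded as two gene vectors of length 20, with crossover and mutation applied to both vectors jointly so block and channel choices evolve together. This is a simplified sketch under that encoding, not the repository's actual search code:

```python
import random

NUM_LAYERS = 20
NUM_BLOCK_CHOICES = 4      # candidate blocks per layer
NUM_CHANNEL_CHOICES = 8    # channel scale indices per layer

def random_candidate(rng=random):
    """One population member: 20 block genes + 20 channel genes."""
    return {
        "block_genes":   [rng.randrange(NUM_BLOCK_CHOICES) for _ in range(NUM_LAYERS)],
        "channel_genes": [rng.randrange(NUM_CHANNEL_CHOICES) for _ in range(NUM_LAYERS)],
    }

def crossover(a, b, rng=random):
    """Uniform crossover applied to both gene vectors jointly."""
    child = {}
    for key in ("block_genes", "channel_genes"):
        child[key] = [x if rng.random() < 0.5 else y
                      for x, y in zip(a[key], b[key])]
    return child

def mutate(cand, prob=0.1, rng=random):
    """Re-draw each gene independently with probability `prob`."""
    limits = {"block_genes": NUM_BLOCK_CHOICES,
              "channel_genes": NUM_CHANNEL_CHOICES}
    for key, hi in limits.items():
        cand[key] = [rng.randrange(hi) if rng.random() < prob else g
                     for g in cand[key]]
    return cand
```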
For each qualified subnet structure (one with a lower Σ Normalized Score than the baseline OneShot searched model), as most weight-sharing NAS approaches do, the BN statistics were first updated with 20,000 fixed training-set images, and then the subnet's ImageNet validation accuracy was evaluated as the indicator of its performance.
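The BN recalibration step amounts to re-estimating each BatchNorm layer's running mean and variance from scratch by forwarding the fixed 20,000 images through the sampled subnet. A framework-agnostic sketch of the statistics update for one layer (not the GluonCV API; `batches` stands for that layer's activations on the fixed image set):

```python
import numpy as np

def recalibrate_bn(batches):
    """Re-estimate BN running mean/variance over a fixed set of batches.

    Each element of `batches` is an (N, C) activation array; statistics are
    merged incrementally with the parallel mean/variance update (Chan et al.),
    so the result matches computing them over all batches at once.
    """
    count, mean, m2 = 0, None, None
    for batch in batches:
        n = batch.shape[0]
        b_mean = batch.mean(axis=0)
        b_var = batch.var(axis=0)
        if mean is None:
            count, mean, m2 = n, b_mean, b_var * n
        else:
            delta = b_mean - mean
            total = count + n
            mean = mean + delta * n / total
            m2 = m2 + b_var * n + delta ** 2 * count * n / total
            count = total
    return mean, m2 / count  # new running mean, running variance
```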
The final searched model was built and trained from scratch; no supernet weights are reused in the subnet.
As for the hyperparameters, the official GluonCV ImageNet training script was modified to support both supernet training and subnet training. The subnet model was trained with an initial learning rate of 1.3, weight decay of 0.00003, a cosine learning rate scheduler, 4 GPUs each with batch size 256, label smoothing, and no weight decay for the BN beta and gamma parameters. The subnet was trained
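The cosine scheduler mentioned above decays the learning rate from its initial value toward zero over the course of training. A minimal sketch of that schedule (warm-up epochs and the exact GluonCV options are omitted; the default arguments here are assumptions for illustration):

```python
import math

def cosine_lr(epoch, total_epochs, base_lr=1.3, final_lr=0.0):
    """Cosine learning-rate decay from base_lr down to final_lr."""
    progress = epoch / total_epochs
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))

# The rate starts at base_lr, reaches half of it mid-training,
# and ends at final_lr.
```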
We tried both random search, randomly selecting 250 qualified instances and evaluating their performance, and genetic search. The genetic method easily found a better subnet structure than random selection.
Searched Models Performance
Oneshot-S+ is a model whose block choices and channel choices were searched by this implementation, with ShuffleNetV2+-style SE and the MobileNetV3 last convolution block design.
Oneshot+ is a customized model whose block choices and channel choices are taken from the paper, with ShuffleNetV2+-style SE and the MobileNetV3 last convolution block design.
A detailed op-to-op profiling can be found here. The calculation follows the MicroNet Challenge policy, which differs slightly from how most papers report FLOPs.
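For context, the MicroNet Challenge score (the Σ score referred to in this post) normalizes a model's parameter count and math-operation count against a MobileNetV2-1.4 reference, counting multiplications and additions separately; the reference constants below reflect our understanding of the ImageNet-track policy and should be checked against the official rules:

```python
def micronet_score(params, math_ops,
                   ref_params=6.9e6, ref_ops=1170e6):
    """MicroNet Challenge Σ score (lower is better).

    params:   trainable parameter count of the model
    math_ops: multiplications + additions for one inference
    The reference values (assumed here) are MobileNetV2-1.4's
    parameter and math-operation counts used by the ImageNet track.
    """
    return params / ref_params + math_ops / ref_ops
```

By construction, the reference model itself scores 2.0, so any model scoring below that improves on the baseline in aggregate.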
In this work, we provide a state-of-the-art open-source weight-sharing Neural Architecture Search (NAS) pipeline that can be trained and searched on ImageNet entirely within 60 GPU hours (on 4 V100 GPUs), with an exploration space of about 32^20. The model searched by this implementation outperforms other NAS-searched models, such as Single Path One-Shot, FBNet, MnasNet, DARTS, NASNet and PNASNet, by a good margin in all of FLOPs, number of parameters and Top-1 accuracy. Considering the MicroNet Challenge Σ score, it also outperforms, without any quantization, base models such as MobileNet V1/V2/V3 and ShuffleNet V1/V2/V2+.
Although OneShot-S+ achieves a better MicroNet Challenge score than MobileNet V3, it consumes more FLOPs than the latter. We have been working on designing and searching for a model with both fewer parameters and fewer FLOPs than MobileNet V3; results will be presented in the near future.
If you find this work interesting, don't forget to clone it on GitHub!