Does EfficientNet Improve CNN Scaling?

Source: Deep Learning on Medium


This is a deeper look at Google AI’s ‘EfficientNets’ paper on improving CNN scaling. It is a fascinating paper that achieves unprecedented results on the ImageNet dataset, and it is going to be presented next week at ICML.

Let’s take a look at the two main contributions of the paper:

  • A robust way to scale up CNNs to achieve the best results given a limited amount of resources.
  • A novel network architecture, EfficientNet-B0, that, when combined with efficient scaling, shows amazing results on image recognition tasks.

The researchers took a systematic approach to selecting a set of network scaling parameters and showed how to apply them to relevant models. They also pushed the envelope by applying this technique to their own ConvNet architecture, achieving 97.1% top-5 accuracy on ImageNet.

Scaling networks up the new way

Recently, the GPipe library showed that in some cases (after a network has reached sufficient depth and number of layers) it makes more sense to increase the input image resolution. This way the network contains the same number of parameters but can utilize fine-level image features. Which immediately raises a question: what should we scale first, then? The width and the number of blocks in the network, or the image resolution?
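To see why resolution scaling keeps the parameter count fixed, note that a convolution’s weights do not depend on the input resolution, while its compute cost does. A minimal back-of-the-envelope sketch (plain Python, with hypothetical layer sizes):

```python
def conv_params(k, c_in, c_out):
    # Weights of a KxK conv: K*K*C_in*C_out, plus C_out biases.
    # Note: no dependence on the feature-map resolution.
    return k * k * c_in * c_out + c_out

def conv_flops(k, c_in, c_out, h, w):
    # One multiply-add per weight per output pixel (same padding assumed).
    return k * k * c_in * c_out * h * w

# Hypothetical 3x3 layer, 64 -> 128 channels.
p = conv_params(3, 64, 128)
f_224 = conv_flops(3, 64, 128, 224, 224)
f_448 = conv_flops(3, 64, 128, 448, 448)
print(p)              # same parameter count at any resolution
print(f_448 / f_224)  # doubling resolution quadruples compute: 4.0
```

Doubling the input resolution quadruples the FLOPs of every conv layer, while the weight tensors stay exactly the same size.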

Mingxing Tan and Quoc V. Le, the authors of the original paper, identify three main scaling factors:

  • number of filters in a layer w
  • number of layers in the network d
  • size of the input image r

The authors also increase the dropout rate while increasing the EfficientNet network size, but this is not part of the recommended routine.

And they propose a novel technique for selecting a triplet [r, d, w] for a larger network. The technique can be summarized in the following two steps:

  1. Perform a grid search to select the best scaling coefficients [x*r, y*d, z*w] among those that satisfy the constraint x²*y*z² ≈ 2, where 2 is our reference scaling budget (resolution x and width z enter squared because they grow compute quadratically, while depth y grows it linearly). The idea behind it is that performing a grid search on a network only twice as large should not be a big deal.
  2. After that, for your actual scaling budget N, calculate the compound coefficient u = log2(N) and update x, y, z to your new scaling budget as x = pow(x, u), y = pow(y, u), z = pow(z, u). Total compute then grows by roughly (x²*y*z²)^u ≈ 2^u = N.
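The two steps above can be sketched in a few lines of Python. The grid search itself is stubbed out (it requires training real networks); as an illustration I plug in the base coefficients the paper reports for EfficientNet-B0 (depth 1.2, width 1.1, resolution 1.15):

```python
import math

# Base coefficients from step 1's grid search, satisfying x**2 * y * z**2 ~ 2
# (values reported in the paper for EfficientNet-B0).
x, y, z = 1.15, 1.2, 1.1  # resolution, depth, width multipliers

def compound_scale(x, y, z, n):
    """Step 2: re-target the base coefficients to an n-times compute budget."""
    u = math.log2(n)  # compute grows ~2**u, so u = log2(N)
    return x ** u, y ** u, z ** u

# Example: scale to an 8x compute budget.
r_mult, d_mult, w_mult = compound_scale(x, y, z, 8)
# Total compute grows by roughly (x**2 * y * z**2) ** u ~ 2 ** 3 = 8.
```

Scaling resolution by `r_mult`, depth by `d_mult`, and width by `w_mult` then spends the whole budget N in the balanced proportions found by the cheap 2x grid search.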

Simple, isn’t it? Hands down, this is a great and straightforward practical rule that is easy to understand, implement, and apply.


The researchers show that this heuristic works a lot better than scaling any single one of these parameters alone: single-parameter scaling starts showing diminishing returns much earlier than compound scaling.

The inefficiency of EfficientNet or why not all TFLOPS are born equal

EfficientNet itself, as I think of it, is a bit of a different story.

EfficientNet is a MobileNet on steroids. It was created using the latest Neural Architecture Search techniques over a space of efficient conv operations, combined with a FLOPs optimization objective. The main building blocks are MBConv6, MBConv1, and (a single?) Conv3x3. This means that EfficientNet inherits all the good and bad parts of the MobileNet architecture.

EfficientNet architecture (source)

MobileNet’s good parts are not as relevant at scale. Namely, the good parts are the low number of parameters and the low number of arithmetic operations (FLOPs). These two characteristics are extremely important in the world of pocket devices, which don’t have much memory and cannot crunch through numbers as fast as modern GPU/TPU units. When it comes to GPUs, however, the depthwise convolutions used in MobileNet are not as beneficial in terms of compute, because the implementations of regular convolutions are so damn fast. And the number of parameters is not a problem at all, because most of the memory is consumed by layer activations anyway. Remember the last time you ran out of your 11Gb of GPU memory while training a 100Mb model?
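The FLOPs savings from depthwise separable convolutions are easy to quantify. A quick count (plain Python, hypothetical layer shape) shows why MobileNet-style blocks look so cheap on paper, even though a well-optimized regular conv kernel often runs faster in wall-clock time on a GPU:

```python
def regular_conv_flops(k, c_in, c_out, h, w):
    # Multiply-adds of a standard KxK convolution.
    return k * k * c_in * c_out * h * w

def depthwise_separable_flops(k, c_in, c_out, h, w):
    # Depthwise KxK (one filter per input channel) + pointwise 1x1 conv.
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise

# Hypothetical 3x3 layer, 128 -> 128 channels, on a 56x56 feature map.
reg = regular_conv_flops(3, 128, 128, 56, 56)
sep = depthwise_separable_flops(3, 128, 128, 56, 56)
print(reg / sep)  # roughly 8-9x fewer FLOPs for the separable version
```

The ratio is roughly K² for large channel counts: a huge win on paper, but only if the hardware can actually execute the depthwise part efficiently.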

EfficientNet, while being a marvelous piece of engineering, is unlikely to replace, say, AmoebaNet when it comes to image recognition, due to the lower efficiency of GPU implementations of its operations, high latency, and difficulty to train. Nevertheless, it is an exciting piece of research that will allow us to scale better and will inspire the creation of new models for mobile devices.