Meet MixNet: Google Brain’s new State of the Art Mobile AI architecture.

Summary: By replacing single-kernel depthwise convolutions with a mixed grouping of 3×3 to 9×9 kernels, and by using neural architecture search to build the ‘MixNet’ architecture, a new state-of-the-art 78.9% ImageNet top-1 accuracy is achieved under standard mobile constraints. MixNet-L outperforms ResNet-153 with 8x fewer params, and MixNet-M matches its accuracy with 12x fewer params and 31x fewer FLOPS.

Tan and Le of Google Brain recently showcased a new depthwise convolutional kernel arrangement (MixConv), along with a new NN architecture optimized for efficiency and accuracy using MixConvs, in their paper MixConv: Mixed Depthwise Convolutional Kernels.

This article will summarize MixConvs, the core building block of MixNet, and the MixNet NN architecture itself, in preparation for you to use them in your own deep learning projects. (Personally, I’m planning to use it on our next run at the FastAI leaderboards.)

Let’s start with the results: MixNet was able to surpass the current suite of mobile architectures, as shown below, and set a new top-1 accuracy record (state of the art) with MixNet-L:

MixNet versus current suite of mobile deep learning architectures. (from paper).

Perhaps even more impressive are the computational efficiency comparisons. Mobile-based AI naturally requires maximum computational efficiency, and comparing MixNet-M and MixNet-L to a ResNet-153 is eye-opening:

MixNet-M, with only 5M params, matches the accuracy of ResNet-153 and its 60M params, while using nearly 31x fewer FLOPS!

In the chart above, ‘type’ refers to how the NN architecture was built: ‘manual’ means hand-built, ‘combined’ means hand-built with some neural architecture search, and ‘auto’ means entirely neural architecture search.

More importantly, a quick review of the numbers shows the substantial efficiency of the MixNet architecture. MixNet-M with only 5M params matches ResNet-153 with its 60M params. Further, MixNet-M requires 360M FLOPS vs 11B for ResNet-153.

How it works: Hopefully the results have convinced you that MixNet is worth a deeper look. To understand MixNet, we start with ‘MixConvs’ and the initial testing Tan and Le did around kernel sizes.

In a standard neural net, the depthwise convolution is usually done with a single fixed kernel size. Currently most architectures use 3×3, as a series of 3×3 convolutions was shown to be more efficient than the older 7×7 stem.
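For reference, a single fixed-size depthwise convolution of the kind described above can be written in a few lines of TensorFlow (a minimal sketch using the stock Keras DepthwiseConv2D layer, not code from the paper):

```python
import tensorflow as tf

# A standard depthwise convolution: one fixed kernel size (here 3x3)
# applied independently to every input channel.
fixed_dw = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding='same')

x = tf.random.normal([1, 32, 32, 16])  # NHWC input: batch, height, width, channels
y = fixed_dw(x)                        # same spatial size and channel count: [1, 32, 32, 16]
```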

However, Tan and Le decided to test the effect of kernel size on accuracy, running experiments that dropped different kernel sizes into the MobileNet architecture. Here is what they found:

From this study they were able to discern that a blend of kernel sizes could be more effective than any single fixed kernel, while at the same time capping the maximum kernel size at 9×9.

The smaller kernels (3×3, 5×5) serve to capture lower-resolution details, while the larger kernels (7×7, 9×9) capture higher-resolution patterns, ultimately building a more efficient network.

MixConvs: with the above intuition, the concept of blending multiple kernel sizes into a single layer was developed and termed “MixConv”.

MixConvs partition the incoming channels into groups and run a different kernel size against each group. This illustration from the paper clarifies:

MixConvs cluster incoming channels into groups and run various size kernels vs the typical one kernel size on all channels. (from the paper).

The result is that, by combining kernels, both low-resolution and high-resolution patterns are captured, producing a more accurate and more efficient depthwise convolution block.

A code example (TensorFlow) helps drive home the process. The paper includes a short TensorFlow snippet; the sketch below reconstructs the same idea (split the channels into groups, apply a different depthwise kernel size to each group, then concatenate) rather than reproducing the paper’s code verbatim:
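```python
import tensorflow as tf

def mix_conv(x, kernel_sizes=(3, 5, 7, 9), stride=1):
    """MixConv sketch: split the channels into one group per kernel size,
    run a depthwise convolution with a different kernel size on each group,
    then concatenate the groups back together.

    Note: the number of input channels must be divisible by len(kernel_sizes).
    """
    groups = tf.split(x, num_or_size_splits=len(kernel_sizes), axis=-1)
    outputs = [
        tf.keras.layers.DepthwiseConv2D(
            kernel_size=k, strides=stride, padding='same')(g)
        for g, k in zip(groups, kernel_sizes)
    ]
    return tf.concat(outputs, axis=-1)

# Example: 16 channels split into 4 groups of 4, each seeing a different kernel size.
x = tf.random.normal([1, 32, 32, 16])
y = mix_conv(x)   # output shape: [1, 32, 32, 16]
```

Note that when every group uses the same kernel size, this reduces to an ordinary depthwise convolution, which is why a MixConv can effectively be dropped in as a replacement for the vanilla depthwise layer.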

MixConvs become MixNets: In order to develop MixConvs into a more optimal form, the authors had to determine what blend of kernel sizes should be used per layer, aka the group size. Ultimately, using a range of 1–5 kernel groups was deemed the best standard, and by leveraging neural architecture search, the final MixNet-M was built. MixNet-L is simply MixNet-M with a depth multiplier of 1.3. Below is the MixNet-M architecture showing the varying kernel groups:

MixNet-M architecture (from the paper). MixNet-L is simply a 1.3 depth multiplier of -M.

As you can see, smaller kernels are used at the start, similar to current trends in modern ResNets. Larger kernels are then steadily integrated as data flows through the layers. Unlike single-kernel convolutions, where larger kernels typically degrade accuracy, the blended MixConvs leverage 7×7 and 9×9 kernels effectively to capture higher-resolution patterns.
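To make that progression concrete, here is a toy sketch (reusing the mix_conv function from the snippet above) in which the kernel mix widens with depth; the channel counts and per-stage kernel lists are illustrative only, not the actual MixNet-M specification:

```python
import tensorflow as tf

# Illustrative per-stage kernel mixes, loosely echoing the MixNet-M pattern of
# small kernels early and a wider 3x3-9x9 mix later. These are toy values,
# not the paper's exact configuration.
stage_kernels = [(3,), (3, 5), (3, 5, 7, 9)]

def toy_trunk(x):
    # Reuses the mix_conv sketch above; the channel count (16 here) must be
    # divisible by the number of kernel groups at every stage (1, 2 and 4).
    for kernels in stage_kernels:
        x = mix_conv(x, kernel_sizes=kernels)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU()(x)
    return x

y = toy_trunk(tf.random.normal([1, 32, 32, 16]))
```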

Performance results in a nutshell: The performance of the final MixNet architecture is perhaps best summarized with a comparison vs ResNet-153. Besides setting a new state-of-the-art record for ImageNet top-1 accuracy under mobile computational constraints (< 600M FLOPS), MixNet beats a generic ResNet-153 while using nearly an order of magnitude fewer params and FLOPS:

Direct comparison of MixNet-M and -L vs ResNet-153.

Using MixNet in your projects: For TensorFlow users, you are in luck, as MixNet has been open-sourced and is available at this GitHub repo:

PyTorch and FastAI users: A PyTorch implementation (unofficial) is available here:

I’m hoping to take the above PyTorch implementation, add in the Mish activation, remove the dropout layers, possibly slip in a few other tweaks such as replacing average pooling (article coming), and make it available for FastAI users and our FastAI team to use against our current leaderboard records.

Conclusion: MixNets, by blending a number of kernel sizes within a new architecture, are an impressive contribution to improving neural network efficiency and accuracy. MixNets are a classic example of powering deep learning forward with better architecture rather than simply adding computational power to get better results. Congrats to the researchers for their advancement of AI and for open-sourcing the MixNet architecture!

Note: To avoid confusion, ‘MixNet’ was also used as a name in a previous 2018 paper (Mixed Link Networks) for an architecture in which ResNet- and DenseNet-style connections were interwoven. Link to that paper: https://arxiv.org/abs/1802.01808v1