Accelerating mobile Semantic Segmentation models

Source: Deep Learning on Medium

Due to the limited computational resources on phones, it is unrealistic to deploy a float32 CNN model of 100 MB or more. Typically, people want their models compressed to less than 10 MB, or even 1 MB. Moreover, semantic segmentation is relatively computation-intensive, since it usually has to output a mask of the same size as the input.

Below we present several ways to optimize your model, especially when using TensorFlow Lite.

1 Profile models

It’s important to understand why your model is slow and to find the performance bottlenecks. TensorFlow Lite ships an official profiler, the benchmark tool, which reports the computation time of each layer along with some useful statistics.
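For example, assuming you have the prebuilt `benchmark_model` binary for your target architecture (paths, thread count, and model name below are illustrative), a typical on-device run with per-op profiling looks like:

```shell
# Push the benchmark binary and your model to an Android device,
# then run it with per-op profiling enabled.
adb push benchmark_model /data/local/tmp
adb push model.tflite /data/local/tmp
adb shell /data/local/tmp/benchmark_model \
    --graph=/data/local/tmp/model.tflite \
    --num_threads=4 \
    --enable_op_profiling=true
```

The profiling output breaks total inference time down per operator, which is usually enough to spot the one or two layers dominating latency.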

2 Resize images

Some popular image input sizes are 256×256, 224×224, 128×128…

The easiest and most effective way to accelerate a model is to reduce the input size. If you halve the input images (both height and width), the computation should drop to roughly 25% of the original, and in practice the accuracy won’t drop too much.
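The 25% figure follows directly from the cost of a convolution being proportional to the spatial area. A quick back-of-the-envelope check (the layer shape here is just an example):

```python
def conv_flops(h, w, c_in, c_out, k=3):
    """Approximate multiply-accumulates of one stride-1 convolution layer:
    every output pixel does a k*k*c_in dot product for each output channel."""
    return h * w * c_in * c_out * k * k

full = conv_flops(256, 256, 32, 64)
half = conv_flops(128, 128, 32, 64)
print(half / full)  # 0.25: halving H and W cuts the conv cost to a quarter
```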

However, note that the input size should be a multiple of 8, since many frameworks include dedicated optimizations for such sizes. As mentioned in this post:

The optimized implementations of convolution run best when the width and height of image is multiple of 8. Tensorflow Lite first loads multiples of 8, then multiples of 4, 2 and 1 respectively. Therefore, it is the best to keep the size of every input of layer as a multiple of 8.
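A small helper (hypothetical, purely for illustration) for snapping a desired input dimension to the nearest multiple of 8 before building the model:

```python
def round_to_multiple(x, base=8):
    """Round a spatial dimension to the nearest multiple of `base`,
    never going below `base` itself."""
    return max(base, int(round(x / base)) * base)

print(round_to_multiple(225))  # 224
print(round_to_multiple(130))  # 128
```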

3 Use efficient blocks

  1. Conv2D => SeparableConv2D

According to MobilenetV1, we can replace Conv2D with SeparableConv2D: for 3×3 kernels, a depthwise-separable convolution needs roughly 8–9× less computation than a standard convolution, at only a small cost in accuracy.
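The MobileNetV1 paper derives the cost ratio of a depthwise-separable convolution versus a standard one as 1/c_out + 1/k², which is why the practical speedup for 3×3 kernels is about 8–9× rather than a full #channels×. A quick check:

```python
def cost_ratio(c_out, k=3):
    """Cost of a depthwise-separable conv relative to a standard conv
    with the same output channels and kernel size: 1/c_out + 1/k^2."""
    return 1 / c_out + 1 / (k * k)

print(cost_ratio(64))  # ~0.127, i.e. roughly an 8x reduction for 3x3 kernels
```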

2. Conv2DTranspose => UpSampling2D (bilinear)

There are several common ways to upsample feature maps, such as Conv2DTranspose, UpSampling2D (bilinear), and UpSampling2D (nearest). Be careful with Conv2DTranspose: since it performs a full (possibly dilated) convolution with additional learned parameters, it can be much slower than UpSampling2D (bilinear). In addition, UpSampling2D (nearest) usually gives similar results and is even cheaper.
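Nearest-neighbour upsampling is cheap precisely because it involves no arithmetic at all, only pixel repetition. A minimal pure-Python sketch of what UpSampling2D with nearest interpolation does spatially (on a 2D grid, ignoring batch and channel dimensions):

```python
def upsample_nearest(img, factor=2):
    """Nearest-neighbour upsampling of a 2D grid: each pixel is
    repeated `factor` times along both height and width."""
    return [[row[x // factor] for x in range(len(row) * factor)]
            for row in img for _ in range(factor)]

img = [[1, 2],
       [3, 4]]
print(upsample_nearest(img))
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```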

3. Utilize state-of-the-art blocks

(1) ResNet Block

(2) ShuffleNet V2 Block (c, d in the figure)
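The ShuffleNet V2 block depends on a channel-shuffle step to mix information between channel groups after grouped convolutions. A minimal pure-Python sketch of that reshape-and-transpose shuffle, applied to channel indices for illustration:

```python
def channel_shuffle(channels, groups):
    """ShuffleNet-style channel shuffle: view the channel list as a
    (groups, channels_per_group) grid, then read it out column by column."""
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group) for g in range(groups)]

print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))  # [0, 3, 1, 4, 2, 5]
```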

4. Other advice

(1) Reducing FLOPs alone is not enough; memory access cost (MAC) should also be considered

(2) Reduce number of channels

(3) Prefer activation functions that involve only simple, piecewise-linear computation (e.g. ReLU)

(4) Convolutions with dilation_rate > 1 tend to be slower

(5) Pitfalls in softmax layer and demo code
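On the softmax pitfall: at inference time a segmentation model usually only needs the per-pixel argmax, and since softmax is monotonic, taking the argmax of the raw logits gives the identical prediction, so the softmax layer can often be dropped entirely from the deployed graph. A minimal illustration:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

logits = [1.2, -0.3, 3.7]  # per-pixel class scores
probs = softmax(logits)
# softmax preserves ordering, so the predicted class is the same either way
print(argmax(logits) == argmax(probs))  # True
```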

4 Quantization

  1. Post-training quantization

If you use post-training quantization, or fake quantization, your model’s size will be reduced, but ironically it can actually slow your model down in some cases.
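Post-training quantization stores weights as 8-bit integers plus a scale/zero-point pair. A minimal sketch of the affine quantize/dequantize round trip (illustrative values, not TFLite’s exact implementation):

```python
def quantize(x, scale, zero_point):
    """Affine quantization to the int8 range: q = round(x / scale) + zero_point."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def dequantize(q, scale, zero_point):
    """Recover an approximate float from the stored integer."""
    return (q - zero_point) * scale

w = 0.4217
q = quantize(w, scale=0.02, zero_point=0)        # stored as the int 21
print(dequantize(q, scale=0.02, zero_point=0))   # 0.42: small rounding error vs 0.4217
```

The rounding error introduced here is the accuracy cost of quantization, and the extra dequantize/requantize steps around ops that lack integer kernels are one reason a quantized model can end up slower.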

2. Quantization-aware training

Quantization-aware training is the most powerful way to accelerate your model. It requires retraining or fine-tuning the model with calibration data. On the other hand, such models may be hard to converge, so it may not be suitable for non-redundant models (those with too few parameters).

5 Other Frameworks?

If you are building models for iOS, you should definitely try Core ML, which is much faster than TFLite and easy to convert to.

If you are building models for Android, you should try NCNN.