The what and what not of running deep learning inference on mobile



This topic has been sitting in my drafts for a very long time. I have realized that the lack of documentation around it still leads to confusion, so I would like to publish what I have learned from my experience and understanding of it.

The article assumes basic knowledge of how deep learning works and how models are implemented. It also assumes you already have a trained model that makes predictions on a server or on your local laptop.

Now, if you want to do something similar on mobile, there are multiple ways to go about it, and we will discuss them step by step. As a running example, assume you have a video that you want to process through a deep learning module and get analytic metrics as output.

The entire explanation will concentrate on what and what not to do while implementing this pipeline on mobile.

1. Send the data to the server for processing

The idea here is to build a communication module that sends the content to a server, lets the server run inference on it, and receives the resulting metadata back on the device.
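For concreteness, here is a minimal sketch of that round trip, assuming a hypothetical /infer endpoint that accepts an encoded frame and returns JSON metadata (the URL and response fields are made up for illustration):

```python
# Minimal sketch of the server round trip; the endpoint and response
# format are hypothetical placeholders, not a real service.
import requests

INFER_URL = "https://example.com/infer"  # hypothetical endpoint

def infer_frame(jpeg_bytes: bytes) -> dict:
    # Upload one encoded frame; network latency dominates the total time.
    response = requests.post(
        INFER_URL,
        files={"frame": ("frame.jpg", jpeg_bytes, "image/jpeg")},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()  # e.g. {"labels": [...], "scores": [...]}

with open("frame.jpg", "rb") as f:
    print(infer_frame(f.read()))
```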

Advantages:

-> No friction between the mobile development team and the data science team: send the data, get the results back

Disadvantages:

-> Very high latency
-> Dependence on internet availability and syncing up with the server (altogether a different problem)
-> Privacy (the data cannot leave the phone)

These disadvantages cannot be ignored, and they lead us to the next big idea: running the inference on the device itself.

There are various ways of achieving this, and we will discuss them one after the other with their pros and cons. The easiest way in this direction is to

2. Use existing wrappers of dev frameworks

We will start with the frameworks that are well known and easy to get started with.

A few libraries in this direction are:

TensorFlow Lite

This is the mobile version of TensorFlow. There is no major learning curve, as the documentation and support are very well maintained. Train the model and run the inference on mobile almost as if you were running it on your local machine.
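As a rough sketch, with a recent TensorFlow release the conversion step looks something like this (the SavedModel path and file names are placeholders):

```python
# Convert a trained SavedModel to a .tflite file and sanity-check it locally.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")  # placeholder path
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)

# Load the converted model with the TFLite interpreter before shipping it.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
print(interpreter.get_input_details())
print(interpreter.get_output_details())
```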

Advantages:

-> Minimal effort in converting the models; little knowledge transfer needed between the data science and mobile development teams
-> Comparatively faster than the server approach
-> No platform dependency

Disadvantages:

-> Big model size
-> Super slow compared to laptop runtime

Caffe2Go:

This used to be supported by Facebook's Caffe team; it has now been folded into PyTorch. I haven't worked with the latest versions, but I will share my profiling based on my experience from about a year back.

Advantages:

-> Again, minimal effort in pushing from development to production, provided you were already working in Caffe
-> Comparatively faster than the server approach and TensorFlow Lite
-> No platform dependency

Disadvantages:

-> Big model size
-> Again, super slow compared to laptop runtimes
-> Highly unstable at the time I used it (about a year back)

At this stage, we have drastically improved performance compared to the server approach, but this is still not something we can use out of the box. The major problems with the above approaches are speed and model size, and we will try to address each one individually.

Processing Speed:

To solve the issue of speed, we can follow the same route we took on computers. The solution there was simple: use better hardware. In particular, the introduction of the GPU changed things.

In the same way, if we can use hardware acceleration or dedicated hardware on mobile, there is a possibility of speeding things up. This led to exploring how to use the hardware on mobile efficiently, and to dividing the hardware space into two types:

Controlled Hardware:

The reason to call this hardware controlled is that, across the different device models, there are no major changes in the hardware, apart from a few minor ones. So developing hardware-specific models is easier on Apple devices, which led us to explore what hardware capability is actually available.

Interestingly enough, we were exposed to the Metal framework at that point in time, which had just released convolution kernel operations as part of a framework upgrade. We built the model around these operators using Metal, and it ran impressively fast (Inception V3: 150 ms).

Using Metal directly is mostly unnecessary now, considering Apple's latest mobile framework for machine learning:

CoreML!

Convert -> Load -> Infer !

It supports most of the major operators, and conversion wrappers are available for almost all the popular libraries. Develop in whichever framework you prefer, convert your model to the Core ML format with the available wrappers, then load the model and get the result; you can see results in as few as six lines of code.
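Here is a minimal convert/load sketch using coremltools in Python (coremltools 4+ assumed; the model and file names are placeholders, and the actual load/infer step would normally live in the iOS app):

```python
# Convert a Keras model to Core ML format with coremltools.
import coremltools as ct
import tensorflow as tf

# Any trained Keras model would do; weights are omitted to keep the sketch small.
keras_model = tf.keras.applications.MobileNetV2(weights=None)

# Convert -> save the .mlmodel that gets bundled into the app.
mlmodel = ct.convert(keras_model, convert_to="neuralnetwork")
mlmodel.save("Model.mlmodel")

# Load it back and inspect the inputs/outputs; on-device loading and
# inference are done through the Core ML APIs in the app itself.
loaded = ct.models.MLModel("Model.mlmodel")
print(loaded.get_spec().description)
```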

Uncontrolled Hardware:

Android is the obvious candidate here, as the hardware varies hugely from manufacturer to manufacturer.

This platform is harder to handle the way we did for Apple. One of the approaches we targeted was hardware-specific libraries such as the Qualcomm Neural Processing SDK, which uses the extra hardware capabilities of Snapdragon 820+ phones.

The SDK uses the Snapdragon's GPU and DSP capabilities to speed up inference. The process is the same as with Core ML: train in any development framework, convert the model, and run it on the device through the SDK.

Profiling!

This approach roughly solves the speed problem, at least at a basic level; further speed improvements can be made depending on your specific application.

Model Size:

By default, model sizes are very large if a model is ported directly from server to mobile. The bigger the model, the more memory it consumes, and things get drastically more difficult once the user notices; you don't want to push users into downloading gigabytes of data every time you upgrade your model. There are various other technical difficulties with bigger models as well. So the solution is to reduce the size, which leads us to the following interesting approaches.

Quantisation:

Convert the 32-bit float (or 64-bit double) weights to 8-bit integer values, which reduces the model size by roughly 4x. Technically speaking, it is slightly tricky to do all of this well, but there are enough resources around to figure it out.

I found Pete Warden's articles on this very helpful at the time, and I would suggest skimming through them before trying the tools that exist in the market now.

https://petewarden.com/2017/06/22/what-ive-learned-about-neural-network-quantization/

A few months back, TensorFlow officially released supporting code for this as well:

https://www.tensorflow.org/performance/quantization
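With a recent TensorFlow release, the basic post-training weight quantisation flow looks roughly like this (the SavedModel path is a placeholder):

```python
# Post-training quantisation: weights are stored as 8-bit integers,
# which gives roughly the 4x size reduction mentioned above.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(quantized_model)
```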

Apple's Core ML has already started supporting 16-bit precision instead of 32-bit precision for the same reason:

https://developer.apple.com/documentation/coreml/reducing_the_size_of_your_core_ml_app

Model Pruning:

Remove unimportant weights from the model by working out which weights contribute least to the output.

https://jacobgil.github.io/deeplearning/pruning-deep-learning

https://arxiv.org/pdf/1611.06440.pdf
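As a toy illustration of the idea (not the exact methods from the links above), here is a magnitude-pruning sketch in NumPy that zeroes out the smallest weights; a real pruning pipeline would also fine-tune the network afterwards:

```python
# Toy magnitude pruning: keep only the largest-magnitude weights.
import numpy as np

def prune_weights(weights: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    # Zero out the fraction `sparsity` of weights with the smallest magnitude.
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask

w = np.random.randn(256, 128).astype(np.float32)
pruned = prune_weights(w, sparsity=0.7)
print("zeroed fraction:", 1.0 - np.count_nonzero(pruned) / pruned.size)
```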

Network Optimisation:

Understanding what is needed and what is not is the important aspect here. One has to optimise the network accordingly, for example by replacing large fully connected layers with 1x1 convolutions. There are various other ways of doing this if you understand your network well enough.
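As a quick sanity check of the savings, here is a small Keras sketch (the feature-map and class sizes are made up for illustration) comparing a flatten-plus-dense classifier head with a 1x1-convolution head:

```python
# Compare parameter counts of two classifier heads on a 7x7x512 feature map.
import tensorflow as tf

features = tf.keras.Input(shape=(7, 7, 512))

# Heavy head: Flatten + Dense(1000) -> about 25M weights.
dense_head = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000),
])

# Lighter head: a 1x1 convolution acts as the classifier -> about 0.5M weights.
conv_head = tf.keras.Sequential([
    tf.keras.layers.Conv2D(1000, kernel_size=1),
    tf.keras.layers.GlobalAveragePooling2D(),
])

print("dense head params:", tf.keras.Model(features, dense_head(features)).count_params())
print("conv head params:", tf.keras.Model(features, conv_head(features)).count_params())
```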

Conclusion:

I will conclude by answering the question posed in the title of the story.

What not to do?

Don't use the server-based approach if the data is images or video. You can get away with a server-based approach for small text and voice data.

What to do?

Considering my experience across all the points above, there is no single solution; the right approach depends entirely on the kind of problem you are trying to solve.

That being said, if the problem is image-based, then I strongly suggest writing your own wrappers around the hardware-specific libraries, which gives you a good boost in model speed and supports the model-size optimisations you might end up doing.

P.S.: My experiments with TensorFlow Lite are quite old, so please feel free to experiment with it again, as the modules may have been updated to keep pace with the industry.

You might also be interested in exploring paddle-mobile, Baidu's mobile version of their library. I haven't worked with it fully yet, but I have heard good profiling reviews. That being said, the library is reportedly still unstable at this point.

https://github.com/PaddlePaddle/paddle-mobile

Another big suggestion I want to put out: use ONNX as your preferred format for serialising and deserialising models during development. You will thank me for it a year or so down the line.

https://github.com/onnx/onnx
https://onnx.ai/
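For example, exporting a PyTorch model to ONNX takes only a few lines (the model and input shape here are placeholders):

```python
# Export a PyTorch model to ONNX so other toolchains can pick it up.
import torch
import torchvision

model = torchvision.models.resnet18().eval()  # placeholder model, untrained
dummy_input = torch.randn(1, 3, 224, 224)     # placeholder input shape

torch.onnx.export(
    model,
    dummy_input,
    "resnet18.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=11,
)
# The resulting .onnx file can then be fed to Core ML, SNPE or other converters.
```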

These opinions are based entirely on my experiments and experience from about a year back and might not hold for you. So please feel free to object if the content deviates from the facts; I will be more than happy to update it to the latest standards.
