Source: Deep Learning on Medium

# Accelerate your training and inference on TensorFlow

Are you running TensorFlow with its default setup? You can easily optimize it for your CPU/GPU and get up to a 3x speedup.

TensorFlow ships with default settings chosen for compatibility with as many CPUs/GPUs as possible. Tuning it to use the full capabilities of your hardware, such as AVX on the CPU or Tensor Cores on the GPU, can accelerate your code by up to 3x.

Similarly, if you are a startup, you might not have unlimited access to GPUs, or you may need to deploy a model on CPU; you can still optimize your TensorFlow code to reduce its size for faster inference on any device. Below I discuss several ways to accelerate your training, your inference, or both.

## Build your Tensorflow from source

The most popular way to install TensorFlow is via pip, but a pip-installed binary runs noticeably slower than it could. Why?

The default builds from `pip install tensorflow` are intended to be compatible with as many CPUs as possible. If you have ever watched the console logs while running a TensorFlow program, you have probably seen a warning like this: *"Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA"*

What does this warning mean? Modern CPUs provide many extensions to the low-level instruction set, such as SSE2, SSE4, AVX, and FMA, which accelerate vectorized math.

If you have a GPU, you needn't worry much about AVX support, because most expensive ops will be dispatched to the GPU device (unless explicitly set not to).

Building TensorFlow from source can speed up your program significantly; the warning above is TensorFlow nudging you to do exactly that. Build it from source optimized for *your* CPU, with whichever of AVX, AVX2, and FMA your CPU supports enabled.
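As a rough sketch (the authoritative steps are in TensorFlow's build documentation, and the exact flags vary by version), an optimized source build looks like this:

```shell
# Clone TensorFlow and configure the build (Bazel is required).
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
./configure   # answer the prompts; accept defaults unless you need CUDA

# -march=native enables every instruction-set extension (AVX, AVX2, FMA, ...)
# that the build machine's CPU supports.
bazel build --config=opt --copt=-march=native \
    //tensorflow/tools/pip_package:build_pip_package

# Package the result into a wheel and install it.
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-*.whl
```

Note that a wheel built with `-march=native` is tied to the CPU family it was built on, so don't reuse it on older machines.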

## XLA – Accelerated Linear Algebra

Accelerated Linear Algebra (XLA) is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with no changes to the source code.

When a TensorFlow program is run, all of the operations are executed individually by the TensorFlow executor. Each TensorFlow operation has a pre-compiled GPU kernel implementation that the executor dispatches to.

XLA provides an alternative mode of running models: it compiles the TensorFlow graph into a sequence of computation kernels generated specifically for the given model. Because these kernels are unique to the model, they can exploit model-specific information for optimization. Among XLA's many optimizations, fusion is the single most important one, and I discuss it in more detail later in this post.
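Fusion can be pictured with a toy sketch in plain Python (an illustration of the idea only, not XLA's actual code generation): instead of materializing an intermediate buffer for each op, a fused kernel computes the whole expression in a single pass over the data.

```python
def unfused(xs, w, b):
    # Two separate "kernels": each pass allocates a full intermediate list.
    scaled = [x * w + b for x in xs]       # kernel 1: multiply-add
    return [max(s, 0.0) for s in scaled]   # kernel 2: ReLU

def fused(xs, w, b):
    # One fused "kernel": no intermediate buffer, one pass over the data.
    return [max(x * w + b, 0.0) for x in xs]

print(fused([1.0, -2.0, 3.0], 2.0, 0.5))  # [2.5, 0.0, 6.5]
```

The fused version saves both the memory for the intermediate result and a round trip through memory bandwidth, which is exactly what XLA's kernel fusion buys on a GPU.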

The results are improvements in speed and memory usage: most internal benchmarks run ~1.15x faster after XLA is enabled.

Enabling XLA is quite easy:

```python
import tensorflow as tf

tf.config.optimizer.set_jit(True)

# ... the rest of your program ...
```


## Mixed Precision on NVIDIA GPUs

Mixed precision training offers significant computational speedup by performing operations in the half-precision format, while keeping critical parts of the network in single precision to retain as much information as possible.

There are numerous benefits to using numerical formats with lower precision than 32-bit floating point. First, they require less memory, enabling the training and deployment of larger neural networks. Second, they require less memory bandwidth, thereby speeding up data transfer operations. Third, math operations run much faster in reduced precision, especially on GPUs with Tensor Core support for that precision. Mixed precision achieves this by identifying the steps that require full precision, using 32-bit floating point only for those, and 16-bit floating point everywhere else.

- Speeds up math-intensive operations, such as linear and convolution layers, by using Tensor Cores.
- Speeds up memory-limited operations by accessing half the bytes compared to single-precision.
- Reduces memory requirements for training models, enabling larger models or larger mini-batches.

Among NVIDIA GPUs, those with compute capability 7.0 or higher will see the greatest performance benefit from mixed-precision because they have special hardware units, called Tensor Cores, to accelerate float16 matrix multiplications and convolutions.

**Mixed Precision in TensorFlow**

The mixed precision API is available from TensorFlow 2.1 through the Keras interface. To use mixed precision in Keras, you need to create what is typically referred to as a *dtype policy*. Dtype policies specify the dtypes layers will run in; setting a `mixed_float16` policy causes subsequently created layers to use mixed precision with a mix of float16 and float32.

```python
from tensorflow.keras.mixed_precision import experimental as mixed_precision

policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)

# Now design your model and train it
```

Important note: Tensor Cores, which provide the mixed-precision speedup, require certain tensor dimensions, such as the units of your Dense layers, the number of filters in Conv layers, and the number of units in RNN layers, to be multiples of 8.
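As a quick sanity check before training, a small helper (hypothetical, not part of TensorFlow) can verify that your layer sizes are Tensor Core friendly:

```python
def tensor_core_friendly(*dims, multiple=8):
    """Return True if every dimension is a multiple of `multiple`,
    the alignment Tensor Cores need for float16 math."""
    return all(d % multiple == 0 for d in dims)

# Dense units, Conv filters, and RNN units should all pass this check.
print(tensor_core_friendly(256, 64, 128))  # True
print(tensor_core_friendly(100))           # False: 100 is not a multiple of 8
```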

To compare the performance of mixed precision with float32, change the policy from `mixed_float16` to `float32`. The expected performance improvement is up to 3x.

## Improve Inference latency with Model Pruning

I have already covered this concept in one of my previous blogs. In brief, here is how pruning works:

If you could rank the neurons, or the connections between them, according to how much they contribute, you could then remove the low-ranking neurons or connections from the network, resulting in a smaller and faster network.
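A minimal magnitude-pruning sketch in plain Python (a toy illustration; real pruning tools such as the TensorFlow Model Optimization Toolkit apply this gradually over a training schedule):

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest |w|."""
    k = int(len(weights) * sparsity)  # number of connections to remove
    if k == 0:
        return list(weights)
    # Rank connections by absolute magnitude; the smallest contribute least.
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

print(prune_by_magnitude([0.9, -0.05, 0.4, 0.01], 0.5))  # [0.9, 0.0, 0.4, 0.0]
```

The zeroed weights can then be stored sparsely, shrinking the model and speeding up inference on devices that exploit sparsity.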