Accelerate your training and inference running on Tensorflow

Source: Deep Learning on Medium

Are you running Tensorflow with its default setup? You can easily optimize it to your CPU/GPU and get up to 3x acceleration.

TensorFlow ships with default settings chosen for compatibility with as many CPUs/GPUs as possible. You can easily optimize it to use the full capabilities of your CPU (such as AVX) or your GPU (such as Tensor Cores), for up to a 3x speedup.

Similarly, if you are a startup, you might not have unlimited access to GPUs, or you might need to deploy a model on CPU; you can still optimize your TensorFlow code to reduce its size for faster inference on any device. Below I discuss several ways to accelerate training, inference, or both.

Build your Tensorflow from source

The most popular way to install TensorFlow is via pip, but the resulting binary can be noticeably slower. Why?

The default builds from pip install tensorflow are intended to be compatible with as many CPUs as possible. If you have ever watched the console logs while running a TensorFlow program, you must have seen a warning like: “Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA”

What does this warning mean? Modern CPUs provide many extensions to the low-level instruction set, such as SSE2, SSE4, AVX, etc.

If you have a GPU, you shouldn’t care much about AVX support, because the most expensive ops will be dispatched to the GPU device (unless explicitly set not to).

Building TensorFlow from source can speed up your program significantly; the warning above is TensorFlow telling you exactly that. Build it from source optimized for your CPU, with whichever of AVX, AVX2, and FMA your CPU supports enabled.

XLA – Accelerated Linear Algebra

Accelerated Linear Algebra (XLA) is a domain-specific compiler for linear algebra and matrix operations. It can accelerate TensorFlow models with no changes to the source code.

When a TensorFlow program is run, all of the operations are executed individually by the TensorFlow executor. Each TensorFlow operation has a pre-compiled GPU kernel implementation that the executor dispatches to.

XLA provides an alternative mode of running models: it compiles the TensorFlow graph into a sequence of computation kernels generated specifically for the given model. Because these kernels are unique to the model, they can exploit model-specific information for optimization. Fusion, discussed in detail later in this post, is XLA’s single most important optimization.

The results are improvements in speed and memory usage: most internal benchmarks run ~1.15x faster after XLA is enabled.

Enabling XLA is quite easy-

import tensorflow as tf

tf.config.optimizer.set_jit(True)  # Enable XLA

# ... the rest of your program ...

You can try an XLA example in Colab.

Mixed Precision on NVIDIA GPUs

Mixed precision training offers significant computational speedup by performing operations in the half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network.

There are numerous benefits to using numerical formats with lower precision than 32-bit floating-point. First, they require less memory, enabling the training and deployment of larger neural networks. Second, they require less memory bandwidth, thereby speeding up data transfer operations. Third, math operations run much faster in reduced precision, especially on GPUs with Tensor Core support for that precision. Mixed precision achieves this by identifying the steps that require full precision and using 32-bit floating-point for only those steps, while using 16-bit floating-point everywhere else.

  • Speeds up math-intensive operations, such as linear and convolution layers, by using Tensor Cores.
  • Speeds up memory-limited operations by accessing half the bytes compared to single-precision.
  • Reduces memory requirements for training models, enabling larger models or larger mini-batches.
ResNet-50 training for ImageNet classification, 8 GPUs on a DGX-1: ~3x speedup over FP32 training at equal accuracy (source: NVIDIA).
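The memory and bandwidth halving is easy to see outside TensorFlow. A minimal NumPy sketch (the array shape is arbitrary, just for illustration):

```python
import numpy as np

# A tensor of one million parameters in single vs. half precision.
w32 = np.zeros((1000, 1000), dtype=np.float32)
w16 = w32.astype(np.float16)

print(w32.nbytes)  # 4000000 bytes
print(w16.nbytes)  # 2000000 bytes: half the memory and half the bandwidth
```

Every float16 tensor moved to or from GPU memory costs half the bytes of its float32 counterpart, which is where the memory-limited speedup comes from.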

Among NVIDIA GPUs, those with compute capability 7.0 or higher will see the greatest performance benefit from mixed-precision because they have special hardware units, called Tensor Cores, to accelerate float16 matrix multiplications and convolutions.

Mixed Precision in TensorFlow

The mixed precision API is available in TensorFlow 2.1 via the Keras interface. To use mixed precision in Keras, you need to create a dtype policy. Dtype policies specify the dtypes layers will run in. Setting a mixed_float16 policy causes subsequently created layers to use mixed precision with a mix of float16 and float32.

from tensorflow.keras.mixed_precision import experimental as mixed_precision

policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)  # Layers created after this use the policy

# Now design your model and train it

Important note: Tensor Cores, which provide the mixed-precision speedup, require certain tensor dimensions (such as the units of your Dense layers, the number of filters in Conv layers, and the number of units in RNN layers) to be multiples of 8.
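A quick sanity check along these lines; the layer sizes below are hypothetical examples, not from any particular model:

```python
def tensor_core_friendly(dims):
    """Return True if every dimension is a multiple of 8."""
    return all(d % 8 == 0 for d in dims)

# Hypothetical sizes: Dense units, Conv filters, RNN units.
print(tensor_core_friendly([512, 64, 256]))  # True
print(tensor_core_friendly([100, 64, 256]))  # False: 100 is not a multiple of 8
```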

To compare the performance of mixed precision with float32, change the policy from mixed_float16 to float32. The expected performance improvement is up to 3x.

Improve Inference latency with Model Pruning

I have already covered this concept in one of my previous blogs. In brief, here is how pruning works:

If you could rank the neurons or the connection in between them according to how much they contribute, you could then remove the low ranking neurons or connections from the network, resulting in a smaller and faster network.
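As a rough illustration of this ranking-and-removal idea, here is magnitude-based pruning in plain NumPy. This is only a conceptual sketch, not the toolkit's actual implementation:

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold  # keep only the high-magnitude weights
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(100, 100))
pruned = prune_by_magnitude(w, sparsity=0.5)
print(np.mean(pruned == 0))  # roughly 0.5 of the weights are now zero
```

The resulting sparse weight matrix compresses well and, with sparse-aware kernels, multiplies faster.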

Pruning in Tensorflow

TensorFlow provides the Model Optimization Toolkit for pruning and other post-training optimizations. Here is a simple example of using it in your code:

import tensorflow_model_optimization as tfmot

model = build_your_model()

pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5,
    begin_step=1000, end_step=3000)

model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)

# Compile and fit as usual, adding tfmot.sparsity.keras.UpdatePruningStep()
# to the fit() callbacks so the sparsity schedule advances each step.


Fusing multiple ops into a single op

Normally when you run a TensorFlow graph, all of the operations are executed individually by the TensorFlow graph executor. Each op has a pre-compiled GPU kernel implementation. Fused Ops combine operations into a single kernel for improved performance. For example-

def model_fn(x, y, z):
    return tf.reduce_sum(x + y * z)

Without fusion and without XLA, the graph launches three kernels: one for the multiplication, one for the addition, and one for the reduction.

With op fusion, you can compute the result in a single kernel launch. It does this by “fusing” the addition, multiplication, and reduction into a single GPU kernel.
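Conceptually, the difference looks like this in plain NumPy, where each unfused step materializes an intermediate array (a stand-in for a separate GPU kernel launch), while the fused version makes a single pass over the data:

```python
import numpy as np

x = np.arange(4.0)
y = np.arange(4.0)
z = np.arange(4.0)

# Unfused: three separate "kernels", each writing an intermediate to memory.
tmp1 = y * z            # kernel 1: multiply
tmp2 = x + tmp1         # kernel 2: add
unfused = np.sum(tmp2)  # kernel 3: reduce

# Fused: one pass over the data, no intermediate buffers materialized.
fused = sum(xi + yi * zi for xi, yi, zi in zip(x, y, z))

print(unfused == fused)  # True
```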

Fusion with Tensorflow 2.x

Newer TensorFlow versions come with XLA, which does fusion along with other optimizations for us.

import tensorflow as tf

# Ask XLA to compile (and fuse) the whole function into one executable.
# In TF >= 2.5 the argument is named jit_compile instead.
@tf.function(experimental_compile=True)
def model_fn(x, y, z):
    return tf.reduce_sum(x + y * z)

x = tf.constant([1.0, 2.0])
y = tf.constant([3.0, 4.0])
z = tf.constant([5.0, 6.0])

# `result` is a normal Tensor (albeit one computed by an XLA-compiled
# executable) and can be used like any other Tensor.
result = model_fn(x, y, z)

Examples of patterns fused:

■ Conv2D + BiasAdd + <Activation>

■ Conv2D + FusedBatchNorm + <Activation>

■ Conv2D + Squeeze + BiasAdd

■ MatMul + BiasAdd + <Activation>

Fusing ops together provides several performance advantages:

○ Completely eliminates Op scheduling overhead (big win for cheap ops)

○ Increases opportunities for ILP, vectorization etc.

○ Improves temporal and spatial locality of data access

E.g., MatMul is computed block-wise, and the bias and activation function can be applied while the data is still “hot” in cache.
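A sketch of that idea: applying the bias and activation inside the same loop that produces each output block, instead of in three separate passes over memory. Plain NumPy, purely illustrative:

```python
import numpy as np

def fused_dense(x, w, b):
    """MatMul + bias + ReLU, applied while each output row is still 'hot'."""
    out = np.empty((x.shape[0], w.shape[1]))
    for i in range(x.shape[0]):          # block-wise over output rows
        row = x[i] @ w                   # compute one block
        out[i] = np.maximum(row + b, 0)  # bias + ReLU before moving on
    return out

x = np.array([[1.0, -2.0]])
w = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([0.5, 0.5])

print(fused_dense(x, w, b))        # [[1.5 0. ]]
print(np.maximum(x @ w + b, 0))    # same result via three separate passes
```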

Fusion with Tensorflow 1.x

In TF 1.x, layers compatible with fused ops have a ‘fused’ argument, which needs to be set to True to use the faster fused implementation.

For example-

#Using TF1.x via the compat module in TF2.x
b1 = tf.compat.v1.layers.batch_normalization(
    input_layer, fused=True, data_format='NCHW')

#Or in pure TF1.x
b1 = tf.layers.batch_normalization(
    input_layer, fused=True, data_format='NCHW')