A Basic Introduction to TensorFlow Lite

Original article was published on Deep Learning on Medium


An introduction to TensorFlow Lite Converter, Quantized Optimization, and Interpreter to run Tensorflow Lite models at the Edge

In this article, we will understand the features required to deploy a deep learning model at the Edge, what is TensorFlow Lite, and how the different components of TensorFlow Lite can be used to make an inference at the Edge.


Suppose you are trying to deploy a deep learning model in an area with poor network connectivity, but you still need the model to perform well.

TensorFlow Lite can be used in such a scenario.

Features a Deep Learning model needs to make inferences at the Edge

TensorFlow Lite offers all the features required for making inferences at the Edge.

But what is TensorFlow Lite?

TensorFlow Lite is an open-source, production-ready, cross-platform deep learning framework that converts a pre-trained TensorFlow model to a special format that can be optimized for speed or storage.

The special format model can be deployed on edge devices such as mobile devices running Android or iOS, Linux-based embedded devices like the Raspberry Pi, or microcontrollers to make inferences at the Edge.

How does TensorFlow Lite (TF Lite) work?

Select and Train a Model

Let’s say you want to perform an image classification task. The first step is to decide on a model for the task.

Convert the Model using Converter

After the model is trained, you convert it to the TensorFlow Lite format. A TF Lite model is a special format that remains efficient in terms of accuracy while being a lightweight version that occupies less space; these properties make TF Lite models the right fit for mobile and embedded devices.

TensorFlow Lite conversion Process


During the conversion from a TensorFlow model to a TensorFlow Lite model, the file size is reduced. We can choose to reduce the file size further, at a trade-off with the execution speed of the model.

The TensorFlow Lite Converter converts a TensorFlow model to a TensorFlow Lite FlatBuffer file (.tflite).

The FlatBuffer file is deployed to the client, which in our case can be a mobile device running iOS or Android, or an embedded device.
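Concretely, the converter's output is a byte string; deployment starts by writing those bytes to a .tflite file that ships with the app. The placeholder bytes below are a stand-in for a real converted model:

```python
from pathlib import Path

# Placeholder bytes standing in for the real output of converter.convert()
tflite_model = b"\x00" * 16

# The .tflite FlatBuffer file is just these bytes written to disk
out_path = Path("model.tflite")
out_path.write_bytes(tflite_model)
assert out_path.stat().st_size == len(tflite_model)
```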

How can we convert a TensorFlow model to a TF Lite model?

After you have trained the Model, you will now need to save the Model.

The saved model serializes the architecture of the model, the weights and biases, and the training configuration in a single file. The saved model can easily be used for sharing or deploying models.

The converter supports models saved in the Keras HDF5 format or in the SavedModel format:

import tensorflow as tf

# Load the trained Keras model (saved in HDF5 format)
model_keras = tf.keras.models.load_model('model_keras.h5')
# Convert the tf.keras model to a TensorFlow Lite model
converter = tf.lite.TFLiteConverter.from_keras_model(model_keras)
tflite_model = converter.convert()

# Save your model in the SavedModel format
export_dir = 'saved_model/1'
tf.saved_model.save(model, export_dir)
# Convert the SavedModel to a TensorFlow Lite model
converter = tf.lite.TFLiteConverter.from_saved_model(export_dir)
tflite_model = converter.convert()

export_dir follows a convention where the last path component is the version number of the Model.

The SavedModel is a meta graph saved in export_dir, which is converted to a TF Lite model using tf.lite.TFLiteConverter.

Alternatively, export the model as a concrete function and then convert the concrete function to a TF Lite model:

# Export the model as a concrete function
func = tf.function(model).get_concrete_function(
    tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))
# Convert the concrete function to a TF Lite model
converter = tf.lite.TFLiteConverter.from_concrete_functions([func])
tflite_model = converter.convert()

Optimize the Model

Why optimize the model?

Models at the Edge need to be lightweight and have low latency to run inferences. A lightweight, low-latency model is achieved by reducing the amount of computation required for prediction.

Optimization reduces the size of the model or improves its latency. There is a trade-off between the size of the model and its accuracy.

How is optimization achieved in TensorFlow Lite?

TensorFlow Lite achieves optimization using quantization and weight pruning.


Quantization

When we save a TensorFlow model, it is stored as a graph containing the computational operations, activation functions, weights, and biases. The activation functions, weights, and biases are 32-bit floating-point numbers.

Quantization reduces the precision of the numbers used to represent the parameters of the TensorFlow model, and this makes models lightweight.

Quantization can be applied to weights and activations.
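In TF Lite, quantization is requested through the converter (e.g. setting converter.optimizations = [tf.lite.Optimize.DEFAULT] before converting). As a minimal numpy sketch of the underlying idea, 8-bit affine quantization maps each 32-bit float onto an int8 value via a scale and zero point; the random weights below are a hypothetical stand-in for one layer:

```python
import numpy as np

# Hypothetical float32 weights standing in for one layer of a model
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4)).astype(np.float32)

# Affine 8-bit quantization: map the float range onto int8 [-128, 127]
scale = float(weights.max() - weights.min()) / 255.0
zero_point = int(np.round(-weights.min() / scale)) - 128

q = np.clip(np.round(weights / scale + zero_point), -128, 127).astype(np.int8)

# Dequantize to see the (small) precision loss
deq = (q.astype(np.float32) - zero_point) * scale

# int8 storage is 4x smaller than float32 ...
assert q.nbytes * 4 == weights.nbytes
# ... and the round-trip error stays within about one quantization step
assert np.abs(deq - weights).max() <= scale * 1.001
```

This is why quantized models are roughly a quarter of the size of their float32 originals, at the cost of a bounded loss of precision.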

Weight Pruning

Just as we prune plants, removing non-productive parts to make them healthier and more fruit-bearing, we can prune the weights of a model.

Weight pruning trims parameters within a model that have little impact on the performance of the model.

Weight pruning achieves model sparsity, and sparse models are compressed more efficiently. Pruned models have the same size and run-time latency, but compress better, for faster download times at the Edge.
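In practice, pruning is applied during training (for example with the TensorFlow Model Optimization Toolkit's prune_low_magnitude wrapper). A minimal numpy sketch of the magnitude-pruning idea, and of why a sparse tensor compresses better even though its raw size is unchanged, could look like this (the layer size and the 80% sparsity target are illustrative):

```python
import numpy as np
import zlib

# Hypothetical float32 weights standing in for one layer of a model
rng = np.random.default_rng(1)
weights = rng.normal(size=(64, 64)).astype(np.float32)

# Magnitude pruning: zero out the 80% of weights closest to zero
sparsity = 0.8
threshold = np.quantile(np.abs(weights), sparsity)
pruned = np.where(np.abs(weights) < threshold, 0.0, weights).astype(np.float32)

# The tensor keeps its shape, so the raw size is unchanged ...
assert pruned.nbytes == weights.nbytes
# ... but the mostly-zero tensor compresses much better for download
dense_size = len(zlib.compress(weights.tobytes()))
sparse_size = len(zlib.compress(pruned.tobytes()))
assert sparse_size < dense_size
```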

Deploying the TF Lite model and making an Inference

A TF Lite model can be deployed on mobile devices running Android or iOS, and on edge devices like the Raspberry Pi and microcontrollers.

To make an inference on an Edge device, you need to initialize the interpreter and allocate tensors, preprocess the input, set the input tensor, invoke the interpreter, and read the output tensor.

# Load the TF Lite model and allocate tensors
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
# Get input and output tensor details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Read the image and resize it to the model's input size
img = cv2.imread(image_path)
img = cv2.resize(img, (WIDTH, HEIGHT))
# Preprocess the image: add a batch dimension and cast to float32
input_tensor = np.array(np.expand_dims(img, 0), dtype=np.float32)
# Point the input tensor to the data to be inferred
input_index = input_details[0]['index']
interpreter.set_tensor(input_index, input_tensor)
# Run the inference
interpreter.invoke()
# Read the prediction from the output tensor
output_data = interpreter.get_tensor(output_details[0]['index'])

Is there any other way to improve latency?

TensorFlow Lite uses delegates to improve the performance of TF Lite models at the Edge. A TF Lite delegate is a way to hand over parts of graph execution to a hardware accelerator like a GPU or a DSP (Digital Signal Processor).
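For example, an interpreter can be told to use an accelerator by loading a delegate for it. This is a sketch only, since it requires the delegate's shared library on the target device; the library name libedgetpu.so.1 here is the Coral Edge TPU case and is an assumption about your hardware:

```python
import tensorflow as tf

# Load a delegate for the target accelerator (device-specific library;
# 'libedgetpu.so.1' is the Coral Edge TPU example, an assumption here)
delegate = tf.lite.experimental.load_delegate('libedgetpu.so.1')

# Hand parts of graph execution over to the accelerator
interpreter = tf.lite.Interpreter(
    model_path='model.tflite',
    experimental_delegates=[delegate])
interpreter.allocate_tensors()
```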

TF Lite uses several hardware accelerators for speed, accuracy, and optimized power consumption, which are important features for running inferences at the Edge.

Conclusion: TF Lite models are lightweight models that can be deployed for low-latency inference on Edge devices like mobile phones, the Raspberry Pi, and microcontrollers. TF Lite delegates can further improve speed and power consumption when used with hardware accelerators.