Deploying a TensorFlow Model to Production made Easy.

Original article was published by Renu Khandelwal on Deep Learning on Medium

Deploy a Deep Learning Model to Production using TensorFlow Serving.

Learn, step by step, how to deploy a TensorFlow model to production using TensorFlow Serving.

You created a deep learning model using TensorFlow, fine-tuned it for better accuracy and precision, and now want to deploy it to production so that users can make predictions with it.

What’s the best way to deploy your model to production?

A fast, flexible way to deploy a TensorFlow deep learning model is to use a high-performing, highly scalable serving system: TensorFlow Serving.

TensorFlow Serving allows you to

  • Easily manage multiple versions of your model, such as an experimental and a stable version.
  • Keep your server architecture and APIs the same.
  • Dynamically discover a new version of the TensorFlow model and serve it over gRPC (a remote procedure call protocol) using a consistent API structure.
  • Give all clients a consistent inference experience by centralizing the location of the model.

What are the components of TensorFlow Serving that make deployment to production easy?

TensorFlow Serving Architecture

The key components of TF Serving are

  • Servables: A Servable is the underlying object used by clients to perform computation or inference. TensorFlow Serving represents a deep learning model as one or more Servables.
  • Loaders: Manage the lifecycle of the Servables, as Servables cannot manage their own lifecycle. Loaders standardize the APIs for loading and unloading Servables, independent of the specific learning algorithm.
  • Sources: Find and provide Servables, supplying one Loader instance for each version of a Servable.
  • Managers: Manage the full lifecycle of a Servable: loading it, serving it, and unloading it.
  • TensorFlow Serving Core: Manages the lifecycle and metrics of Servables, treating Loaders and Servables as opaque objects.

Let’s say you have two different versions of a model, version 1 and version 2.

  • The clients make an API call by either specifying a version of the model explicitly or just requesting the model’s latest version.
  • The Manager listens to the Sources and keeps track of all versions of the Servable. It then applies the configured version policy to determine which version of the model should be loaded or unloaded, and lets the Loader load the appropriate version.
  • The Loader contains all the metadata needed to load the Servable.
  • The Source plug-in creates an instance of Loader for each version of the Servable.
  • The Source uses a callback to notify the Manager of the Aspired Version, the version of the Servable that should be loaded and served to clients.
  • Whenever the Source detects a new version of the Servable, it creates a Loader pointing to the Servable on the disk.
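
The version-discovery step can be pictured as a small file-system scan. The sketch below is a toy illustration in Python, not TensorFlow Serving's own code. It assumes each version lives in a numerically named subdirectory of the model base path (as in the export step later in this article), and the function names find_versions and latest_version are hypothetical.

```python
import os
import tempfile


def find_versions(model_base_path):
    """Return the numeric version subdirectories of model_base_path, in ascending order."""
    versions = []
    for name in os.listdir(model_base_path):
        is_dir = os.path.isdir(os.path.join(model_base_path, name))
        if is_dir and name.isdigit():
            versions.append(int(name))
    return sorted(versions)


def latest_version(model_base_path):
    """Return the highest version number found, or None if there is none."""
    versions = find_versions(model_base_path)
    return versions[-1] if versions else None


# Usage: a throwaway base path containing versions 1 and 2 plus an unrelated folder
with tempfile.TemporaryDirectory() as base_path:
    for name in ("1", "2", "notes"):
        os.makedirs(os.path.join(base_path, name))
    print(find_versions(base_path))   # [1, 2]
    print(latest_version(base_path))  # 2
```

A client that asks for "the latest version" would, under this simplified picture, be served version 2, while the unrelated folder is ignored.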

How to deploy a deep learning model using TensorFlow Serving on Windows 10?

For Windows 10, we will use the TensorFlow Serving Docker image.

Step 2: Pull the TensorFlow Serving Image

docker pull tensorflow/serving

Once you have the TensorFlow Serving image, note that:

  • Port 8500 is exposed for gRPC
  • Port 8501 is exposed for the REST API
  • Optional environment variable MODEL_NAME (defaults to model)
  • Optional environment variable MODEL_BASE_PATH (defaults to /models)
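
These defaults determine where the container looks for a model and where clients reach it. The following sketch uses hypothetical helper names. It assumes the image combines the two variables as MODEL_BASE_PATH/MODEL_NAME and that the REST API follows the /v1/models/<name> URL scheme used later in this article.

```python
import posixpath

GRPC_PORT = 8500  # port exposed for gRPC
REST_PORT = 8501  # port exposed for the REST API


def container_model_path(model_name="model", model_base_path="/models"):
    """Directory inside the container where the model's version folders are expected."""
    return posixpath.join(model_base_path, model_name)


def rest_model_url(model_name="model", host="localhost", port=REST_PORT):
    """Base REST URL for a model served by the container."""
    return "http://{}:{}/v1/models/{}".format(host, port, model_name)


print(container_model_path())                    # /models/model
print(container_model_path(model_name="mnist"))  # /models/mnist
print(rest_model_url("mnist"))                   # http://localhost:8501/v1/models/mnist
```

This is why a model named mnist is mounted at /models/mnist in the docker run command of Step 5.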

Step 3: Create and Train the Model

Here I have taken the MNIST dataset from TensorFlow datasets

# Import the required libraries
import os
import json
import tempfile
import requests
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

# Load the MNIST train and test datasets
# as_supervised=True returns an (image, label) tuple instead of a dictionary
(ds_train, ds_test), ds_info = tfds.load("mnist", split=['train', 'test'], with_info=True, as_supervised=True)

# Convert the train and test datasets to NumPy arrays of images and labels
train_examples = list(tfds.as_numpy(ds_train))
X_train = np.array([image for image, label in train_examples])
y_train = np.array([label for image, label in train_examples])
# The test arrays must come from ds_test, not from the training data
test_examples = list(tfds.as_numpy(ds_test))
X_test = np.array([image for image, label in test_examples])
y_test = np.array([label for image, label in test_examples])

# Set batch_size and epochs
# (the values are missing from the source; 128 and 10 are assumed here)
batch_size = 128
epochs = 10

# Create the input data pipelines for the train and test datasets
# Function to normalize the images
def normalize_image(image, label):
    # Normalizes images from uint8 to float32
    return tf.cast(image, tf.float32) / 255., label

# Input data pipeline for the train dataset
# Normalize the images using map, then cache and shuffle the train dataset
# Batch the training dataset, then prefetch to overlap image preprocessing (producer) with model execution (consumer)
ds_train = ds_train.map(normalize_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(batch_size)
ds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE)

# Input data pipeline for the test dataset (no need to shuffle the test dataset)
ds_test = ds_test.map(normalize_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_test = ds_test.batch(batch_size)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.experimental.AUTOTUNE)

# Build the model
# (196 output units are kept from the source; MNIST itself has only 10 classes)
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(196, activation='softmax')
])

# Compile the model
# (arguments are missing from the source; a common setup for integer labels is assumed)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Fit the model
model.fit(ds_train, epochs=epochs, validation_data=ds_test)

Step 4: Save the Model

Save the model as a protocol buffer file by specifying save_format as “tf”.

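# MODEL_DIR is not defined in the source; it is assumed here to be the
# folder that Step 5 mounts into the container
MODEL_DIR = r"C:\TF_serving\tf_model"
version = "1"
export_path = os.path.join(MODEL_DIR, str(version))
# Save the model in the TensorFlow SavedModel format
model.save(export_path, save_format="tf")
print('\nexport_path = {}'.format(export_path))
!dir {export_path}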

When we save a version of the model, we can see the following directories containing files:

  • saved_model.pb: Contains the serialized graph definition of one or more models, along with the model metadata, as a MetaGraphDef protocol buffer. Weights and variables are stored in separate checkpoint files.
  • variables: Files that hold the standard training checkpoint.
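
Before pointing the server at an export, it can help to confirm that both items are present. A minimal sketch, assuming only the two names listed above; the helper looks_like_saved_model is hypothetical and checks the directory layout, not the validity of the model itself.

```python
import os
import tempfile


def looks_like_saved_model(export_path):
    """Check for the saved_model.pb file and the variables directory."""
    has_graph = os.path.isfile(os.path.join(export_path, "saved_model.pb"))
    has_variables = os.path.isdir(os.path.join(export_path, "variables"))
    return has_graph and has_variables


# Usage with a stand-in directory; in practice, pass the real export_path
with tempfile.TemporaryDirectory() as fake_export:
    print(looks_like_saved_model(fake_export))  # False
    open(os.path.join(fake_export, "saved_model.pb"), "wb").close()
    os.makedirs(os.path.join(fake_export, "variables"))
    print(looks_like_saved_model(fake_export))  # True
```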

You can examine the model using the saved_model_cli command.

!saved_model_cli show --dir {export_path} --all

Step 5: Serve the Model using TensorFlow Serving

Open Windows PowerShell and execute the following command to start the TensorFlow Serving container, which serves the TensorFlow model on the REST API port.

docker run -p 8501:8501 --mount type=bind,source=C:\TF_serving\tf_model,target=/models/mnist/ -e MODEL_NAME=mnist -t tensorflow/serving 

To successfully serve the TensorFlow model with Docker:

  • Publish port 8501, the REST API port, using -p.
  • Use --mount to bind the model base path on the host, which should be an absolute path, to the location in the container where the server looks for the model.
  • Specify MODEL_NAME, the name clients will use to call the model.
  • Allocate a pseudo-terminal using the -t option, followed by the image name, tensorflow/serving.
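
Once the container is running, it can be useful to confirm that the model has loaded before sending predictions. The sketch below assumes the server's model status endpoint (a GET on /v1/models/mnist) returns a model_version_status list, as described in the TensorFlow Serving REST API documentation. The helper names are hypothetical, and the sample response is illustrative rather than captured output.

```python
import json
import urllib.error
import urllib.request


def fetch_model_status(url="http://localhost:8501/v1/models/mnist", timeout=5):
    """GET the model status; return the parsed JSON, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return None


def available_versions(status):
    """Return the version strings whose state is AVAILABLE."""
    if not status:
        return []
    return [
        entry["version"]
        for entry in status.get("model_version_status", [])
        if entry.get("state") == "AVAILABLE"
    ]


# Illustrative response shape (not real output from the container)
sample_status = {
    "model_version_status": [
        {"version": "1", "state": "AVAILABLE", "status": {"error_code": "OK", "error_message": ""}}
    ]
}
print(available_versions(sample_status))  # ['1']
```

With the container from this step running, available_versions(fetch_model_status()) should list the version that is being served; an empty list suggests the model has not loaded or the server is not reachable.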

Step 6: Make a REST Request to the Model to Predict

We will create a JSON object to pass the data for prediction.

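# Create the JSON object
# The model was trained on pixel values scaled to [0, 1], so scale the raw test images the same way
data = json.dumps({"signature_name": "serving_default", "instances": (X_test[:20] / 255.0).tolist()})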

Request the model’s predict method as a POST to the server’s REST endpoint.

headers = {"content-type": "application/json"}
json_response = requests.post('http://localhost:8501/v1/models/mnist:predict', data=data, headers=headers)
predictions = json.loads(json_response.text)['predictions']
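
The request above targets whichever version the server treats as the latest. To pin a request to a specific version, the TensorFlow Serving REST API accepts a /versions/<number> segment before :predict. The helper below is a hypothetical convenience for building either form of the URL; it only builds strings and sends nothing.

```python
def predict_url(model_name, version=None, host="localhost", port=8501):
    """Build the REST predict URL, optionally pinned to a model version."""
    base = "http://{}:{}/v1/models/{}".format(host, port, model_name)
    if version is not None:
        base += "/versions/{}".format(version)
    return base + ":predict"


print(predict_url("mnist"))             # http://localhost:8501/v1/models/mnist:predict
print(predict_url("mnist", version=1))  # http://localhost:8501/v1/models/mnist/versions/1:predict
```

Passing predict_url("mnist", version=1) to requests.post in place of the literal URL would request predictions from version 1 only.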

Checking the accuracy of the prediction

# The index of the highest score in each row is the predicted digit
pred = [int(np.argmax(p)) for p in predictions]
print("Predictions: ",pred)
print("Actual: ",y_test[:20].tolist())
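
Printing both lists leaves the comparison to the eye. Below is a small sketch of turning it into a number, using a hypothetical accuracy helper and made-up score rows in place of the server's response.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(predictions, labels):
    """Fraction of rows whose highest-scoring class matches the true label."""
    if not labels:
        return 0.0
    predicted = [argmax(row) for row in predictions]
    correct = sum(1 for p, y in zip(predicted, labels) if p == y)
    return correct / len(labels)


# Made-up scores for three samples over three classes (not real model output)
toy_predictions = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.2, 0.3, 0.5]]
toy_labels = [1, 0, 1]
print(accuracy(toy_predictions, toy_labels))  # 0.6666666666666666
```

With the real response, accuracy(predictions, y_test[:20].tolist()) gives the fraction of the 20 test digits classified correctly.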

In the next article, we will explore the different model server configurations.


TensorFlow Serving is a fast, flexible, highly scalable, and easy-to-use way to serve your production model using consistent gRPC or REST APIs.