Serving Deep Learning Model in Production using Fast and Efficient gRPC

Original article was published by Renu Khandelwal on Deep Learning on Medium


A quick and simple guide to serving a deep learning model using gRPC API

In this post, you will learn

  • What is gRPC?
  • How does gRPC work?
  • Benefits of gRPC
  • Difference between gRPC and REST API
  • How to implement a gRPC API using TensorFlow Serving to serve a model in production?

What is gRPC?

gRPC is a Remote Procedure Call framework developed by Google.

gRPC is a modern, open-source, high-performance RPC framework with low latency and high throughput. It uses HTTP/2 as its transport protocol and Protocol Buffers both as its Interface Definition Language (IDL) and as its underlying message interchange format.
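To make the IDL role concrete, a gRPC service and its messages are defined in a .proto file. The sketch below is a hypothetical definition for illustration only; TensorFlow Serving ships its own, more elaborate .proto files (predict.proto, prediction_service.proto).

```proto
// Hypothetical service definition for illustration.
syntax = "proto3";

service Predictor {
  // Unary RPC: one request in, one response out.
  rpc Predict (PredictRequest) returns (PredictResponse);
}

message PredictRequest {
  string model_name = 1;
  repeated float inputs = 2;
}

message PredictResponse {
  repeated float outputs = 1;
}
```

The protoc compiler generates client stub and server classes from this file in the language of your choice, which is how gRPC supports clients and servers written in different languages.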

How does gRPC work?

  • A gRPC channel is created that provides a connection to a gRPC server on a specified port.
  • The client invokes a method on the stub as if it were a local object; the server is notified of the client’s gRPC request.
  • gRPC uses Protocol Buffers to interchange messages between client and server. Protocol Buffers are a way to encode structured data in an efficient, extensible format.
  • Once the server receives the client’s request, it executes the method and sends the client’s response back with a status code and optional metadata.
  • gRPC allows clients to specify wait time to allow the server to respond before the RPC call is terminated.
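The steps above can be sketched with plain Python objects. This is a toy mock of the stub pattern, not the real grpc library: the client calls a method on a local stub, the stub forwards the request over a channel, and the server executes the method and returns a response with a status code.

```python
# Toy illustration of the gRPC stub pattern (not the real grpc library).

class MockServer:
    """Stands in for the remote gRPC server."""
    def handle(self, request):
        # The server executes the requested method and returns
        # a response together with a status code.
        return {"status": "OK", "prediction": sum(request["inputs"])}

class PredictionStub:
    """Stands in for the generated client stub."""
    def __init__(self, channel):
        # In real gRPC, the channel is a connection to host:port.
        self.channel = channel

    def Predict(self, request, timeout):
        # timeout: how long the client waits before the RPC is terminated.
        return self.channel.handle(request)

# The client calls Predict as if it were a local method.
stub = PredictionStub(MockServer())
response = stub.Predict({"inputs": [1, 2, 3]}, timeout=10.0)
print(response)  # {'status': 'OK', 'prediction': 6}
```

In real gRPC the request and response would be Protocol Buffer messages serialized over HTTP/2 rather than Python dictionaries, but the call pattern on the client side is the same.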

What are the benefits of using gRPC?

  • gRPC uses binary payloads, which are efficient to create and parse, making messages lightweight.
  • gRPC supports bi-directional streaming, which is not possible with a REST API.
  • gRPC is built on top of HTTP/2, supporting the traditional request-response model as well as bi-directional streaming.
  • Message transmission can be up to 10 times faster than with a REST API, as gRPC uses serialized Protocol Buffers and HTTP/2.
  • Loose coupling between client and server makes it easy to make changes.
  • gRPC allows integration of APIs programmed in different languages.

What’s the difference between gRPC and REST API?

  • Payload Format: REST uses JSON for exchanging messages between client and server, whereas gRPC uses Protocol Buffers. Protocol Buffers are compressed better than JSON, thus making gRPC transmit data over networks more efficiently.
  • Transfer Protocols: REST typically uses the textual HTTP/1.1 protocol, whereas gRPC is built on the newer binary HTTP/2 protocol, which compresses headers and parses more efficiently.
  • Streaming vs. Request-Response: REST supports the Request-Response model available in HTTP1.1. gRPC uses bi-directional streaming capabilities available in HTTP/2, where the client and server send a sequence of messages to each other using a read-write stream.
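The payload-size difference can be illustrated with a minimal sketch. This is not actual Protocol Buffers encoding; it simply packs the same record's fields into fixed-width binary with the standard-library struct module to show why a binary wire format is smaller than JSON text:

```python
import json
import struct

# The same record, encoded two ways.
record = {"id": 12345, "score": 0.9875, "label": 7}

# Textual JSON payload, as a REST API would send it.
json_payload = json.dumps(record).encode("utf-8")

# Binary payload: one 32-bit int, one 64-bit double, one 32-bit int.
# Similar in spirit (not in format) to a Protocol Buffers message.
binary_payload = struct.pack("<idi", record["id"], record["score"], record["label"])

print(len(json_payload), len(binary_payload))
```

The binary payload carries no field names or punctuation, so it is a fraction of the JSON size; real Protocol Buffers add varint encoding and field tags but remain far more compact than JSON.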

How to implement a gRPC API using Python for deep learning models?

Steps to create a gRPC API for a deep learning model using TF Serving

  1. Create the request payload from the client to the server as a Protocol Buffer (.proto) file. The client invokes the API through the stub.
  2. Run the docker image that exposes port 8500 for accepting the gRPC request and sending a response back to the client
  3. Run the server and client(s).

Implementing gRPC API

To implement a REST API using TensorFlow Serving, follow this blog.

For Windows 10, we will use a TensorFlow serving image.

Step 2: Pull the TensorFlow Serving Image

docker pull tensorflow/serving

Once you have the TensorFlow Serving image:

  • Expose Port 8500 for gRPC
  • Optional environment variable MODEL_NAME (defaults to model)
  • Optional environment variable MODEL_BASE_PATH (defaults to /models)

Step 3: Create and Train the Model

Here I have taken the MNIST dataset from TensorFlow datasets

#Importing required libraries
import os
import json
import tempfile
import requests
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

#Loading the MNIST train and test datasets
#as_supervised=True returns an (image, label) tuple instead of a dictionary
(ds_train, ds_test), ds_info = tfds.load("mnist", split=['train', 'test'], with_info=True, as_supervised=True)

#Converting the train and test datasets to numpy arrays so the
#'image' and 'label' can be selected using indexing
train_array = np.vstack(tfds.as_numpy(ds_train))
X_train = np.array(list(map(lambda x: x[0], train_array)))
y_train = np.array(list(map(lambda x: x[1], train_array)))
test_array = np.vstack(tfds.as_numpy(ds_test))
X_test = np.array(list(map(lambda x: x[0], test_array)))
y_test = np.array(list(map(lambda x: x[1], test_array)))

#Setting batch_size and epochs (assumed values)
batch_size = 128
epochs = 10

#Function to normalize the images from uint8 to float32
def normalize_image(image, label):
    return tf.cast(image, tf.float32) / 255., label

#Input data pipeline for the train dataset:
#normalize the images using map, then cache and shuffle the dataset.
#Create batches and prefetch to overlap image preprocessing (producer)
#and model execution (consumer)
ds_train = ds_train.map(normalize_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(batch_size)
ds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE)

#Input data pipeline for the test dataset (no need to shuffle)
ds_test = ds_test.map(normalize_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_test = ds_test.batch(batch_size)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.experimental.AUTOTUNE)

#Build the model: flatten the 28x28x1 input, one hidden layer,
#and a 10-class softmax output for the MNIST digits
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(196, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

#Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

#Fit the model
model.fit(ds_train, epochs=epochs, validation_data=ds_test)

Step 4: Save the Model

Save the model as a protocol buffer file by specifying save_format as “tf”.

#MODEL_DIR is assumed to be a temporary directory here
MODEL_DIR = tempfile.mkdtemp()
version = "1"
export_path = os.path.join(MODEL_DIR, str(version))

#Save the model
model.save(export_path, save_format="tf")
print('\nexport_path = {}'.format(export_path))
!dir {export_path}

You can examine the model using the saved_model_cli command.

!saved_model_cli show --dir {export_path} --all
Model inputs and outputs with their data types and size

Step 5: Serving the model using gRPC

Importing libraries for gRPC implementation

import grpc
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
from tensorboard.compat.proto import types_pb2

Establish the channel between the client and server using the gRPC port 8500, and create the client stub for the client to communicate with the server.

channel = grpc.insecure_channel('localhost:8500')  # assuming the server runs locally on port 8500
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

Create the request payload for the server as a Protocol Buffer by specifying the model name and model input, data type, and data size and shape.

request = predict_pb2.PredictRequest()
request.model_spec.name = 'mnist'
request.inputs['flatten_input'].CopyFrom(tf.make_tensor_proto(X_test[0], dtype=types_pb2.DT_FLOAT, shape=[28,28,1]))

If the data type and data size do not match the model input, you will get an error “input size does not match signature”.

To resolve this error, check the model input data type and size and match it with the request sent to gRPC.
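One way to catch the mismatch on the client side, before the request ever reaches the server, is a small validation helper. This is a hypothetical sketch (check_input is not part of TensorFlow Serving), assuming the model's signature expects a float32 tensor of shape (28, 28, 1):

```python
import numpy as np

def check_input(arr, expected_shape=(28, 28, 1), expected_dtype=np.float32):
    """Validate a request tensor against the model signature before the gRPC call."""
    if tuple(arr.shape) != expected_shape:
        # Mirrors the server-side "input size does not match signature" error
        raise ValueError(
            f"input size does not match signature: {arr.shape} != {expected_shape}")
    if arr.dtype != expected_dtype:
        # Cast rather than fail: uint8 image data is a common source of mismatch
        arr = arr.astype(expected_dtype)
    return arr

# A uint8 image of the right shape is accepted and cast to float32.
img = check_input(np.zeros((28, 28, 1), dtype=np.uint8))
print(img.dtype)  # float32
```

Running the check locally turns a round-trip gRPC error into an immediate, descriptive exception.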

Run the docker image that exposes port 8500 for accepting the gRPC request

docker run -p 8500:8500 --mount type=bind,source=C:\TF_serving\tf_model,target=/models/mnist/ -e MODEL_NAME=mnist -t tensorflow/serving

The source should be an absolute path.

The server is now ready to accept the client request

To predict the result of the request, call the Predict method from the stub

result = stub.Predict(request, 10.0)  # 10.0 is the timeout in seconds
#result is the response from the gRPC server
print("predicted output:", result)

Displaying the input image using matplotlib

import matplotlib.pyplot as plt
%matplotlib inline
img = X_test[0].reshape(28,28)
plt.imshow(img, cmap="gray")


gRPC is Google’s modern Remote Procedure Call framework, which can be up to roughly 10 times faster than a REST API. gRPC is built on HTTP/2 and uses Protocol Buffers to exchange messages, including bi-directional streams, between client and server efficiently.