Serving a deep learning model in production using tensorflow-serving

Source: Deep Learning on Medium


In layman's terms, serving means exposing client-server software over a host (IP address:port) so that clients can easily access the software or data behind it. Serving makes it easy to productize any software, which here is a deep learning model. However, easy productization doesn't imply easy setup: while going through the tedious process of serving a DL model, I ran into a number of errors and issues. Since good documentation is essential for making DL serving approachable, I decided to write this article.

Tensorflow Serving

Serving a trained model over a host, so that a client can ask the host to predict a result for a given input, is called model serving. Machine learning models are usually large, so loading the model for every prediction wastes both time and memory. Model serving avoids this, and it also allows multiple users to access the same model.
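The benefit above can be illustrated with a toy sketch in pure Python (no TensorFlow involved): a "served" model is loaded once at startup and reused across requests, while the naive approach pays the load cost on every call. The load_model function and its cost are hypothetical stand-ins, not real TensorFlow APIs.

```python
import time

LOAD_COST = 0.01  # pretend loading the model takes this many seconds


def load_model():
    """Hypothetical stand-in for deserializing a large model from disk."""
    time.sleep(LOAD_COST)
    return lambda x: x * 2  # a trivial "model"


# Naive approach: reload the model for every single prediction.
def predict_naive(x):
    model = load_model()  # load cost paid on every call
    return model(x)


# Serving approach: load once, then reuse for every request.
class ModelServer:
    def __init__(self):
        self.model = load_model()  # load cost paid exactly once

    def predict(self, x):
        return self.model(x)


server = ModelServer()
start = time.time()
results = [server.predict(i) for i in range(100)]
served_time = time.time() - start  # no per-request reload cost
```

With serving, 100 predictions cost one model load; the naive loop would have cost 100.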

TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. TensorFlow Serving makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs. TensorFlow Serving provides out-of-the-box integration with TensorFlow models, but can be easily extended to serve other types of models and data.

How does it work?


A TensorFlow model is served using the TensorFlow Serving docker image, and the API is exposed on the specified host address. The client sends the input and requests the predicted result from the API. A Servable is the object (the model) that the client uses for computation. The size and granularity of a Servable is flexible: a single Servable might include anything from a single shard of a lookup table to a single model to a tuple of inference models.

Any request from the client is received by the Manager of TensorFlow Serving. The Manager is responsible for loading, serving and unloading Servables.

Loaders manage a Servable’s life cycle. The Loader API enables common infrastructure independent from specific learning algorithms, data or product use-cases involved. Specifically, Loaders standardize the APIs for loading and unloading a Servable.

Sources are plugin modules that find and provide servables. Each Source provides zero or more servable streams. For each servable stream, a Source supplies one Loader instance for each version it makes available to be loaded. (A Source is actually chained together with zero or more SourceAdapters, and the last item in the chain emits the Loaders.) TensorFlow Serving’s interface for Sources can discover servables from arbitrary storage systems. TensorFlow Serving includes common reference Source implementations. For example, Sources may access mechanisms such as RPC and can poll a file system.

Aspired versions represent the set of servable versions that should be loaded and ready. Sources communicate this set of servable versions for a single servable stream at a time.

Managers listen to Sources and track all versions. The Manager tries to fulfill Sources’ requests, but may refuse to load an aspired version if, say, required resources aren’t available. Managers may also postpone an “unload”. For example, a Manager may wait to unload until a newer version finishes loading, based on a policy to guarantee that at least one version is loaded at all times.
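As a rough illustration of the "at least one version loaded at all times" policy described above, here is a pure-Python sketch (this is not TensorFlow Serving code; class and method names are made up for illustration): the manager postpones unloading the old version until the newly aspired version has finished loading.

```python
class VersionManager:
    """Toy sketch of an availability-preserving version manager."""

    def __init__(self):
        self.loaded = set()          # versions currently serving
        self.pending_unload = set()  # versions waiting to be retired

    def set_aspired(self, new_version, old_version=None):
        """A Source aspires a new version; optionally retire an old one."""
        if old_version is not None:
            # Postpone the unload: keep serving old_version for now.
            self.pending_unload.add(old_version)
        self._load(new_version)

    def _load(self, version):
        self.loaded.add(version)  # new version is now ready
        # Only after the load succeeds do we honor pending unloads, so
        # at least one version stayed loaded throughout the transition.
        for v in list(self.pending_unload):
            self.loaded.discard(v)
        self.pending_unload.clear()


manager = VersionManager()
manager.set_aspired(1)                 # v1 is loaded and serving
manager.set_aspired(2, old_version=1)  # v2 loads before v1 unloads
```

The key design choice is ordering: the unload of v1 is deferred until after v2 is marked loaded, so a client never observes an empty set of versions.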

To learn more about the internals, see the official TensorFlow Serving architecture documentation.

How to serve a model using tensorflow serving?

Starting off, we need to pull the docker image containing tensorflow serving using the command below:

docker pull tensorflow/serving

After pulling the tensorflow serving image, we are ready to serve our tensorflow model using the command below:

sudo docker run -p 8500:8500 --name tf_container_name --mount type=bind,source=/path/to/model,target=/models/modelname -e MODEL_NAME=modelname -t tensorflow/serving &

We have to ensure that the tag provided while saving the model is 'serve', and that the model directory contains a numeric version subdirectory holding the variables directory and the .pb file of the saved model. All dependencies should also be installed before serving the model. The gRPC API is served on port 8500 and the REST API on port 8501 (map port 8501 as well, with -p 8501:8501, to expose it).
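For reference, the mounted model directory is expected to look like this (the version subdirectory name must be an integer; "1" here is just an example):

```
/path/to/model/
└── 1/                          # one subdirectory per model version
    ├── saved_model.pb          # serialized graph, saved with tag 'serve'
    └── variables/
        ├── variables.data-00000-of-00001
        └── variables.index
```

TensorFlow Serving watches this directory and picks up new version subdirectories as they appear.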

Now, for the client side, we use the gRPC service and write a Python script to send the input and receive the output of the model.

We need to import the tensorflow_serving API modules 'predict_pb2' and 'prediction_service_pb2_grpc', ensuring that they are present under the path tensorflow_serving.apis. We also need grpc and tensorflow itself.

import grpc
import tensorflow as tf

from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

Now, we have all the initial setup done to access the port on which the model is served. We establish a channel with the gRPC server and create a stub for prediction.

channel = grpc.insecure_channel(FLAGS.server)  # FLAGS.server holds "host:port", e.g. "localhost:8500"
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

Next, we create a prediction request object and specify the model name and signature name. The model name must be the same as the one declared during serving (the MODEL_NAME environment variable), and the inputs must be provided in the format expected by the TensorFlow model.

request = predict_pb2.PredictRequest()
request.model_spec.name = 'model_name'
request.model_spec.signature_name = 'serving_default'
# The key ('input' here) must match the input name in the model's signature.
request.inputs['input'].CopyFrom(
    tf.contrib.util.make_tensor_proto(input, shape=input_shape))

We are now ready to predict. We call the Predict function of the stub we created, providing a timeout value in seconds.

result = stub.Predict(request, 30.0)

The result variable will contain the prediction result of the tensorflow model.

Now imagine v1 and v2 of the model are generated dynamically at runtime, as new algorithms are experimented with or as the model is trained on a new data set. In a production environment, we may want to build a server that supports gradual rollout, in which v2 can be discovered, loaded, experimented with, monitored, or reverted while still serving v1. Alternatively, we may want to tear down v1 before bringing up v2. TensorFlow Serving supports both options: the first is good for maintaining availability during the transition, while the second minimizes resource usage (e.g. RAM).

Whenever a new version is available, the AspiredVersionsManager loads the new version, and under its default behavior unloads the old one.

Hope this helped!