How to Ship Machine Learning Models into Production with TensorFlow Serving and Kubernetes

Source: Deep Learning on Medium

Learn how to ship in this 5-minute read on TensorFlow Serving

by Vidar Nordli-Mathisen

In a previous series, I covered the nuts and bolts of developing machine learning algorithms with TensorFlow. Now it’s time to demonstrate how to release our models into production.

In this post, you’ll learn:

  • Why we use TensorFlow Serving
  • How to release models through TensorFlow Serving, with or without GPUs
  • How to deploy models into Kubernetes

Last week, I held a workshop in which I demonstrated how to create a text classifier with the NLP framework Kashgari in 15 minutes. Here is the full source code.

Today, I will use this same code base to show you how to build this text classifier, and we will deploy the model in a couple of ways.

Preparing the Model

First and foremost, let’s get dependencies in place:

wget chinese_L-12_H-768_A-12.zip
unzip chinese_L-12_H-768_A-12.zip
sudo pip3 install jieba pandas kashgari tensorflow-gpu==1.14.0
# the default version is too old on AWS
sudo pip3 install --upgrade --force-reinstall scipy

Now we can start training the text classifier model built with Kashgari:

import tensorflow as tf
import pandas as pd
import kashgari
from kashgari.embeddings import BERTEmbedding
from kashgari.tasks.classification import BiLSTM_Model

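# enable the cuDNN-backed LSTM cell (GPU only)
kashgari.config.use_cudnn_cell = True

# BERT model path
BERT_PATH = 'chinese_L-12_H-768_A-12'

# Embeddings (task inferred from the classification use case;
# any further arguments are left at their defaults)
embed = BERTEmbedding(BERT_PATH, task=kashgari.CLASSIFICATION)

tokenizer = embed.tokenizer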

df = pd.read_csv('weibo_senti_100k.csv')
# tokenizer
df['cutted'] = df['review'].apply(lambda x: tokenizer.tokenize(x))

# data prep
train_x = list(df['cutted'][:int(len(df)*0.7)])
train_y = list(df['label'][:int(len(df)*0.7)])

valid_x = list(df['cutted'][int(len(df)*0.7):int(len(df)*0.85)])
valid_y = list(df['label'][int(len(df)*0.7):int(len(df)*0.85)])

test_x = list(df['cutted'][int(len(df)*0.85):])
test_y = list(df['label'][int(len(df)*0.85):])

# model
model = BiLSTM_Model(embed)
model.fit(train_x, train_y, valid_x, valid_y, batch_size=1024, epochs=1)
model.evaluate(test_x, test_y, batch_size=512)

random_stuff = ['高兴','好难过','这个好简单','真的是折腾']
model.predict([tokenizer.tokenize(i) for i in random_stuff])

Save the model in SavedModel (such a unique name, lol) format:

model.save('bert_model')
kashgari.utils.convert_to_saved_model(model, 'tf_bert_model', version=1)

This method is specific to the Kashgari library, so you have to apply your own framework-specific save method if you are using something else.
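
TensorFlow Serving scans the model directory for numbered version subdirectories, each holding a saved_model.pb file, and by default serves the highest version it finds. The following is a minimal sketch for checking that layout before starting the server. The helper name find_servable_versions is made up for illustration, and it only inspects the file system; it does not validate the model itself.

```python
from pathlib import Path
from typing import List


def find_servable_versions(model_dir: str) -> List[int]:
    """Return the numeric version subdirectories that contain a SavedModel.

    Hypothetical helper: it only checks for <model_dir>/<N>/saved_model.pb
    (or saved_model.pbtxt), the layout TensorFlow Serving looks for.
    """
    versions: List[int] = []
    root = Path(model_dir)
    if not root.is_dir():
        return versions
    for child in root.iterdir():
        # Version directories are named with plain integers, e.g. "1".
        if not (child.is_dir() and child.name.isdigit()):
            continue
        has_pb = (child / "saved_model.pb").is_file()
        has_pbtxt = (child / "saved_model.pbtxt").is_file()
        if has_pb or has_pbtxt:
            versions.append(int(child.name))
    return sorted(versions)


if __name__ == "__main__":
    found = find_servable_versions("tf_bert_model")
    if found:
        print("servable versions:", found, "latest:", found[-1])
    else:
        print("no SavedModel found under tf_bert_model/")
```

With the export above, this should report version 1, since the model was written with version=1.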

Now the model is ready, and it’s time to serve!

TensorFlow Serving

What is TensorFlow Serving? Here is the official definition:

TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments.

And it has tons of features:

  • Serves multiple versions. TensorFlow Serving works with multiple models, or multiple versions of the same model simultaneously.
  • Multiple protocols. Exposes both gRPC as well as HTTP inference endpoints.
  • Ease of deployment. Allows deployment of new model versions without changing any client code.
  • Canary releases. Supports canarying of new versions and A/B testing of experimental models.
  • Low latency. Adds minimal latency to inference time due to efficient, low-overhead implementation.
  • Advanced scheduler. Features a scheduler that groups individual inference requests into batches for joint execution on GPU, with configurable latency controls.
  • One ring to rule them all! Supports many servables: TensorFlow models, embeddings, vocabularies, feature transformations, and even non-TensorFlow-based machine learning models.

Why do we use TensorFlow Serving?

Well, if you’re not using it, you have to build your own pipeline to roll out each new model version as it becomes available.

In addition, your training infrastructure may differ from your inference infrastructure. In that case, you’d have to adapt the model by hand and take extra steps to optimize performance.

In short, you don’t have to pull your hair out over all that menial work; you can just use TensorFlow Serving!

Though TensorFlow Serving can run on its own, it’s much more convenient to use with Docker. Even in official documentation, they skip introducing the standalone version; so will I. Let’s jump to the Docker version!

Docker, With or Without GPU

Docker is fast to set up, hygienic and doesn’t pollute the host system. To release a GPU model with Docker:

docker run --rm --runtime=nvidia -p 8501:8501 \
-v `pwd`/tf_bert_model:/models/tf_bert_model \
-e MODEL_NAME=tf_bert_model -t tensorflow/serving:1.14.0-gpu

And if you don’t have a GPU available for inference, you can use a CPU instead. However, the example above was trained with the CuDNN cell, which makes CPU inference impossible for that model.

You have to make the model compatible with CPU instructions; namely, you can’t use CuDNN or some other fancy GPU-only instructions for your model.

Once the training is done, all you have to do is drop the --runtime=nvidia flag and the -gpu suffix from the image tag:

docker run --rm -p 8501:8501 \
-v `pwd`/tf_bert_model:/models/tf_bert_model \
-e MODEL_NAME=tf_bert_model -t tensorflow/serving:1.14.0

Voila! It’s ready. The model is up and Serving (pun intended).
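
One way to confirm that the server has actually loaded the model is TensorFlow Serving’s model status endpoint, GET /v1/models/<name>. Here is a minimal sketch that uses only the standard library. The helper names are my own, and the host and port assume the docker run port mapping shown above.

```python
import json
import urllib.request
from typing import Any, Dict


def status_url(model_name: str, host: str = "localhost", port: int = 8501) -> str:
    """Build the model status URL of the TensorFlow Serving REST API."""
    return f"http://{host}:{port}/v1/models/{model_name}"


def is_available(status: Dict[str, Any]) -> bool:
    """True if any version in a model status response is in state AVAILABLE."""
    versions = status.get("model_version_status", [])
    return any(v.get("state") == "AVAILABLE" for v in versions)


def check_model(model_name: str, timeout: float = 5.0) -> bool:
    """Query the running server (assumed to listen on localhost:8501)."""
    with urllib.request.urlopen(status_url(model_name), timeout=timeout) as resp:
        return is_available(json.load(resp))
```

Once the container is running, check_model("tf_bert_model") should return True; if the server is not reachable, urlopen raises an exception instead.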

Time to write the client code:

import kashgari
import numpy as np
import requests
from kashgari import utils
from kashgari.embeddings import BERTEmbedding

random_stuff = ['好高兴', '好欢乐', '心情低落']

BERT_PATH = 'chinese_L-12_H-768_A-12'

# Embeddings (task inferred from the classification use case)
embed = BERTEmbedding(BERT_PATH, task=kashgari.CLASSIFICATION)
tokenizer = embed.tokenizer

processor = utils.load_processor(model_path='tf_bert_model/1')

tensor = processor.process_x_dataset([tokenizer.tokenize(i) for i in random_stuff])

tensor = [{
"Input-Token:0": i.tolist(),
"Input-Segment:0": np.zeros(i.shape).tolist()
} for i in tensor]

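# POST the instances to the TensorFlow Serving REST predict endpoint
# (fill in the endpoint URL)
r = requests.post("",
                  json={"instances": tensor})
preds = r.json()['predictions']
label_index = np.array(preds).argmax(-1)

labels = processor.reverse_numerize_label_sequences(label_index)
print(dict(zip(random_stuff, labels)))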

This is the client code; it calls the TensorFlow Serving REST endpoint. First, we tokenize each phrase with the BERT tokenizer, then we convert the tokens into the input format the BERT model expects. Finally, we POST the formatted inputs to the TensorFlow Serving server and read back the predictions.
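
The REST call itself can be sketched with the standard library alone. TensorFlow Serving’s predict endpoint has the form /v1/models/<name>:predict, optionally with /versions/<N> before the colon. The host and port below are assumptions based on the docker run port mapping above, the helper names are hypothetical, and the input keys are the ones used in the client code. Note that this sketch sends integer zeros for the segment input, whereas np.zeros produces floats.

```python
import json
import urllib.request
from typing import Dict, List, Optional


def predict_url(model_name: str, host: str = "localhost", port: int = 8501,
                version: Optional[int] = None) -> str:
    """Build the predict URL of the TensorFlow Serving REST API."""
    path = f"/v1/models/{model_name}"
    if version is not None:
        path += f"/versions/{version}"
    return f"http://{host}:{port}{path}:predict"


def build_instances(token_ids: List[List[int]]) -> List[Dict[str, List[int]]]:
    """One instance per sentence: token ids plus an all-zero segment vector."""
    return [
        {"Input-Token:0": ids, "Input-Segment:0": [0] * len(ids)}
        for ids in token_ids
    ]


def argmax_labels(predictions: List[List[float]]) -> List[int]:
    """Index of the highest score in each prediction row."""
    return [max(range(len(row)), key=row.__getitem__) for row in predictions]


def predict(model_name: str, token_ids: List[List[int]],
            timeout: float = 10.0) -> List[int]:
    """POST the instances and return the predicted label indices."""
    body = json.dumps({"instances": build_instances(token_ids)}).encode("utf-8")
    req = urllib.request.Request(
        predict_url(model_name),
        data=body,  # passing data makes this a POST request
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return argmax_labels(json.load(resp)["predictions"])
```

The token ids still have to come from the same processor that was saved with the model; this sketch only covers the transport.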

Although the Docker container is clean and convenient, it can’t solve scaling problems — at least not on its own. We need a container orchestration framework for that.

Let’s see how we can release models with the best container orchestration system, Kubernetes.

Deploying Models with Kubernetes

Kubernetes (“koo-burr-NET-eez”) is the conventional pronunciation of a Greek word, κυβερνήτης, meaning “helmsman” or “pilot”. It is literally a pilot for a sea of containers.

Kubernetes, with its countless, awesome features, is becoming the de facto habitat for production applications.

Going back to our model, we can take things one step further: we can release the model into Kubernetes.

Building and Pushing the Image

To make it scalable and future-safe, we can build a fat image with the TensorFlow Serving base image:

FROM tensorflow/serving:1.14.0
ADD tf_bert_model /models/tf_bert_model
# gRPC port 8500
# REST port 8501
EXPOSE 8500 8501

The recommended way to build and push images varies from platform to platform.

I use Google Cloud for my personal projects. To push images into the Google Cloud container registry, you can use a Cloud Build script.

Here is my script:


steps:
- name: ''
  args: [ 'build', '-t', '$PROJECT_ID/kashgari-workshop:latest', '.' ]
images:
- '$PROJECT_ID/kashgari-workshop'

And to trigger the building process, use this command:

gcloud builds submit

It’s quite handy, actually. Now the Docker image is ready. Time to deploy.

Releasing into the Kubernetes Cluster

And here is the deployment file for our image:

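# deployment.yaml
# extensions/v1beta1 was removed in Kubernetes 1.16; newer clusters need apps/v1 and spec.selector
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tfserving
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: tfserving  # illustrative label; the Deployment selector defaults to it
    spec:
      containers:
      - name: tfserving
        # swap in the image you built and pushed above, which bundles the model
        image: tensorflow/serving:1.14.0
        imagePullPolicy: Always
        env:
        - name: MODEL_NAME
          value: tf_bert_model
        ports:
        - containerPort: 8501
          name: http-tfserving
          protocol: TCP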

One thing to remember: MODEL_NAME must match the directory name under /models that the Dockerfile copies the model into.
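
The names have to match because the stock tensorflow/serving image starts the server with a model base path of ${MODEL_BASE_PATH}/${MODEL_NAME}, where MODEL_BASE_PATH defaults to /models. A small sketch of that rule follows; the function names are hypothetical and nothing here talks to Docker or Kubernetes.

```python
import posixpath


def served_model_path(model_name: str, model_base_path: str = "/models") -> str:
    """Directory the stock tensorflow/serving entrypoint loads the model from."""
    return posixpath.join(model_base_path, model_name)


def names_are_consistent(model_name_env: str, dockerfile_dest: str) -> bool:
    """Compare the MODEL_NAME env value with the Dockerfile ADD destination."""
    return posixpath.normpath(dockerfile_dest) == served_model_path(model_name_env)


if __name__ == "__main__":
    # Values taken from the Dockerfile and deployment file in this post.
    print(names_are_consistent("tf_bert_model", "/models/tf_bert_model"))
```

If the two disagree, the server looks in a directory that does not exist inside the image and never finds a servable version.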

All set, time to deploy:

kubectl create -f deployment.yaml

Once the Pod is ready, we can forward Pod port 8501 to our local machine port 8501:

kubectl port-forward POD_NAME 8501:8501

Now we can invoke our model, deployed with Kubernetes, with the same code we wrote earlier.

Yeah, it is easy and fun.

To avoid an excessively long post, I skipped lots of details. Leave a comment below if you have any questions.

Now that we covered how to ship machine learning models into production, let’s recap the ground you’ve gained in my series so far. You now know:

Now the circle is complete. I hope I’ve shed some light on your journey of creating and serving machine learning models.
