Original article was published by Sachin Sharma on Deep Learning on Medium

How to deploy (almost) any Hugging Face model on NVIDIA Triton Inference Server with an application to Zero-Shot-Learning for Text Classification


In this blog post, we examine Nvidia’s Triton Inference Server (formerly known as TensorRT Inference Server), which simplifies the deployment of AI models at scale in production. We mainly focus on hosting Transformer language models like BERT, GPT2, BART, and RoBERTa (a multilingual natural language inference model). Afterward, to solve the problem of zero-shot text classification, we deploy Hugging Face’s RoBERTa model on the Triton server; once it is deployed, we can make inference requests and get back predictions. Setting this up generally means passing two hurdles: 1) setting up our own inference server, and 2) writing a Python client-side script that communicates with the inference server to send requests (in our case, text) and get back predictions or text feature embeddings.


  1. Nvidia CUDA-enabled GPU: for this blog post I am using a GeForce RTX 2080 Nvidia GPU with around 12 GB of memory.
  2. Nvidia Docker
  3. Triton Client libraries for communication with Triton inference server
  4. PyTorch

Basic Introduction (Why do we need Nvidia’s Triton Inference Server)

Image: the capability of the Triton server to host multiple heterogeneous deep learning frameworks.

What attracted all of us (the AI team of Define Media) the most is the capability of the Triton inference server to host/deploy trained models from any framework (whether it is TensorFlow, TensorRT, PyTorch, Caffe, ONNX Runtime, or some custom framework) from local storage, Google Cloud Platform, or AWS S3 on any GPU- or CPU-based infrastructure (cloud, data center, or edge). Model checkpoints can be optimized/compressed before serving (quantization and pruning in the case of PyTorch models), which decreases the memory footprint on the GPU and makes it more memory-efficient to serve multiple models on the same GPUs.

Anecdote: TensorFlow is the most popular framework in our team, but our recent interest in text processing and the growing popularity of Hugging Face transformer models have led us to dig deeper into PyTorch models, especially for text processing (although I have been using the PyTorch framework throughout my research at the university). Therefore, we need one common platform where we can host multiple trained models based on different frameworks without compromising too much on throughput and latency across the various model types. Nvidia’s Triton Inference Server fits this need well.

Some more features of Triton:

  1. Concurrent model execution support: Multiple models (or multiple instances of the same model) can run simultaneously on the same GPU.
  2. Batching support: Triton can handle a batch of input requests and return the corresponding batch of predictions.
  3. Ensemble support
  4. Multi-GPU support. Triton can distribute inferencing across all system GPUs.
  5. Model repositories may reside on a locally accessible file system (e.g. NFS), in Google Cloud Storage, or in Amazon S3. (This feature plays a very important role if we want to deploy triton server on the cloud and make inference requests via the most popular Lambda Functions in case of AWS)
  6. Metrics indicating GPU utilization, server throughput, and server latency. The metrics are provided in the Prometheus data format.
  7. Model version support

For a detailed description, the reader can go through this documentation: Nvidia Triton’s official documentation.

Part1- Setting up our own TRITON Inference Server

So, now we have a basic understanding of why we need a Triton inference server. Let’s start by setting up a Triton server locally by following the steps below.

  1. We are gonna use the Prebuilt Docker Container available from the NVIDIA GPU Cloud (NGC). For more information, see Using A Prebuilt Docker Container.

2. Clone the Triton Inference Server GitHub repository if you need an example model repository (this is where all our trained models will be stored). On the repository page, select the Clone or download drop-down button (this will also download some pre-trained models structured in the manner Triton expects). After cloning the repo, be sure to select the r<xx.yy> release branch that corresponds to the version of Triton you want to use: git checkout r20.06

After cloning, you can find pre-trained models under server → docs →examples →model_repository

3. Use docker pull to get the Triton Inference Server container from NGC:

docker pull<xx.yy>-py3

Where <xx.yy> is the version of Triton that you want to pull. To follow along, please pull version 20.06-py3; otherwise you might face dependency issues with the Triton client libraries. Once the image is pulled, you can list it with the command:

sudo docker images

4. It’s that easy!

Note: The above are the minimum steps needed to create a Triton server. For detailed installation steps, you can take a look at the official docs here: Installing triton. I hope, though, that this blog gives standalone instructions that are enough to get started with minimal struggle.

Instantiate Triton Inference Server

$ docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/full/path/to/example/model/repository:/models <docker image> tritonserver --model-repository=/models

Where <docker image> is <xx.yy>-py3 if you pulled the Triton container from NGC. The -v flag mounts your model repository, where all your models are stored as shown above, into the container.

Verify Triton is running Correctly

curl -v localhost:8000/v2/health/ready

The expected output should be (by default, Triton serves HTTP requests on port 8000):

< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
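The same readiness probe can be issued from Python, which is handy when a script should only start sending requests once the server is up. This is a minimal sketch using only the standard library; the function names ready_url and is_ready are my own for illustration, and the defaults assume the server runs on localhost with the HTTP port 8000 used in the curl command.

```python
import urllib.request


def ready_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the readiness URL that the curl command above queries."""
    return f"http://{host}:{port}/v2/health/ready"


def is_ready(host: str = "localhost", port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if the server answers the readiness probe with HTTP 200."""
    try:
        with urllib.request.urlopen(ready_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeout, or an HTTP error status: treat as not ready.
        return False
```

Calling is_ready() returns True only when the probe comes back with status 200, which matches the "200 OK" line in the expected output.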

We have now run the Triton server with the default models already present in the model repository. The next step is to add Hugging Face’s RoBERTa model to the model repository in such a manner that it is accepted by the Triton server. This includes the following steps: 1) convert the model into a format that the server can load, 2) write a config.pbtxt model configuration file, and 3) instantiate the server again with the newly added model.

Note: With the help of the following steps we can convert almost any Hugging Face PyTorch model into the Triton acceptable model.

Step 1: Load and Convert Hugging Face Model

Conversion of the model is done using its JIT-traced version. According to PyTorch’s documentation, “TorchScript is a way to create serializable and optimizable models from PyTorch code”. It allows developers to export their model to be reused in other programs, such as efficiency-oriented C++ programs.

Exporting a model requires dummy inputs of a standard length to execute the model’s forward pass. During the forward pass with the dummy inputs, PyTorch keeps track of the operations on each tensor and records them to create the “trace” of the model. Since the trace is created relative to the dummy input dimensions, future model inputs are constrained to those dimensions, and the traced model will not work for other sequence lengths or batch sizes. It is therefore recommended to trace the model with the largest dummy input dimensions that you expect to feed to the model in the future. Apart from this, we can always pad or truncate the input sequences.

In short, we perform a trace on the PyTorch model using dummy inputs and save the traced model in a format accepted by the Triton server.
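As a rough illustration of the tracing step, here is a minimal sketch. To keep it small and runnable without downloading a checkpoint, it traces a tiny stand-in module (TinyPairClassifier is hypothetical and is not RoBERTa); the output file name model.pt and the temporary output folder are also assumptions. For a real Hugging Face model, you would load the checkpoint instead of the stand-in, and the Transformers library offers a torchscript=True option in from_pretrained for this purpose.

```python
import os
import tempfile

import torch
from torch import nn

SEQ_LEN = 256    # fixed sequence length, matching the client script in this post
NUM_LABELS = 3   # contradiction, neutral, entailment


class TinyPairClassifier(nn.Module):
    """Hypothetical stand-in for the Hugging Face model (NOT RoBERTa)."""

    def __init__(self, vocab_size: int = 1000, hidden: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, NUM_LABELS)

    def forward(self, input_ids, attention_mask):
        # Cast to int64 inside the graph so the traced model accepts INT32 inputs.
        hidden = self.embed(input_ids.long())
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
        # Mean-pool over the non-padded positions, then classify.
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.head(pooled)


model = TinyPairClassifier().eval()

# The dummy inputs fix the shapes that the traced model will accept later.
dummy_ids = torch.ones((1, SEQ_LEN), dtype=torch.int32)
dummy_mask = torch.ones((1, SEQ_LEN), dtype=torch.int32)

with torch.no_grad():
    traced = torch.jit.trace(model, (dummy_ids, dummy_mask))

# Assumed file name and a throwaway folder; copy the file into the model repository afterwards.
out_path = os.path.join(tempfile.mkdtemp(), "model.pt")
torch.jit.save(traced, out_path)

# Reload the saved trace and run it once as a sanity check.
reloaded = torch.jit.load(out_path)
logits = reloaded(dummy_ids, dummy_mask)
print(tuple(logits.shape))
```

The two positional inputs and the single output of the traced module are what the input__0, input__1, and output__0 names in the configuration file refer to.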

Next, save the model in the model repository folder with the following directory structure (You can add as many models as you want with different frameworks into this model repository):

|- <pytorch_model_name>/
| |- config.pbtxt
| |- 1/
| |-
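If you prefer to create and check this layout from a script, the following is a minimal sketch with the standard library. The helper names are my own, and the model file name model.pt is an assumption (the listing above leaves the file name open), so adjust it to whatever your server version expects.

```python
import tempfile
from pathlib import Path


def make_model_dir(repo: Path, model_name: str, version: int = 1) -> Path:
    """Create <repo>/<model_name>/<version>/ and return the version directory."""
    version_dir = repo / model_name / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    return version_dir


def missing_pieces(repo: Path, model_name: str, model_file: str = "model.pt") -> list:
    """List what is still missing from the layout shown above."""
    model_dir = repo / model_name
    problems = []
    if not (model_dir / "config.pbtxt").is_file():
        problems.append("config.pbtxt")
    # Version directories are the numerically named sub-directories.
    versions = []
    if model_dir.is_dir():
        versions = [d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()]
    if not any((v / model_file).is_file() for v in versions):
        problems.append(model_file)
    return problems


repo = Path(tempfile.mkdtemp())             # throwaway repository for the demo
version_dir = make_model_dir(repo, "zst")
before = missing_pieces(repo, "zst")        # nothing has been written yet

(repo / "zst" / "config.pbtxt").write_text('name: "zst"\n')
(version_dir / "model.pt").write_bytes(b"")  # placeholder bytes, not a real model
after = missing_pieces(repo, "zst")

print(before)
print(after)
```

The check only looks at file names and folders; it does not validate the contents of the configuration or the model file.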

Step 2: Write the Configuration File

This configuration file, config.pbtxt, contains the details of the permissible input/output types and shapes, favorable batch sizes, versioning, and platform. The server doesn’t know these details on its own, so we write them into a separate configuration file.

A configuration file for Hugging Face’s RoBerta Model is as follows:

name: "zst"
platform: "pytorch_libtorch"
input [
  {
    name: "input__0"
    data_type: TYPE_INT32
    dims: [1, 256]
  },
  {
    name: "input__1"
    data_type: TYPE_INT32
    dims: [1, 256]
  }
]
output {
  name: "output__0"
  data_type: TYPE_FP32
  dims: [1, 3]
}
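Since a single missing brace is enough to make this file unparsable, it can help to generate it from a script. The sketch below renders the same fields as the configuration for the zst model; the helper names tensor_block and render_config are hypothetical, and the output is written in list form (output [ ... ]), which I am assuming is accepted just like the single-block form for a repeated field.

```python
def tensor_block(name: str, data_type: str, dims: list) -> str:
    """Render one input/output entry in protobuf text format."""
    dims_text = ", ".join(str(d) for d in dims)
    return (
        "  {\n"
        f'    name: "{name}"\n'
        f"    data_type: {data_type}\n"
        f"    dims: [{dims_text}]\n"
        "  }"
    )


def render_config(name: str, platform: str, inputs: list, outputs: list) -> str:
    """Render a minimal config.pbtxt with explicit input and output lists."""
    lines = [f'name: "{name}"', f'platform: "{platform}"']
    for field, tensors in (("input", inputs), ("output", outputs)):
        body = ",\n".join(tensor_block(*t) for t in tensors)
        lines.append(f"{field} [\n{body}\n]")
    return "\n".join(lines) + "\n"


config_text = render_config(
    name="zst",
    platform="pytorch_libtorch",
    inputs=[("input__0", "TYPE_INT32", [1, 256]), ("input__1", "TYPE_INT32", [1, 256])],
    outputs=[("output__0", "TYPE_FP32", [1, 3])],
)
print(config_text)
```

Writing config_text to <pytorch_model_name>/config.pbtxt then gives a file whose braces and brackets are balanced by construction.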

Now we instantiate the Triton server again with the newly added model in the model repository:

docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/full/path/to/example/model/repository:/models <docker image> tritonserver --model-repository=/models
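Loading a large model can take a while after the container starts, so a client may want to wait until the model reports ready. The following is a small sketch of that idea: model_ready_url builds the per-model readiness URL of the v2 HTTP protocol (assuming localhost and port 8000), and wait_until is a hypothetical polling helper. The demo uses a fake check function so that it runs without a server.

```python
import time
from typing import Callable


def model_ready_url(model: str, version: str = "1",
                    host: str = "localhost", port: int = 8000) -> str:
    """Readiness URL for one model version in the v2 inference protocol."""
    return f"http://{host}:{port}/v2/models/{model}/versions/{version}/ready"


def wait_until(check: Callable[[], bool], attempts: int = 10, delay: float = 1.0) -> bool:
    """Call check() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Demo with a fake check that succeeds on the third call (no server needed).
calls = {"n": 0}


def fake_check() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


print(model_ready_url("zst"))
print(wait_until(fake_check, attempts=5, delay=0.0))
```

In practice, check would issue an HTTP GET against model_ready_url("zst") and return True on a 200 response.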

Zero-Shot-Learning for Text Classification

GPT-3, recently released by OpenAI, is one of the largest NLP models to date, with a whopping 175 billion parameters. This gigantic model has achieved promising results under zero-shot, one-shot, and few-shot settings and in some cases even surpassed state-of-the-art models. All of this got me interested in digging deeper into zero-shot learning in NLP. Before the success of transformer models, most zero-shot learning research was concentrated on computer vision, but now there is a lot of interesting work going on in the NLP domain as well, thanks to the increased quality of sentence embeddings.

What is Zero-Shot-Learning (ZSL)?

In short, ZSL is the ability to detect classes that the model has never seen during training. In this blog post, I am using the latent embedding approach, where we embed both the premise (the given input sequence) and the hypothesis (the label against which we want to classify the premise) into the same space of the model and then find the cosine similarity between these two sentence embeddings. The task is to determine whether the hypothesis is true (entailment) or false (contradiction) given the premise. All this is done using the Natural Language Inference approach on which the RoBERTa model (a sequence-pair classification model) has been trained.
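To make the cosine-similarity step concrete, here is a toy sketch in plain Python. The three-dimensional vectors are made up for illustration and do not come from any model, and best_label is a hypothetical helper that picks the label whose embedding is closest to the premise embedding.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def best_label(premise_vec, label_vecs: dict) -> str:
    """Return the label whose embedding is most similar to the premise."""
    return max(label_vecs, key=lambda name: cosine_similarity(premise_vec, label_vecs[name]))


# Made-up embeddings: the premise points mostly along the "space" direction.
premise = [0.9, 0.1, 0.0]
labels = {"space": [1.0, 0.0, 0.0], "sports": [0.0, 1.0, 0.0]}
print(best_label(premise, labels))
```

With real sentence embeddings, the vectors would have hundreds of dimensions, but the comparison works the same way.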

Explaining ZSL in detail is out of the scope of this blog, but the above description will suffice for our purposes. Curious readers can read the Hugging Face team’s detailed explanation of ZSL, which can be found here: ZSL

Client-Side Script to Interact with Triton Inference Server for Zero-Shot-Text Classification

import argparse
import numpy as np
import sys
from functools import partial
import os
import tritongrpcclient
import tritongrpcclient.model_config_pb2 as mc
import tritonhttpclient
from tritonclientutils import triton_to_np_dtype
from tritonclientutils import InferenceServerException
from transformers import XLMRobertaTokenizer
from scipy.special import softmax

VERBOSE = False  # set to True to make the Triton client log its requests

R_tokenizer = XLMRobertaTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')

# hypothesis for topic classification
topic = 'This text is about space & cosmos'
input_name = ['input__0', 'input__1']
output_name = 'output__0'


def run_inference(premise, model_name='zst', url='', model_version='1'):
    # url: host:port of the Triton HTTP endpoint (HTTP is served on port 8000 above)
    triton_client = tritonhttpclient.InferenceServerClient(
        url=url, verbose=VERBOSE)
    model_metadata = triton_client.get_model_metadata(
        model_name=model_name, model_version=model_version)
    model_config = triton_client.get_model_config(
        model_name=model_name, model_version=model_version)

    # I have restricted the input sequence length to 256
    input_ids = R_tokenizer.encode(premise, topic, max_length=256, truncation=True, padding='max_length')
    input_ids = np.array(input_ids, dtype=np.int32)
    # the tokenizer pads with token id 1, so every other position is real input
    mask = input_ids != 1
    mask = np.array(mask, dtype=np.int32)

    mask = mask.reshape(1, 256)
    input_ids = input_ids.reshape(1, 256)

    # wrap both arrays as request inputs matching config.pbtxt
    input0 = tritonhttpclient.InferInput(input_name[0], (1, 256), 'INT32')
    input0.set_data_from_numpy(input_ids, binary_data=False)
    input1 = tritonhttpclient.InferInput(input_name[1], (1, 256), 'INT32')
    input1.set_data_from_numpy(mask, binary_data=False)
    output = tritonhttpclient.InferRequestedOutput(output_name, binary_data=False)

    response = triton_client.infer(model_name, model_version=model_version, inputs=[input0, input1], outputs=[output])
    logits = response.as_numpy('output__0')
    logits = np.asarray(logits, dtype=np.float32)

    # we throw away "neutral" (dim 1) and take the probability of
    # "entailment" (2) as the probability of the label being true
    entail_contradiction_logits = logits[:, [0, 2]]
    probs = softmax(entail_contradiction_logits, axis=1)
    true_prob = probs[:, 1].item() * 100
    print(f'Probability that the label is true: {true_prob:0.2f}%')


# topic classification premises
if __name__ == '__main__':
    run_inference('Jupiter’s Biggest Moons Started as Tiny Grains of Hail')

Output: Probability that the label is true: 98.28%
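The pre- and post-processing around the request can be tried out without a running server. The sketch below redoes both steps in plain Python: pad_and_mask mirrors the fixed-length padding and the input_ids != 1 mask, and entailment_probability mirrors the softmax over the contradiction and entailment logits. The function names, the token ids, and the logits are made up for illustration; only the padding id 1, the length 256, and the logit order are taken from the script.

```python
import math

PAD_ID = 1     # padding id that the client script masks out (input_ids != 1)
MAX_LEN = 256  # sequence length the model was traced with


def pad_and_mask(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Truncate or right-pad token ids and build the matching attention mask."""
    ids = list(token_ids)[:max_len]
    mask = [1] * len(ids)
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


def entailment_probability(logits):
    """Softmax over (contradiction, entailment), ignoring the neutral logit."""
    contradiction, _neutral, entailment = logits
    top = max(contradiction, entailment)  # subtract the max for numerical stability
    exp_c = math.exp(contradiction - top)
    exp_e = math.exp(entailment - top)
    return exp_e / (exp_c + exp_e)


ids, mask = pad_and_mask([0, 3293, 83, 2], max_len=8)  # made-up token ids
print(ids)
print(mask)
print(f"{entailment_probability([-2.0, 0.5, 3.0]) * 100:0.2f}%")
```

Dropping the neutral logit before the softmax means the two remaining probabilities always sum to one, so the printed value can be read directly as "how likely is the label to be true".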


In this blog post, we have seen how to set up your own Triton inference server, what the advantages of using a Triton server are, and how to write a minimal Python script to start communicating with the Triton server, i.e., sending requests and receiving predictions. This will be a series of blog posts: in the next one, I am going to wrap the client-side script into AWS Lambda functions and deploy it using SLS deploy on AWS, where it will communicate with the Triton server deployed on an AWS EC2 instance. So if you are interested in reading more articles on AI, Machine Learning, Deep Learning, NLP, and AWS, you can start following me here and on LinkedIn.


I would like to thank Define Media Company for letting me use their resources for writing this blog and my team lead/supervisor Dennis Jöst for his constant support.