How to deploy ONNX models in production

Source: Deep Learning on Medium

Interoperability is a problem in machine learning.

People have their preferences when it comes to frameworks. Many would like to do research and prototyping in one framework while serving their models in another. Unfortunately, most major frameworks make this kind of interoperability nearly impossible.

ONNX, an open source model format, is hoping to change that. ONNX’s vision is to be a universal model format, allowing models to be passed between frameworks and tools with maximum interoperability. A model can be trained with one framework, converted to ONNX, and deployed with another.

In this guide, we’re going to show how you can train and deploy an ONNX model in production using the classic iris data set.

Step 1. Train and export your model

First, we need a model to deploy.

In this example, we’re going to use XGBoost to train our classifier. Once trained, we will convert our model to ONNX and upload it to S3, from where we will deploy it in later steps.

You can train, convert, and upload your model by simply running this notebook. Once your model is uploaded, you can deploy it.

Step 2. Deploy your ONNX model with Cortex

Serving realtime inferences at production scale requires some pretty significant infrastructure work. You’ll need to implement:

  • Autoscaling so that your API can launch as many replicas of your model as needed to handle increases in traffic.
  • Prediction monitoring to keep on top of your model’s performance.
  • Rolling updates to allow you to update your model without downtime.

To do this manually, you’d need to use a suite of tools like Docker, Kubernetes, and too many Amazon services to count. You can, however, simply automate all of this work with Cortex.

Cortex is an open source tool that deploys your models as production-ready APIs on AWS. It automates all of the work we listed above, and lets you focus on doing data science—not infrastructure work.

Cortex’s ONNX integration requires just a few files to launch your API:

  • A Python script to handle and parse user requests and model predictions.
  • A config file to define your deployment.

Our Python script needs to implement both pre_inference() and post_inference() methods. Most of the work these methods do is around parsing and labeling.
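A sketch of that request handler is below. The function signatures follow Cortex's ONNX request handler interface as we understand it; treat the metadata arguments as opaque, and note that the field names match the iris payload we send later.

```python
# Hedged sketch of a Cortex ONNX request handler for the iris model.
import numpy as np

labels = ["iris-setosa", "iris-versicolor", "iris-virginica"]

def pre_inference(sample, metadata):
    # Order the incoming JSON fields into the float feature vector
    # the ONNX model expects, keyed by the model's input name.
    return {
        metadata[0].name: np.asarray(
            [[
                sample["sepal_length"],
                sample["sepal_width"],
                sample["petal_length"],
                sample["petal_width"],
            ]],
            dtype=np.float32,
        )
    }

def post_inference(prediction, metadata):
    # Map the predicted class index back to a human-readable label.
    predicted_class_id = int(np.asarray(prediction[0]).flatten()[0])
    return labels[predicted_class_id]
```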

Next, Cortex needs a config file called cortex.yaml to define your deployment (be sure to point it at your own S3 bucket).
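A minimal cortex.yaml might look like this. The keys follow Cortex's pre-1.0 config format as we recall it, and the deployment name, handler file name, and S3 path are all placeholders:

```yaml
# Hypothetical cortex.yaml; substitute your own S3 path and file names.
- kind: deployment
  name: iris

- kind: api
  name: classifier
  model: s3://my-bucket/model.onnx
  request_handler: handler.py
```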

You’ll also need a file called requirements.txt, which will tell Cortex which dependencies to install, but for this example your requirements.txt only needs a single dependency.
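Assuming the request handler above only imports NumPy (an assumption on our part), the one line would be:

```
numpy
```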


With your cortex.yaml defined, your Python script written, and your model uploaded to S3, you can deploy directly from the command line by running cortex deploy:

$ cortex deploy

creating classifier api

Now, you can use your API.

Step 3. Serve predictions from your ONNX model

One of the things Cortex does is generate an HTTP endpoint for you to query your model. You can access this endpoint by running:

$ cortex get classifier

endpoint: http://***

With your endpoint in hand, you can then query it using any technology capable of sending HTTP requests. For the sake of testing, we’ll use curl:

$ curl http://*** \
-X POST -H "Content-Type: application/json" \
-d '{"sepal_length": 5.2, "sepal_width": 3.6, "petal_length": 1.4, "petal_width": 0.3}'
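The same request can be built from Python instead of curl. The endpoint URL below is a placeholder for the one printed by cortex get classifier, and the actual send is left commented out since it assumes the requests library and a live API:

```python
# Build the iris prediction request programmatically.
import json

endpoint = "http://***"  # replace with your endpoint
payload = {
    "sepal_length": 5.2,
    "sepal_width": 3.6,
    "petal_length": 1.4,
    "petal_width": 0.3,
}
body = json.dumps(payload)

# With the requests library installed, the call itself would be:
# import requests
# print(requests.post(endpoint, data=body,
#                     headers={"Content-Type": "application/json"}).text)
```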

And that’s it. You now have a functioning API serving realtime inferences from your ONNX model. You can check on the health of your model at any point by running:

$ cortex get classifier --watch

status   up-to-date   available   requested   last update
live     1            1           1           8s

endpoint: http://***

If you have any questions, feel free to ask them in the Cortex Gitter!