A first look at AWS Inferentia

Source: Deep Learning on Medium

Launched at AWS re:Invent 2019, AWS Inferentia is a high-performance machine learning inference chip, custom-designed by AWS to deliver cost-effective, low-latency predictions at scale. Inferentia powers Amazon EC2 inf1 instances, a new instance family also launched at re:Invent.

In this post, I’d like to show you how to get started with Inferentia and TensorFlow. Please note that Apache MXNet, PyTorch and ONNX are also supported.

A primer on Inferentia

The CMP324 breakout session is a great introduction to Inferentia, and the Alexa use case is a rare look under the hood. It’s well worth your time.

In a nutshell, each Inferentia chip hosts 4 NeuronCores. Each one of these implements a “high performance systolic array matrix multiply engine” (nicely put, Gadi), and is also equipped with a large on-chip cache.

NeuronCores are interconnected, which makes it possible to:

  • Partition a model across multiple cores (and Inferentia chips, if several are available), storing it 100% in on-chip cache memory.
  • Stream data at full speed through the pipeline of cores, without having to deal with latency caused by external memory access.

Alternatively, you can run inference with different models on the same Inferentia chip. This is achieved by partitioning NeuronCores into NeuronCore Groups, and by loading different models on different groups.
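
To make the grouping idea concrete, here is a minimal sketch in plain Python. It only models how the 4 NeuronCores of one chip could be split into groups. The helper name partition_cores is hypothetical, and the NEURONCORE_GROUP_SIZES environment variable is an assumption based on the Neuron SDK documentation of that period: please check the documentation for your SDK version before relying on it.

```python
import os


def partition_cores(group_sizes, total_cores=4):
    """Split core indices 0..total_cores-1 into consecutive groups."""
    if any(size < 1 for size in group_sizes):
        raise ValueError("each group needs at least one core")
    if sum(group_sizes) > total_cores:
        raise ValueError("groups need more cores than the chip provides")
    groups, start = [], 0
    for size in group_sizes:
        groups.append(list(range(start, start + size)))
        start += size
    return groups


# Two models sharing one chip, each one getting a group of 2 cores
sizes = [2, 2]
groups = partition_cores(sizes)

# Assumption: the runtime reads group sizes from this environment variable
env_value = ",".join(str(size) for size in sizes)
os.environ["NEURONCORE_GROUP_SIZES"] = env_value

print(groups, env_value)
```

With this layout, the first model loaded would land on cores 0 and 1, and the second one on cores 2 and 3.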

The Neuron SDK

In order to run on Inferentia, models first need to be compiled to a hardware-optimized representation. Then, they may be loaded, executed and profiled using a specific runtime. These operations can be performed through command-line tools available in the AWS Neuron SDK, or through framework APIs.

Let’s get started!

Launching an EC2 instance for model compilation

This first step doesn’t require an inf1 instance. In fact, you should use a compute-optimized instance for fast and cost effective compilation. In order to avoid any software configuration, you should also use the Deep Learning AMI, which comes preinstalled with the Neuron SDK and with updated frameworks.

At the time of writing, the most recent Deep Learning AMI for Amazon Linux 2 is version 26.0, and its AMI identifier is ami-08e68326c36bf3710.

Using this AMI, I fire up a c5d.4xlarge instance. No special settings are required, just make sure you allow SSH access in the Security Group.
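
If you prefer scripting the launch, here is a hedged sketch using boto3. The key pair name, the security group identifier and the region are placeholders (assumptions, not values from this post). The launch function is only defined, not called, so nothing is created until you invoke it with your own values.

```python
def build_launch_params(ami_id, instance_type, key_name, security_group_id):
    """Build keyword arguments for the EC2 run_instances call."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "KeyName": key_name,
        "SecurityGroupIds": [security_group_id],
        "MinCount": 1,
        "MaxCount": 1,
    }


def launch(params, region="us-east-1"):
    """Launch one instance. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the sketch loads without boto3
    ec2 = boto3.client("ec2", region_name=region)  # region is an assumption
    response = ec2.run_instances(**params)
    return response["Instances"][0]["InstanceId"]


params = build_launch_params(
    ami_id="ami-08e68326c36bf3710",            # Deep Learning AMI 26.0 (from this post)
    instance_type="c5d.4xlarge",
    key_name="my-key-pair",                    # hypothetical
    security_group_id="sg-0123456789abcdef0",  # hypothetical, must allow SSH
)
print(params)
```

Keep in mind that AMI identifiers are specific to a region, so the identifier above only resolves in the region it was published in.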

[Image: c5d.4xlarge instance details (family, instance name, vCPUs, RAM, storage)]

Once the instance is up, I ssh to it, and I’m greeted by the familiar Deep Learning AMI banner, telling me that Conda environments are available for TensorFlow and Apache MXNet.

__| __|_ )
_| ( / Deep Learning AMI (Amazon Linux 2) Version 26.0
Please use one of the following commands to start the required environment with the framework of your choice:
for MXNet(+Keras2) with Python3 (CUDA 10.1 and Intel MKL-DNN)
source activate mxnet_p36
for MXNet(+Keras2) with Python2 (CUDA 10.1 and Intel MKL-DNN)
source activate mxnet_p27
for MXNet(+AWS Neuron) with Python3
source activate aws_neuron_mxnet_p36

for TensorFlow(+Keras2) with Python3 (CUDA 10.0 and Intel MKL-DNN)
source activate tensorflow_p36
for TensorFlow(+Keras2) with Python2 (CUDA 10.0 and Intel MKL-DNN)
source activate tensorflow_p27
for TensorFlow(+AWS Neuron) with Python3
source activate aws_neuron_tensorflow_p36

for TensorFlow 2(+Keras2) with Python3 (CUDA 10.0 and Intel MKL-DNN)
source activate tensorflow2_p36
for TensorFlow 2(+Keras2) with Python2 (CUDA 10.0 and Intel MKL-DNN)
source activate tensorflow2_p27
for PyTorch with Python3 (CUDA 10.1 and Intel MKL)
source activate pytorch_p36
for PyTorch with Python2 (CUDA 10.1 and Intel MKL)
source activate pytorch_p27
for Chainer with Python2 (CUDA 10.0 and Intel iDeep)
source activate chainer_p27
for Chainer with Python3 (CUDA 10.0 and Intel iDeep)
source activate chainer_p36
for base Python2 (CUDA 10.0)
source activate python2
for base Python3 (CUDA 10.0)
source activate python3
Official Conda User Guide: https://docs.conda.io/projects/conda/en/latest/user-guide/
AWS Deep Learning AMI Homepage: https://aws.amazon.com/machine-learning/amis/
Developer Guide and Release Notes: https://docs.aws.amazon.com/dlami/latest/devguide/what-is-dlami.html
Support: https://forums.aws.amazon.com/forum.jspa?forumID=263
For a fully managed experience, check out Amazon SageMaker at https://aws.amazon.com/sagemaker
When using INF1 type instances, please update regularly using the instructions at: https://github.com/aws/aws-neuron-sdk/tree/master/release-notes

I activate the appropriate environment, which provides all required dependencies.

For the rest of this post, any shell command prefixed by (aws_neuron_tensorflow_p36) should be run inside that Conda environment.

$ source activate aws_neuron_tensorflow_p36
(aws_neuron_tensorflow_p36) $

Next, I pin numpy and upgrade the tensorflow-neuron package.

(aws_neuron_tensorflow_p36) $ conda install numpy=1.17.2 --yes --quiet
(aws_neuron_tensorflow_p36) $ conda update tensorflow-neuron

We’re now ready to fetch a model and compile it.

Compiling a model

The code below fetches a ResNet50 image classification model pretrained on the ImageNet dataset, and exports it as a SavedModel to the ws_resnet50/resnet50 directory.

Then, it compiles it for Inferentia. A single line of code is required, the call to tfn.saved_model.compile: everything else is vanilla TensorFlow. Finally, the compiled model is saved in the ws_resnet50/resnet50_neuron directory, and in a ZIP file for easy copy to an inf1 instance.

import os
import time
import shutil
import tensorflow as tf
import tensorflow.neuron as tfn
import tensorflow.compat.v1.keras as keras
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

# Create a workspace
WORKSPACE = './ws_resnet50'
os.makedirs(WORKSPACE, exist_ok=True)

# Prepare export directory (old one removed)
model_dir = os.path.join(WORKSPACE, 'resnet50')
compiled_model_dir = os.path.join(WORKSPACE, 'resnet50_neuron')
shutil.rmtree(model_dir, ignore_errors=True)
shutil.rmtree(compiled_model_dir, ignore_errors=True)

# Instantiate Keras ResNet50 model (learning phase 0 = inference mode)
keras.backend.set_learning_phase(0)
model = ResNet50(weights='imagenet')

# Export SavedModel
tf.saved_model.simple_save(
    session=keras.backend.get_session(),
    export_dir=model_dir,
    inputs={'input': model.inputs[0]},
    outputs={'output': model.outputs[0]})

# Compile using Neuron
tfn.saved_model.compile(model_dir, compiled_model_dir)

# Prepare SavedModel for uploading to Inf1 instance
shutil.make_archive('./resnet50_neuron', 'zip', WORKSPACE, 'resnet50_neuron')

That one API is all it takes! Impressive.

Power users will enjoy reading about the CLI compiler, neuron-cc.

Running this code produces the expected output.

(aws_neuron_tensorflow_p36) $ python compile_resnet.py
<output removed>
Downloading data from https://github.com/keras-team/keras-applications/releases/download/resnet/resnet50_weights_tf_dim_ordering_tf_kernels.h5
102973440/102967424 [==============================] - 2s 0us/step
<output removed>
INFO:tensorflow:fusing subgraph neuron_op_d6f098c01c780733 with neuron-cc
INFO:tensorflow:Number of operations in TensorFlow session: 4638
INFO:tensorflow:Number of operations after tf.neuron optimizations: 556
INFO:tensorflow:Number of operations placed on Neuron runtime: 554
INFO:tensorflow:Successfully converted ./ws_resnet50/resnet50 to ./ws_resnet50/resnet50_neuron
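
The last log lines hint at how much of the graph ends up on Inferentia. As a quick sanity check, the sketch below parses them and computes the share of operations placed on the Neuron runtime. The log lines are copied from the output above; the helper name parse_op_counts is hypothetical.

```python
import re

LOG = """\
INFO:tensorflow:Number of operations in TensorFlow session: 4638
INFO:tensorflow:Number of operations after tf.neuron optimizations: 556
INFO:tensorflow:Number of operations placed on Neuron runtime: 554
"""


def parse_op_counts(log_text):
    """Return the three operation counts reported in the compilation log."""
    patterns = {
        "session": r"operations in TensorFlow session: (\d+)",
        "optimized": r"operations after tf\.neuron optimizations: (\d+)",
        "on_neuron": r"operations placed on Neuron runtime: (\d+)",
    }
    return {name: int(re.search(pattern, log_text).group(1))
            for name, pattern in patterns.items()}


counts = parse_op_counts(LOG)
share = 100.0 * counts["on_neuron"] / counts["optimized"]
print(f"{counts['on_neuron']} of {counts['optimized']} operations "
      f"({share:.1f}%) are placed on the Neuron runtime")
```

Here, all but two of the operations remaining after optimization are placed on the Neuron runtime. A much lower share on your own model would be a good reason to look at which operations are not supported.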

Then, I simply copy the ZIP file to an Amazon S3 bucket, probably the easiest way to share it with inf1 instances used for inference.

$ ls *.zip
$ aws s3 mb s3://jsimon-inf1-useast1
$ aws s3 cp resnet50_neuron.zip s3://jsimon-inf1-useast1
upload: ./resnet50_neuron.zip to s3://jsimon-inf1-useast1/resnet50_neuron.zip
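
The same copy can be scripted in Python. This is only a sketch, assuming boto3 is installed and AWS credentials are configured. The helper names are hypothetical, and the two functions that talk to S3 are defined but not called here.

```python
import os


def s3_key_for(local_path, prefix=""):
    """Derive the S3 object key from a local file name and an optional prefix."""
    name = os.path.basename(local_path)
    prefix = prefix.strip("/")
    return f"{prefix}/{name}" if prefix else name


def upload_model(local_path, bucket, prefix=""):
    """Upload the archive to S3. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the sketch loads without boto3
    key = s3_key_for(local_path, prefix)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


def download_model(bucket, key, local_path):
    """Download the archive from S3, for example on the inf1 instance."""
    import boto3
    boto3.client("s3").download_file(bucket, key, local_path)
    return local_path


print(s3_key_for("./resnet50_neuron.zip"))
```

On the compilation instance, you would call upload_model('resnet50_neuron.zip', 'jsimon-inf1-useast1'), replacing the bucket name with your own.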

Alright, let’s fire up one of these babies.

Predicting on Inferentia with TensorFlow

Using the same AMI as above, I launch an inf1.xlarge instance.

[Image: inf1.xlarge instance details (family, instance name, vCPUs, RAM, storage)]

Once this instance is up, I ssh to it, and I can view some properties using the neuron-ls CLI tool.

4 NeuronCores, as expected. The ‘east’ and ‘west’ columns show connections to other Inferentia chips: as this instance only has one, they’re empty here.

Next, I retrieve the compiled model from my S3 bucket, and extract it. I also retrieve a test image.

$ aws s3 cp s3://jsimon-inf1-useast1/resnet50_neuron.zip .
download: s3://jsimon-inf1-useast1/resnet50_neuron.zip to resnet50_neuron.zip
$ unzip resnet50_neuron.zip
Archive: resnet50_neuron.zip
creating: resnet50_neuron/
creating: resnet50_neuron/variables/
inflating: resnet50_neuron/saved_model.pb
$ curl -O https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg

Using the code below, I load and transform the test image. I then load the compiled model, and use it to classify the image.

import os
import time
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications import resnet50


# Create input from image
img_sgl = image.load_img('kitten_small.jpg', target_size=(224, 224))
img_arr = image.img_to_array(img_sgl)
img_arr2 = np.expand_dims(img_arr, axis=0)
img_arr3 = resnet50.preprocess_input(img_arr2)

# Load model
COMPILED_MODEL_DIR = './resnet50_neuron/'
predictor_inferentia = tf.contrib.predictor.from_saved_model(COMPILED_MODEL_DIR)

# Run inference
model_feed_dict={'input': img_arr3}
infa_rslts = predictor_inferentia(model_feed_dict)

# Display results
print(resnet50.decode_predictions(infa_rslts["output"], top=5)[0])

Can you guess how many lines of Inferentia-specific code are present here? The answer is zero. We seamlessly use the tf.contrib.predictor API. Woohoo!

Running this code produces the expected output, and we see the top 5 classes for the image.

(aws_neuron_tensorflow_p36) $ python infer_resnet50.py
<output removed>
[('n02123045', 'tabby', 0.6918919), ('n02127052', 'lynx', 0.12770271), ('n02123159', 'tiger_cat', 0.08277027), ('n02124075', 'Egyptian_cat', 0.06418919), ('n02128757', 'snow_leopard', 0.009290541)]
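
A single prediction says nothing about latency. If you would like to measure it, a small timing helper along these lines could do the job. This is a sketch, not a rigorous benchmark: the benchmark function and the stand-in predictor are hypothetical, and they only exist so that the code runs anywhere.

```python
import statistics
import time


def benchmark(predict_fn, feed, warmup=5, runs=50):
    """Time repeated calls to predict_fn(feed) and return latency statistics."""
    for _ in range(warmup):  # warm-up calls are not measured
        predict_fn(feed)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(feed)
        latencies.append((time.perf_counter() - start) * 1000.0)
    latencies.sort()
    total_ms = sum(latencies)
    return {
        "mean_ms": statistics.mean(latencies),
        "p50_ms": latencies[len(latencies) // 2],
        "p99_ms": latencies[min(len(latencies) - 1, int(len(latencies) * 0.99))],
        "throughput_per_s": 1000.0 * runs / total_ms if total_ms > 0 else float("inf"),
    }


def fake_predictor(feed):
    """Stand-in for a real predictor, so that the sketch runs without Inferentia."""
    return {"output": [sum(feed["input"])]}


stats = benchmark(fake_predictor, {"input": [0.1, 0.2, 0.3]})
print(stats)
```

On the inf1 instance, you would pass predictor_inferentia and model_feed_dict from the script above instead of the stand-in predictor.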

Now let’s see how we can deploy a compiled model using TensorFlow Serving, which is a very good option for production deployments.

Predicting on Inferentia with TensorFlow Serving

First, we need to package the model properly, and move it to a directory reflecting its version. We have only one here, so let’s move the saved model to a directory named ‘1’.

$ pwd
$ mkdir 1
$ mv * 1
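
The same layout can be built in Python, which also avoids asking mv to move the ‘1’ directory into itself. This is a sketch: the helper name is hypothetical, and the demo runs on a throwaway directory that mimics the extracted archive.

```python
import shutil
import tempfile
from pathlib import Path


def move_into_version_dir(model_dir, version=1):
    """Move every entry of model_dir into model_dir/<version>/."""
    model_dir = Path(model_dir)
    version_dir = model_dir / str(version)
    # List entries first, skipping the version directory if it already exists
    entries = [p for p in model_dir.iterdir() if p != version_dir]
    version_dir.mkdir(exist_ok=True)
    for entry in entries:
        shutil.move(str(entry), str(version_dir / entry.name))
    return version_dir


# Demo on a temporary directory mimicking the extracted archive
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "resnet50_neuron"
    (model_dir / "variables").mkdir(parents=True)
    (model_dir / "saved_model.pb").write_bytes(b"")
    move_into_version_dir(model_dir)
    layout = sorted(p.relative_to(model_dir).as_posix()
                    for p in model_dir.rglob("*"))
    print(layout)
```

TensorFlow Serving expects this kind of layout: a base directory for the model, with one numbered subdirectory per version.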

Now, we can launch TensorFlow Serving, and load the compiled model. Once again, this is vanilla TensorFlow.

(aws_neuron_tensorflow_p36) $ tensorflow_model_server_neuron 
2019-12-13 16:16:27.704882: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: resnet50 version: 1}
2019-12-13 16:16:27.706241: I tensorflow_serving/model_servers/server.cc:353] Running gRPC ModelServer at

Once TensorFlow Serving is up and running, we can use the script below to load a test image, and send it for prediction. At the risk of repeating myself… this is vanilla TensorFlow 🙂

import numpy as np
import grpc
import tensorflow as tf
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.applications.resnet50 import decode_predictions
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

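if __name__ == '__main__':
    # Test image downloaded earlier; adjust the path to where you saved it
    img_file = 'kitten_small.jpg'
    # Connect to the gRPC endpoint exposed by TensorFlow Serving
    chan = grpc.insecure_channel('localhost:8500')
    stub = prediction_service_pb2_grpc.PredictionServiceStub(chan)
    # Load the image, add a batch dimension and preprocess it
    img = image.load_img(img_file, target_size=(224, 224))
    img_array = preprocess_input(image.img_to_array(img)[None, ...])
    # Build the request; the model name must match the servable name in the server log
    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'resnet50'
    request.inputs['input'].CopyFrom(
        tf.make_tensor_proto(img_array, shape=img_array.shape))
    # Send the request and decode the top 5 classes
    result = stub.Predict(request)
    prediction = tf.make_ndarray(result.outputs['output'])
    print(decode_predictions(prediction))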

Running this code produces the expected output, and we see the top 5 classes for the image.

(aws_neuron_tensorflow_p36) $ python tfserving_resnet50.py
<output removed>
[[('n02123045', 'tabby', 0.6918919), ('n02127052', 'lynx', 0.12770271), ('n02123159', 'tiger_cat', 0.08277027), ('n02124075', 'Egyptian_cat', 0.06418919), ('n02128757', 'snow_leopard', 0.009290541)]]

Diving deeper

That’s it for today. I hope I gave you a clear introduction to AWS Inferentia, and to how easy it is to use! All it took was one line of code to compile our model.

If you’d like to dive deeper, I highly recommend the excellent workshop delivered at re:Invent by my colleague Wenming Ye. One of the labs shows you how to compile a 32-bit floating point (FP32) ResNet50 model to 16-bit floating point (FP16). By reducing arithmetic complexity, this technique is known to improve performance while preserving accuracy. Indeed, on an inf1.2xlarge instance, the FP16 model delivers an impressive 1,500 image classifications per second!
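
To get a feel for what FP16 does to numbers, here is a small sketch that relies only on the Python standard library. It rounds the five probabilities printed earlier to half precision and shows the resulting error. It only illustrates the numeric format, not how the Neuron compiler performs the conversion, and the helper name to_fp16 is hypothetical.

```python
import struct


def to_fp16(value):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Top 5 probabilities from the predictions above
probabilities = [0.6918919, 0.12770271, 0.08277027, 0.06418919, 0.009290541]

for p in probabilities:
    rounded = to_fp16(p)
    print(f"{p:.7f} -> {rounded:.7f} (absolute error {abs(rounded - p):.1e})")

# Storage cost per value, in bytes
fp32_bytes = len(struct.pack("<f", 1.0))
fp16_bytes = len(struct.pack("<e", 1.0))
print(f"FP32: {fp32_bytes} bytes, FP16: {fp16_bytes} bytes")
```

Each value takes half the space, and for numbers in this range the relative rounding error stays below 2^-11 (about 0.05%), which is too small to change the ranking of these five classes.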

As always, thank you for reading. Happy to answer questions here or on Twitter.