Deploying an externally trained deep learning model for batch inference in AWS

Original article was published by Nicole Ramirez on Deep Learning on Medium

Step 1: Package the saved model file and upload to S3

Following training, our saved model (Requirement 1 above) needs to be compressed into a model.tar.gz archive on our local system, the format SageMaker recognises. We then transfer the compressed model to AWS by uploading it to the S3 bucket we’ve created previously. We copy the S3 path of our model, as we’ll need to reference it later on in our notebook:

Inside our S3 bucket, we click on our uploaded model file, and in the dialog box that appears, we click ‘Copy Path’

Step 2: Going into script mode: creating our main entry-point script

In the process of training our model, we‘d have written code to read data in, train the model, validate it, run inference with it, and format the results into a final output. We transfer these functionalities to AWS by refactoring them into an entry-point script (Requirement 3 above), which the model’s Docker container invokes when we first initialise the model in SageMaker. This script holds everything the model needs to perform inference or be trained.

When your model is deployed through deploy() or transform() commands (more on this later), SageMaker starts your model server inside a Docker container, as previously mentioned.

The server then loads and uses your model by invoking a series of specific functions that have default implementations, which we override with our own. The first function, model_fn(), handles loading the model onto the server.

Below is our implementation for loading a BertForSequenceClassification model into the PyTorch model server with model_fn(). It returns a model loaded onto the correct device (i.e. GPU or CPU, depending on availability).
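As a sketch (assuming the artefact Epoch-6.model is a plain PyTorch state dict, and that bert-base-uncased with two labels matches how the model was trained; both are illustrative assumptions), model_fn() can look like:

```python
import os

import torch
from transformers import BertForSequenceClassification


def model_fn(model_dir):
    """Load the model from /opt/ml/model onto GPU if available, else CPU."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # num_labels=2 is an assumption about the classification task
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)
    # "Epoch-6.model" is the uncompressed artefact extracted from model.tar.gz
    state_dict = torch.load(os.path.join(model_dir, "Epoch-6.model"),
                            map_location=device)
    model.load_state_dict(state_dict)
    return model.to(device).eval()
```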

Incoming requests to the model may be one of two types: a request for inference (which is our case here), or a request for re-training (which we don’t cover here). For the former, requests are handled by the server in three steps, with the associated functions described below:

  1. First, the input data within the request is processed by input_fn()
  2. The result of step 1 is passed on to predict_fn(), which contains the inference code. This code makes use of the model loaded in model_fn()
  3. An optional function, output_fn(), gets the result of step 2 and post-processes it prior to transfer to S3 storage, or display somewhere else.

Here’s our implementation for each of the functions above:
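A sketch of these three functions, assuming CSV input with one piece of text per line, JSON output, and a bert-base-uncased tokeniser (all illustrative assumptions; torch and transformers are imported inside predict_fn so the pre- and post-processing can be tried without them installed):

```python
import json


def input_fn(request_body, request_content_type):
    # With split_type="Line", batch transform sends one CSV line per record
    if request_content_type == "text/csv":
        return [line for line in request_body.strip().splitlines() if line]
    raise ValueError(f"Unsupported content type: {request_content_type}")


def predict_fn(input_data, model):
    import torch
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    device = next(model.parameters()).device
    encoded = tokenizer(input_data, padding=True, truncation=True,
                        max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(input_ids=encoded["input_ids"].to(device),
                       attention_mask=encoded["attention_mask"].to(device))[0]
    # Return one predicted class index per input line
    return logits.argmax(dim=1).tolist()


def output_fn(prediction, accept="application/json"):
    # Optional post-processing: serialise predictions before they reach S3
    return json.dumps(prediction)
```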

So, to recap: a model deployed for inference must implement model_fn(), input_fn(), predict_fn(), and optionally output_fn() in its entry-point script.

Step 2.5: Other tips for the entry-point script

  • Adding logging events to the script will help debug any issues we might encounter during runtime. We implement logging in our entry-point script through Python’s logging library, via the below pieces of code:
  • Remember how the model server is run inside a container? Inside the container’s /opt/ml directory, the model and its files are expected to be laid out in the following structure:

|- Epoch-6.model
|- code/
   |- requirements.txt
  • When we first initialise a model, the SageMaker Containers library extracts the artefact from the model.tar.gz file we created in Step 1.), and transfers the uncompressed artefact (i.e. Epoch-6.model above) to /opt/ml/model. The entry-point script is then transferred to /opt/ml/model/code. This latter directory is where the containers library looks for Python scripts to run, plus the requirements.txt file. So we add code at the top of our entry-point script to transfer any other model scripts and the requirements.txt file to this directory.
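For the logging set-up mentioned in the first bullet, something like the following at the top of the entry-point script is enough:

```python
import logging
import sys

# Send log records to stdout so they appear in the batch transform's
# CloudWatch logs and, via .wait(), in the notebook's output cell
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

logger.info("Entry-point script loaded")
```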

Step 3: Creating the requirements.txt file

This file specifies any package dependencies and their versions for our model script(s). To populate it, look at the release notes for the deep learning container you’re using (here are the release notes for the PyTorch v1.6.0 container image we’ve used) and check which packages are already included. Any package, or package version, that isn’t included but is used by any of your model script(s) has to be specified in requirements.txt. The container downloads these packages before running your scripts.
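For example (the package and version pin below are purely illustrative; your own list depends on what your scripts import and what the container already ships):

```
# Packages used by our scripts but absent from the container image
transformers==3.5.1
```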

Step 4: Import packages in Notebook Instance

Whilst we can create a model and a batch transform job using the SageMaker Console, I prefer doing so with SageMaker APIs in a notebook instance. This allows me to have the workflow all in one place, and view log messages created by my scripts in the notebook itself during runtime. The rest of this step is demonstrated via those APIs.

We start set-up by importing necessary packages:

Step 5: Get session, role, and bucket information

  • sagemaker_session : The session object that manages interactions with SageMaker APIs and any other AWS service that this inference job uses.
  • role: The IAM role that controls access to other AWS services. This is the SageMaker execution role by default.
  • bucket: The name of an S3 bucket we’d like results to be stored into.

Step 6: Initialise PyTorch Model

We initialise a PyTorchModel for inference above, calling the PyTorchModel class (documentation here). The path to our model.tar.gz file copied in Step 1.) is specified, as are our IAM role and AWS PyTorch container version (i.e. the version of PyTorch we used to write our model scripts). Specifying source_dir tells SageMaker to look into a folder called bert-sa-scripts in our current working environment and find the entry-point script there. As mentioned in Step 2.5), this script is transferred to the container’s /opt/ml/model/code directory when this code executes.

Step 7: Create a Transformer from pytorch_model

SageMaker’s Transformer handles transformations, including inference, on a batch of data. We use it instead of an Estimator when deploying our model because, while an Estimator makes predictions on a single input, a Transformer does so for multiple inputs. In SageMaker, a model deployed to a real-time endpoint via the deploy() command uses an Estimator, whereas a model deployed for serverless offline predictions with the transform() command uses a Transformer.

In initialising a Transformer, as the documentation shows, the first parameter is the name of the model we’ve initialised in Step 6. We can provide this name as a string, or call the .transformer(...) method on the model object directly, as we’ve done above. We specify the transformer’s EC2 instance type, which gets started only when a batch transform job runs, making batch transforms truly serverless! You can read about the other parameters in the linked API; the session and bucket we’ve specified in Step 5.) are used here.
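As a sketch (the instance type and output path here are illustrative choices):

```python
def make_transformer(pytorch_model, bucket):
    # .transformer(...) registers the model with SageMaker, so valid AWS
    # credentials are needed when this actually runs
    return pytorch_model.transformer(
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path=f"s3://{bucket}/batch-output",
    )
```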

Step 8: Run a batch transform job!

Up until this point, we’ve initialised a PyTorch model and used that model to initialise a transformer. We’re now ready to use that transformer to perform batch inference on our data! Our data currently sits inside a .csv file in the sagemaker-bert-pytorch S3 bucket we’ve alluded to in Step 5.).

Below is some helper code to list the contents of that bucket:

Once we’ve copied the filename of our data from the output of the above code, we paste it into the .transform(...) call, which starts our batch inference job.
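The call itself can be sketched as follows, where data_key is the CSV filename we just copied; the content_type and split_type values assume one record per line:

```python
def run_batch_inference(transformer, bucket, data_key):
    transformer.transform(
        data=f"s3://{bucket}/{data_key}",  # the CSV we want predictions for
        content_type="text/csv",
        split_type="Line",  # treat each line as one record
    )
    transformer.wait()  # block until done, streaming job logs to the notebook
```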

Here’s the API documentation for the .transform(...) function. The .wait() call causes logs from our scripts to be printed into the output cell of the notebook, useful for seeing where our programme is at.

Logs printed out in the output cell

I found it useful to include a log in my model script indicating whether the predictions were successfully generated. This log sits inside my output_fn() and prints the number of entries that predictions were generated for.

Step 9: Check the predictions

When the .transform(...) call is finished, we can check the output by calling the following command within our notebook:

!aws s3 cp --quiet --recursive {transformer.output_path} ./batch_predictions

This tells AWS to copy the transformer’s output, which sits in the output_path set in Step 8.), to a folder called batch_predictions in the current notebook’s working environment, so we can inspect it without having to go into S3. --quiet disables any printed logs while the process is running, and --recursive indicates that we copy folders as well, if they exist.

Checking that folder, we see the output file, which looks like this:

Confirming that our batch inference job did generate predictions successfully!


In this post we’ve covered how to deploy a deep learning model trained outside AWS in an AWS-managed deep learning container. To further automate this set-up, we can configure our batch transform job to start automatically on a schedule, or whenever a new .csv file lands in the S3 bucket, instead of manually as we did in this SageMaker notebook. I will cover this in a future post, so stay tuned!

Special thanks to our Senior Dev, Denis Tereshchenko for proof-reading this article in full!