Quick Start to Multi-GPU Deep Learning on AWS Sagemaker using TF.Distribute

Original article was published on Deep Learning on Medium

This article is a quick start guide to running distributed multi-GPU deep learning using AWS Sagemaker and the tf.distribute API in TensorFlow 2.2.0.


All of my code related to this article can be found in my GitHub repository, here. The code in my repository is an example of running a version of BERT on data from Kaggle, specifically the Jigsaw Multilingual Toxic Comment Classification competition. Some of my code is adapted from a top public kernel.

The Need to Know Information

Getting Started

First, we need to understand our options for running deep learning on AWS Sagemaker.

  1. Run your code in a notebook instance
  2. Run your code in a tailored Sagemaker TensorFlow container

In this article, we focus on option #2 because it’s cheaper and it’s the intended design of Sagemaker.

(Option #1 is a nice way to get started, but it’s more expensive because you’re paying for every second the notebook instance is running.)

Running a Sagemaker TensorFlow Container

There is a lot of flexibility to Sagemaker TensorFlow containers, but we’re going to focus on the bare essentials.

To start, we need to launch a Sagemaker notebook instance and store our data on S3. If you don’t know how to do this, I review some simple options on my blog. Once we have our data in S3, we can launch a Jupyter notebook (from our notebook instance) and start coding. This notebook will be responsible for launching our training job, i.e. our Sagemaker TensorFlow container.

Again, we’re going to focus on the bare essentials. We need a variable to indicate where our data is located, and then we need to add that location to a dictionary.

data_s3 = 's3://<your-bucket>/'
inputs = {'data':data_s3}
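
Each key in this dictionary names an input channel. With the default File input mode, Sagemaker copies a channel's S3 data into the training container under /opt/ml/input/data/<channel> and exposes that path in an environment variable named SM_CHANNEL_<CHANNEL>. The sketch below only spells out that mapping for the dictionary above; the helper name channel_locations is made up for illustration and is not part of the Sagemaker SDK.

```python
# Illustrative helper (not part of the Sagemaker SDK): show where each
# input channel ends up inside the training container in File mode.
def channel_locations(inputs):
    locations = {}
    for channel, s3_uri in inputs.items():
        locations[channel] = {
            's3_uri': s3_uri,
            'env_var': 'SM_CHANNEL_' + channel.upper(),
            'container_path': '/opt/ml/input/data/' + channel,
        }
    return locations


data_s3 = 's3://<your-bucket>/'
inputs = {'data': data_s3}

for channel, info in channel_locations(inputs).items():
    print(channel, '->', info['env_var'], '=', info['container_path'])
```

So the 'data' key used here is what the training script will later see as SM_CHANNEL_DATA.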

Pretty simple. Now we need to create a Sagemaker TensorFlow estimator object, which describes the container our training job will run in.

Our entry_point is a Python script (which we’ll make later) that contains all of our modeling code. Our train_instance_type is a multi-GPU Sagemaker instance type. You can find a full list of Sagemaker instance types here. Notice that an ml.p3.8xlarge has 4 NVIDIA V100 GPUs. And since we’re going to be using MirroredStrategy (more on this later), which synchronizes training across the GPUs of a single machine, we need train_instance_count=1. So that’s 1 machine with 4 V100s. The other settings you can leave alone for now, or research further as needed.

The main settings we need to get right are entry_point and train_instance_type. (And then for MirroredStrategy we need train_instance_count=1.)

# create estimator
estimator = TensorFlow(entry_point='jigsaw_DistilBert_SingleRun_v1_sm_tfdist0.py',
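
For reference, here is a minimal sketch of a complete estimator definition using only the settings discussed above. It assumes version 1.x of the Sagemaker Python SDK (version 2 renamed these two arguments to instance_type and instance_count). The framework_version and py_version values are assumptions for TensorFlow 2.2.0, and build_estimator is a made-up helper name. The SDK import sits inside the function so the settings can be read without the SDK installed.

```python
# Settings discussed in the text, plus assumed version pins.
ESTIMATOR_SETTINGS = {
    'entry_point': 'jigsaw_DistilBert_SingleRun_v1_sm_tfdist0.py',
    'train_instance_type': 'ml.p3.8xlarge',  # 1 machine with 4 V100 GPUs
    'train_instance_count': 1,               # MirroredStrategy uses a single machine
    'framework_version': '2.2.0',            # assumption: matches TensorFlow 2.2.0
    'py_version': 'py37',                    # assumption: Python version for that image
}


def build_estimator(role):
    """Create the estimator; `role` is the IAM role ARN used by the training job."""
    # Imported here so this sketch can be read and checked without the SDK.
    from sagemaker.tensorflow import TensorFlow  # Sagemaker Python SDK 1.x
    return TensorFlow(role=role, **ESTIMATOR_SETTINGS)
```

In a notebook instance, the role is typically obtained with sagemaker.get_execution_role() and passed in as build_estimator(role).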

We can kick off our training job by running the following line.

estimator.fit(inputs)

Notice that we included our dictionary (which contained our S3 location) as an input to ‘fit()’. Before we run this code, we need to create the Python script which we tied to entry_point (otherwise our container won’t have any code to run).

Create Training Script

I have a lot going on in my training script because I’m running a version of BERT on some data from Kaggle, but I’m going to highlight the main code required for Sagemaker.

First we need to grab our data location, which was passed when we ran ‘estimator.fit(inputs)’. We can do this using argparse.

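import argparse
import os

def parse_args():
    parser = argparse.ArgumentParser()
    # default is an assumption: Sagemaker sets SM_CHANNEL_DATA for the 'data' channel
    parser.add_argument('--data', type=str, default=os.environ.get('SM_CHANNEL_DATA'))
    return parser.parse_known_args()

args, _ = parse_args()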

You could probably simplify this even further by just hard coding your S3 location in your training script.

If all we wanted to do was run our training job in a Sagemaker container, that’s basically all we need. Now if we want to run multi-GPU training using tf.distribute, we need a few more things.

Say Goodbye to Horovod, Say Hello to TF.Distribute

First we need to indicate that we want to run multi-GPU training. We can do that very easily with the following line.

strategy = tf.distribute.MirroredStrategy()

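We’re going to use our strategy object throughout our training code. Next we need to scale our batch size by the number of GPUs, because the batch size we pass to Keras is the global batch and gets split across the GPUs. We do that by including the following line.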

BATCH_SIZE = 16 * strategy.num_replicas_in_sync
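
To make the arithmetic concrete: MirroredStrategy splits each global batch evenly across the replicas, so a per-GPU batch of 16 on a 4-GPU ml.p3.8xlarge gives a global batch of 64. The sketch below works this through in plain Python; the dataset size of 10,000 examples is a made-up number for illustration.

```python
import math

PER_REPLICA_BATCH = 16   # batch size each GPU processes per step
NUM_REPLICAS = 4         # strategy.num_replicas_in_sync on an ml.p3.8xlarge
NUM_EXAMPLES = 10_000    # hypothetical dataset size

# Global batch handed to Keras, split evenly across the replicas.
global_batch = PER_REPLICA_BATCH * NUM_REPLICAS

# Number of optimizer steps needed to see every example once.
steps_per_epoch = math.ceil(NUM_EXAMPLES / global_batch)

print('global batch size:', global_batch)
print('steps per epoch:', steps_per_epoch)
```

With these numbers the global batch is 64 and one epoch takes 157 steps, compared with 625 steps for a batch of 16 on a single GPU.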

To distribute our model, we define (and compile) it inside the strategy’s scope, so that its variables are mirrored across the GPUs.

with strategy.scope():
    # define model here

And that’s it! We can then go on to run ‘model.fit()’ as we usually do.
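
Putting the three pieces together, here is a minimal end-to-end sketch. It is not the BERT model from my repository: the tiny dense network and the random data are stand-ins chosen only so the example runs quickly, and on a machine without GPUs MirroredStrategy simply falls back to a single replica.

```python
import numpy as np
import tensorflow as tf

# 1. Create the strategy (uses all GPUs visible on this machine).
strategy = tf.distribute.MirroredStrategy()

# 2. Scale the batch size by the number of replicas.
BATCH_SIZE = 16 * strategy.num_replicas_in_sync

# Toy data standing in for the real features and labels.
x = np.random.rand(256, 10).astype('float32')
y = np.random.randint(0, 2, size=(256, 1)).astype('float32')
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(BATCH_SIZE)

# 3. Build and compile the model inside the scope so its variables
#    are created as mirrored variables.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation='relu', input_shape=(10,)),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

# Training itself is called the usual way, outside the scope.
history = model.fit(dataset, epochs=1, verbose=0)
print('replicas:', strategy.num_replicas_in_sync)
```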

Again, full code related to this article can be found in my GitHub repository, here.

Thanks for reading and hope you find this helpful!