Original article was published by Ariel Shiftan on Deep Learning on Medium
Single line distributed PyTorch training on AWS SageMaker
How to iterate faster on your data science project and let your brilliant idea see the light of day
No matter what research project I’m working on, having the right infrastructure in place is always critical for success. It’s very simple: assuming all other “parameters” are equal, the faster and cheaper one can iterate, the better the results that can be achieved for the same time and cost budget. As these are always limited, good infrastructure is often what makes or breaks a breakthrough.
When it comes to data science, and deep learning specifically, the ability to easily distribute the training process on powerful hardware is key to achieving fast iterations. Even the brightest idea requires a few iterations to get polished and validated, e.g. for checking different pre-processing options, network architectures, and standard hyperparameters such as batch size and learning rate.
In this post I’d like to show you how easy (and cheap, if you want) it is to distribute existing PyTorch training code on AWS SageMaker using simple-sagemaker. Moreover, you’ll see how to easily monitor and analyze resource utilization and key training metrics in real time, during the training.
A complete distributed ImageNet training pipeline
I’ll be basing the demonstration on PyTorch’s official ImageNet example. In fact, I’ll be running it as-is on AWS by using just a single command!
To be more precise, I assume:
- You’ve already installed simple-sagemaker:
pip install simple-sagemaker
- Your AWS account credentials and default region are configured for boto3, as explained in the Boto3 docs
- The ImageNet data is already stored (and extracted, more on that later) on an S3 bucket, e.g.
- You’ve downloaded the training code main.py to the current working directory
Now, to run the training code on a single p3.2xlarge instance, just run the following command:
That’s it. Take a seat while the training job is running. You may actually want to get some sleep, as it should take ~8 hours for 10 epochs on the complete dataset (~1.2M images in total). Don’t worry, the trained model will be waiting for you under output/state at the end.
Some explanation of the command’s arguments:
- shell: Run a shell task.
- -p imagenet -t 1-node: Name the project and task.
- -o ./output1 --download_state: A local path to download the output to at the end of the training, plus a request to download the state directory (logs are downloaded by default as well).
- --iis: Use the imagenet data from s3://bucket/imagenet-data as an input channel mapped to a local path on the training instance.
- --it ml.p3.2xlarge: Set the instance type to ml.p3.2xlarge.
- -d main.py: Add main.py as a dependency.
- -v 280: Set the EBS volume size to 280 GB.
- --no_spot: Use on-demand instances instead of spot instances (the default). More expensive, but time is what we’re trying to save here.
- --cmd_line: Execute the training script.
A few notes:
- Each task is assigned a dedicated S3 path under [Bucket name]/[Project name]/[Task name]. A few directories exist under that path, but the one relevant to us now is state, which gets continuously synchronized to S3 and persists across consecutive executions of the same task.
- As the script keeps checkpoints in the current working directory, we first change it to the state directory.
- World size and node rank are set based on the environment variables accessible to the worker code. A complete list of these can be found here.
- --resume: Resume the training in case it was stopped, based on the checkpoint saved in the state directory.
- If the data was already on your local machine, you could use the -i argument (instead of --iis) to get it automatically uploaded to S3 and used as the “data” input channel.
For more information, read the documentation, or run ssm shell -h to get help on the command-line parameters.
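To make the note on world size and node rank more concrete, here is a minimal sketch of how they could be derived inside a training container. It assumes the SM_HOSTS (a JSON list of host names) and SM_CURRENT_HOST environment variables set by the SageMaker training toolkit. The function name and the fallback values are hypothetical, and this is not necessarily how simple-sagemaker itself does it.

```python
import json
import os


def world_size_and_rank(environ=None):
    """Return (world_size, node_rank) derived from SageMaker-style variables."""
    env = os.environ if environ is None else environ
    # SM_HOSTS is assumed to hold a JSON list such as '["algo-1", "algo-2"]'.
    hosts = sorted(json.loads(env.get("SM_HOSTS", '["algo-1"]')))
    # Fall back to the first host when running outside of SageMaker.
    current = env.get("SM_CURRENT_HOST", hosts[0])
    # The rank is the position of this host in the sorted host list,
    # so every node computes a distinct, consistent value.
    return len(hosts), hosts.index(current)


if __name__ == "__main__":
    size, rank = world_size_and_rank()
    # --world-size and --rank are arguments of the ImageNet example's main.py.
    print(f"--world-size {size} --rank {rank}")
```

The printed values can then be passed on to main.py, so the same command line works unchanged on every node.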
Let’s take a look at the end of the output logs:
* Acc@1 46.340 Acc@5 72.484
Total time: 26614 seconds
We achieved 46.34% top-1 accuracy and 72.48% top-5 accuracy, with a total training time of 26,614 seconds = ~7:23 hours. The total of ~8 hours of running time is due to overhead, mostly from downloading (and extracting) the input data (~150 GB). Roughly 3–4 minutes of it are the “standard SageMaker overhead”: launching and preparing the instance, and downloading the input data and the training image. This can get a bit longer when training on spot instances.
Going back to the time budget you have, ~8 hours may be too much for you to wait. As explained above, it may even mean in some cases that your brilliant idea isn’t going to see the light of day :(.
Luckily, the ImageNet example code is well written, and can easily be accelerated by distributing the training on a few instances. Moreover, it’s going to be just a single additional argument!
That’s it again. Setting the instance count to 3 (--ic 3) is all you need to get the same job done in ~3:51 hours!
Looking at the output logs in ./output2/logs/log0, we see that the same accuracy is achieved in less than half of the training time: 11,605 seconds = ~3:13 hours!
* Acc@1 46.516 Acc@5 72.688
Total time: 11605 seconds
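To put the two runs side by side, here is a small sketch of the arithmetic behind these numbers. It only uses the two “Total time” values from the logs above. The helper name and the notion of “scaling efficiency” (speedup divided by the number of nodes) are my own additions for illustration.

```python
def hours_minutes(seconds):
    """Format a duration in seconds as H:MM, dropping leftover seconds."""
    hours, remainder = divmod(int(seconds), 3600)
    return f"{hours}:{remainder // 60:02d}"


single_node = 26614  # "Total time" from the 1-node log
three_nodes = 11605  # "Total time" from the 3-node log

speedup = single_node / three_nodes
efficiency = speedup / 3  # fraction of the ideal 3x speedup

print(hours_minutes(single_node), hours_minutes(three_nodes))  # 7:23 3:13
print(f"speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")
```

So three nodes give roughly a 2.29x speedup of the training itself, about 76% of the ideal 3x.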
Monitoring the training process
So, you’ve saved the time, and your brilliant idea is closer to the light of day. But, how can you easily watch and monitor the progress to make the chances even higher?
First, take a look at the SageMaker training jobs console, and select the 3-nodes job. Here you should be able to get all the available information about it:
There’s much more information available there, e.g. links to the state (checkpoints) and output on S3. Feel free to explore it.
The most interesting part for us now is the “Monitor” section where you can get graphs of instance utilization (CPU, RAM, GPU, GPU memory, Disk) and algorithm metrics (more about it in a second).
Links to the full logs and to a dynamic CloudWatch graph system for the instance and for the algorithm metrics are at the top of that section, and are the way to go for the analysis.
Here’re the instance metrics from the two training cycles:
The graphs get updated in real time, and should allow you to make sure everything is going as expected and to check whether there are any crucial bottlenecks that should be taken care of.
The rule of thumb is to make sure the GPU load is close to 100% most of the time. Taking our 3-node example in the graph above, it can be seen that the GPU is only ~70% loaded, which means we may be able to get more out of that hardware. A good guess may be to use more data loading workers, to push data faster toward the GPU.
To further simplify the monitoring of the training process, we can use SageMaker metrics to get training specific metrics in real time. Again, we just need to update a few more parameters:
--md is used to define the metrics, where the first argument is the name, e.g. "loss", and the second is a regular expression that extracts the numeric value from the output logs.
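To illustrate the idea of a metric definition, here is a minimal sketch that applies name and regular expression pairs to a log line, with the value taken from the single capturing group. The metric names and patterns below are hypothetical examples written against the summary line format shown earlier (“* Acc@1 … Acc@5 …”), not the ones used in the original command.

```python
import re

# Hypothetical metric definitions: (name, regex with one capturing group).
METRICS = [
    ("acc1", r"\* Acc@1 ([0-9.]+)"),
    ("acc5", r"\* Acc@1 [0-9.]+ Acc@5 ([0-9.]+)"),
]


def extract_metrics(line, definitions=METRICS):
    """Return {name: value} for every definition that matches the log line."""
    found = {}
    for name, pattern in definitions:
        match = re.search(pattern, line)
        if match:
            # The numeric value is whatever the capturing group matched.
            found[name] = float(match.group(1))
    return found


print(extract_metrics(" * Acc@1 46.340 Acc@5 72.484"))
```

Testing a pattern locally like this against a few real log lines is a cheap way to catch a regular expression that silently matches nothing.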
Fast forward, here’re the algorithm metrics from the two training cycles:
As the full dataset is ~1.2M images with a total size of almost 150 GB, downloading it locally and then uploading it to an S3 bucket is going to take a lot of time. In addition, synchronizing that many files from S3 to the training instances before training would be very slow as well. The following strategy is used to overcome these issues:
- The data is downloaded using a dedicated processing task, equipped with a much faster S3 connection.
- The data on S3 is kept in ~1000 .tar files, which get extracted on the training instance right before the training.
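The extraction step in this strategy could look roughly like the sketch below. It is only an illustration in Python and not the shell script used in the pipeline. The function name and the assumption that every archive is unpacked into a folder named after it are mine, and the archives are assumed to come from a trusted source.

```python
import tarfile
from pathlib import Path


def extract_all(archive_dir, target_dir):
    """Extract every .tar file in archive_dir into target_dir/<archive name>."""
    target = Path(target_dir)
    extracted = []
    for archive in sorted(Path(archive_dir).glob("*.tar")):
        # Assumption: each archive gets its own folder, named after the file.
        destination = target / archive.stem
        destination.mkdir(parents=True, exist_ok=True)
        with tarfile.open(archive) as tar:
            tar.extractall(destination)
        extracted.append(destination)
    return extracted
```

Keeping ~1000 archives instead of ~1.2M small files means far fewer S3 objects to synchronize, at the cost of this one extra step on the instance.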
The processing task can easily be launched by using the following
download.sh is a bash script that downloads and arranges the data in the path it gets as its first argument ($SSM_OUTPUT/data). Once that task is completed, the data directory, with its val subfolders, gets placed under the output directory of the task’s dedicated S3 path, [Bucket name]/[Project name]/[Task name]. The training command should now be updated as well:
Two changes were introduced here:
- In order to “chain” the two tasks, making the output of the processing task the input of the training task, the argument --iit is now used instead of --iis. This maps the val subfolders to input channels with the same names, accessible by the training code.
- The shell script extract.sh is used to extract the data right before the training.
A complete pipeline that downloads the data and executes a few training alternatives can be found on the simple-sagemaker repository.
An easy distributed model training setup is a must-have tool for your data science projects. It’s now within your reach, simpler and easier than you ever thought. Make use of it!