Original article was published by Shiftan on Deep Learning on Medium
Simple-Sagemaker to the rescue
Simple-sagemaker allows you to take your existing code, as is, and run it on the cloud with little or no code change.
The remainder of this post shows how to use the library for general-purpose processing. Follow-up posts will demonstrate more advanced cases, such as PyTorch distributed training.
- Python 3.6+
- An AWS account + region and credentials configured for boto3, as explained in the Boto3 docs
pip install simple-sagemaker
Running a shell command
Now, to get the shell command `cat /proc/cpuinfo && nvidia-smi` to run on a single ml.p3.2xlarge spot instance, just run the following `ssm` command (documentation of the `ssm` CLI is given below):
ssm shell -p ssm-ex -t ex1 -o ./out1 --it ml.p3.2xlarge --cmd_line "cat /proc/cpuinfo && nvidia-smi"
Once the job is completed (a few minutes), the output logs get downloaded to `./out1`.
As you may be able to guess, with this single command line, you get:
- A pre-built image for running the code is chosen (SageMaker's PyTorch framework image is the default).
- An IAM role (with the default name `SageMakerIAMRole`) with the AmazonSageMakerFullAccess policy is automatically created for running the task.
- An ml.p3.2xlarge spot instance is launched for you; you pay just for the time you use it! Note: this is a bit more than the execution time, as it includes the time to initialize (e.g. download the image, code, and input) and tear down (save the outputs), etc.
- The shell command gets executed on the instance.
- The shell command's exit code is 0, so it's considered to have completed successfully.
- The output logs get saved to CloudWatch, then downloaded from CloudWatch to the local output folder (`./out1`).
Pretty cool isn’t it?
Distributing Python code
Similarly, to run the following `ssm_ex2.py` script on two ml.p3.2xlarge spot instances:
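The original post embeds the script here as a gist, which isn't reproduced; the following is a minimal sketch of what such a distributed worker script might look like. The script's exact content is an assumption; the environment variables are the standard ones SageMaker sets for each instance.

```python
# Hypothetical sketch of ssm_ex2.py: the real script from the original
# post is not reproduced, so treat this content as illustrative only.
import os


def main():
    # SageMaker exposes the current host name and the full host list via
    # environment variables; printing them shows which of the two
    # launched instances this copy of the script runs on.
    current_host = os.environ.get("SM_CURRENT_HOST", "local")
    hosts = os.environ.get("SM_HOSTS", "[]")
    msg = f"Hello from {current_host}, all hosts: {hosts}"
    print(msg)
    return msg


if __name__ == "__main__":
    main()
```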
Just run the following:
ssm run -p ssm-ex -t ex2 -e ssm_ex2.py -o ./out2 --it ml.p3.2xlarge --ic 2
The output is saved to `./out2`.
As you may have guessed, here you also get the following:
- The local Python script is copied to a dedicated path on S3 (`[Bucket name]/[Project name]/[Task name]/[Job Name]/source/sourcedir.tar.gz`, more details here).
- Two spot instances get launched with the code from the S3 bucket.
A fully* featured advanced example
And now for an advanced and fully (well, almost 🙂) featured version, yet simple to implement. Note: as we're customizing the image, a Docker engine is needed.
The example is composed of two parts, each of them demonstrates a few features. In addition, the two parts are “chained”, meaning that part of the output of the first one is an input to the second one.
In order to exemplify most of the features, the following directory structure is used:
.
|-- code
|   |-- internal_dependency
|   |   `-- lib2.py
|   |-- requirements.txt
|   `-- ssm_ex3_worker.py
|-- data
|   |-- sample_data1.txt
|   `-- sample_data2.txt
`-- external_dependency
1. code — the source code folder
 - internal_dependency — a dependency that is part of the source code folder
 - requirements.txt — a pip requirements file listing the packages to install before running the worker
2. data — input data files
3. external_dependency — an additional code dependency
We’re going to use to tasks, the first gets two input channels
This task gets two input channels:
- Data channel — A local path on
./datathat is distributed among the two instances (due to
- persons channel — A public path on S3
The following is demonstrated:
- Names the project (`-p`) and the task (`-t`).
- Uses a local data folder as input, distributed among the instances (`ShardedByS3Key`). The folder is first synchronized into a directory on S3 dedicated to that specific task (according to the project/task names), and then fed to the worker. If you run the same task again, there's no need to upload the entire data set, just to sync it again.
- Uses a public S3 bucket as an additional input (`--is`). A permission is automatically added to the IAM role's policy to allow that access.
- Builds a custom Docker image (`--aws_repo_name`) so that the `sklearn` library will be available to the worker. The base image is chosen automatically (the PyTorch framework image is the default). The image is first built locally, then uploaded to ECR to be used by the running instances.
- Passes the `task_type` hyperparameter. Any extra parameter is considered a hyperparameter.
- Two instances (`--ic`) are launched.
- Clears the current state before running (`--cs`), to make sure we run the task again (as part of this example).
- Uses an on-demand instance.
- Uses `requirements.txt` — as it's part of the source code folder (`-e`), it's installed automatically before running the worker.
- The `internal_dependency` folder is copied as part of the source code folder.
The worker code:
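The worker gist from the original post isn't reproduced here. The following is a minimal sketch of the first worker's processing step, with the directories as plain parameters; in the real worker these paths come from the `WorkerConfig` object described below (e.g. `worker_config.channel_data`), and the "processing" itself is a stand-in.

```python
# Hypothetical sketch of the first worker's logic: "process" each input
# file into the model output folder and write an additional file to the
# output data folder. Directory arguments stand in for the paths that
# simple-sagemaker's WorkerConfig provides in the real worker.
import shutil
from pathlib import Path


def process(data_dir: str, model_dir: str, output_data_dir: str) -> list:
    Path(model_dir).mkdir(parents=True, exist_ok=True)
    Path(output_data_dir).mkdir(parents=True, exist_ok=True)

    processed = []
    for src in sorted(Path(data_dir).glob("*")):
        # Stand-in for real processing: copy each input file to the
        # model output folder with a new suffix.
        dst = Path(model_dir) / (src.name + ".processed")
        shutil.copy(src, dst)
        processed.append(dst.name)

    # The additional file written to the output data folder
    (Path(output_data_dir) / "summary.txt").write_text(
        f"processed {len(processed)} files\n"
    )
    return processed
```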
The worker can access its configuration by using the `WorkerConfig` object, or the environment variables. For example:
- `worker_config.channel_data` — the input data
- `worker_config.channel_persons` — the data from the public S3 bucket
- `worker_config.instance_state` — the instance state, maintained between executions of the same task
In this case, the worker “processes” the files from the input channel data into the model output folder, and writes an additional file to the output data folder.
The complete configuration documentation can be found here.
The second task gets two input channels as well:
- ex3_1_model channel — the model output from the first task
- ex3_1_state channel — the state of the first task
The following additional features are demonstrated:
- Chaining — using outputs from part 1 (`--iit`) as inputs for this part. Both the model output and the state are taken.
- Uses additional local code dependencies.
- Uses the TensorFlow framework's pre-built image.
- Tags the jobs.
- Defines a SageMaker metric.
The code can access its input data channels using the corresponding `worker_config.channel_[channel name]` attributes.
In addition, the score logs get captured by the `"Score=(.*?);"` regular expression given in the `ssm` command above, and the metric graphs can then be viewed on the AWS console.
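To see how such a metric definition works, the snippet below checks the same regular expression against a sample log line (the log line itself is made up for illustration): SageMaker scans the job's logs with the regex and charts the captured group as the metric value.

```python
import re

# The same regular expression passed to the ssm command
METRIC_REGEX = r"Score=(.*?);"

# Made-up sample log line a worker might print
log_line = "epoch 3: Score=0.93; loss=0.21"

match = re.search(METRIC_REGEX, log_line)
value = float(match.group(1))  # the captured metric value
print(value)  # 0.93
```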
The complete code
We can put the two workers in a single file, using the `task_type` hyperparameter to distinguish between the two types of execution:
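The combined file from the original post isn't reproduced here; a minimal sketch of such a dispatch, assuming the hyperparameter arrives as a command-line argument (as in SageMaker script mode), could look like the following. The worker bodies are placeholders.

```python
# Hypothetical sketch: dispatch between the two workers based on the
# task_type hyperparameter, which arrives as a CLI argument.
import argparse


def worker1():
    return "ran task 1"  # placeholder for the first task's logic


def worker2():
    return "ran task 2"  # placeholder for the second task's logic


def main(argv=None):
    parser = argparse.ArgumentParser()
    # In a real job the hyperparameter is always passed; the default
    # here just keeps the sketch runnable on its own.
    parser.add_argument("--task_type", type=int, default=1)
    args, _ = parser.parse_known_args(argv)

    # Route to the right worker according to the hyperparameter
    workers = {1: worker1, 2: worker2}
    return workers[args.task_type]()


if __name__ == "__main__":
    print(main())
```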
I hope I managed to convey the simplicity message, and to convince you to try simple-sagemaker next time you need stronger hardware for your processing script.
Let me know if you liked it by applauding below / starring the GitHub project.