Running your processing job on the cloud easily and cheaply

Simple-Sagemaker to the rescue

Simple-sagemaker lets you take your existing code, as is, and run it on the cloud with few or no code changes.

The remainder of this post shows how to use the library for general-purpose processing. Follow-up posts will demonstrate more advanced cases, such as distributed PyTorch training.

More comprehensive documentation, along with additional examples, can be found on the GitHub project.

Requirements

  1. Python 3.6+
  2. An AWS account + region and credentials configured for boto3, as explained in the Boto3 docs (a quick check is sketched below)
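
A quick way to verify that boto3 indeed finds a region and credentials (plain boto3, nothing specific to simple-sagemaker):

import boto3

# Confirm boto3 picks up a region and credentials from your environment
# (~/.aws/config, ~/.aws/credentials, or the AWS_* environment variables)
session = boto3.Session()
print("Region:", session.region_name)
print("Credentials configured:", session.get_credentials() is not None)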

Installation

pip install simple-sagemaker

Running a shell command

Now, to get the shell command cat /proc/cpuinfo && nvidia-smi run on a single ml.p3.2xlarge spot instance, just run the following ssm command (documentation of the ssm CLI is given below):

ssm shell -p ssm-ex -t ex1 -o ./out1 --it ml.p3.2xlarge --cmd_line "cat /proc/cpuinfo && nvidia-smi"

Once the job completes (a few minutes), the output logs get downloaded to ./out1:

As you may be able to guess, with this single command line, you got:

  1. A pre-built image to use for running the code is chosen (Sagemaker’s PyTorch framework image is the default).
  2. An IAM role (with the default name SageMakerIAMRole) with AmazonSageMakerFullAccess policy is automatically created for running the task.
  3. An ml.p3.2xlarge spot instance is launched for you; you pay only for the time you use it!
    Note: this is a bit more than the execution time, as it includes the time to initialize (e.g. downloading the image, code and input data) and tear down (saving the outputs), etc.
  4. The shell command gets executed on the instance.
  5. The shell command's exit code is 0, so it's considered to have completed successfully.
  6. The output logs get saved to CloudWatch, then downloaded from CloudWatch to the ./out1 folder.

Pretty cool, isn't it?

Distributing Python code

Similarly, to run the following ssm_ex2.py on two ml.p3.2xlarge spot instances:
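
For illustration, a minimal ssm_ex2.py could be as simple as the script below, which just prints where it runs and which GPUs it sees (torch is assumed to be importable, since Sagemaker's PyTorch framework image is the default):

# ssm_ex2.py - a minimal stand-in worker script (illustrative sketch)
import socket

import torch  # available in the default PyTorch framework image

print("Running on host:", socket.gethostname())
print("Visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  Device {i}:", torch.cuda.get_device_name(i))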

Just run the ssm command below:

ssm run -p ssm-ex -t ex2 -e ssm_ex2.py -o ./out2 --it ml.p3.2xlarge --ic 2

The output is saved to ./out2:

As you may have guessed, here you also get the following:

  1. The local python script is copied to a dedicated path ([Bucket name]/[Project name]/[Task name]/[Job Name]/source/sourcedir.tar.gz) on S3 (more details here)
  2. Two spot instances get launched with the code from the S3 bucket.

A fully* featured advanced example

And now for an advanced and fully (well, almost 🙂) featured version, which is still simple to implement. Note: as we're customizing the image, a Docker engine is needed.

The example is composed of two parts, each demonstrating a few features. In addition, the two parts are “chained”: part of the output of the first one is an input to the second one.

In order to exemplify most of the features, the following directory structure is used:

.
|-- code
|   |-- internal_dependency
|   |   `-- lib2.py
|   |-- requirements.txt
|   `-- ssm_ex3_worker.py
|-- data
|   |-- sample_data1.txt
|   `-- sample_data2.txt
`-- external_dependency
    `-- lib1.py

  1. code — the source code folder
  • internal_dependency — a dependency that is part of the source code folder
  • requirements.txt — a pip requirements file listing the packages that need to be installed before running the worker, in this case:
    transformers==3.0.2

2. data — input data files

3. external_dependency — additional code dependency

We're going to use two tasks.

First task

This task gets two input channels:

  1. data channel — the local ./data folder, distributed among the two instances (due to ShardedByS3Key)
  2. persons channel — a public path on S3

The following is demonstrated:

  1. Names the project (-p) and task (-t).
  2. Uses the local data folder as input, distributed among the instances (--i, ShardedByS3Key). That folder is first synchronized into a directory on S3 dedicated to this specific task (according to the project/task names), and then fed to the worker. If you run the same task again, there's no need to upload the entire data set, only to sync it.
  3. Uses a public S3 bucket as an additional input (--is). A policy allowing that access is automatically added to the IAM role being used.
  4. Builds a custom docker image (--df, --repo_name, --aws_repo_name), so the pandas and sklearn libraries are available to the worker. The base image is chosen automatically (the PyTorch framework image is the default). The image is first built locally, then uploaded to ECR to be used by the running instances.
  5. Passes the task_type hyperparameter. Any extra parameter is considered a hyperparameter.
  6. Launches two instances (--ic).
  7. Clears the current state before running (--cs), to make sure the task runs again (as part of this example).
  8. Uses an on-demand instance (--no_spot).
  9. Uses requirements.txt — as it's part of the source code folder (-e), it gets installed automatically before running the worker.
  10. The internal_dependency folder is copied as part of the source code folder.

The worker code:
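
As a rough, illustrative sketch (the worker_toolkit import and the model_dir / output_data_dir fields are assumptions; the documented fields are listed right after the code), the first task's worker could look something like this:

# ssm_ex3_worker.py, first task - illustrative sketch
import os
import shutil

# Assumption: the library's worker-side helper module, exposing WorkerConfig
from worker_toolkit import worker_lib


def worker1(worker_config):
    # "Process" each file from the (sharded) data channel into the model output folder
    for file_name in os.listdir(worker_config.channel_data):
        shutil.copy(
            os.path.join(worker_config.channel_data, file_name),
            os.path.join(worker_config.model_dir, file_name + ".processed"),  # model_dir is assumed
        )
    # Write an additional file to the output data folder
    with open(os.path.join(worker_config.output_data_dir, "result.txt"), "w") as f:  # output_data_dir is assumed
        f.write("Done")


if __name__ == "__main__":
    worker1(worker_lib.WorkerConfig())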

The worker can access its configuration by using the WorkerConfig object or the environment variables. For example:

  • worker_config.channel_data — the input data
  • worker_config.channel_persons — the data from the public S3 bucket
  • worker_config.instance_state — the instance state, maintained between executions of the same task
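
Alternatively, the same locations are exposed through SageMaker's standard SM_* environment variables (relying on the standard naming is an assumption here, since the worker runs inside a regular Sagemaker framework container):

import os

# Standard SageMaker training environment variables
data_dir = os.environ["SM_CHANNEL_DATA"]        # the `data` input channel
persons_dir = os.environ["SM_CHANNEL_PERSONS"]  # the `persons` input channel
model_dir = os.environ["SM_MODEL_DIR"]          # the model output folder
output_dir = os.environ["SM_OUTPUT_DATA_DIR"]   # the output data folder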

In this case, the worker “processes” the files from the data input channel into the model output folder, and writes an additional file to the output data folder.

The complete configuration documentation can be found here.

Second task

The second task gets two input channels as well:

  1. ex3_1_model channel — the model output from the first task
  2. ex3_1_state channel — the state of the first task

The following additional features are demonstrated:

  1. Chaining — using the outputs of part 1 (--iit) as inputs for this part. Both the model output and the state are taken.
  2. Uses additional local code dependencies (-d).
  3. Uses the TensorFlow framework's pre-built image (-f).
  4. Tags the jobs (--tag).
  5. Defines a Sagemaker metric (-m, --md).

The code can access its input data channels using worker_config.channel_ex3_1_model and worker_config.channel_ex3_1_state.
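
As an illustrative sketch (with the same worker_toolkit and field-name assumptions as above), the second task's worker could consume the chained channels and print score log lines like this:

# ssm_ex3_worker.py, second task - illustrative sketch
import os

from worker_toolkit import worker_lib  # assumed worker-side helper


def worker2(worker_config):
    # The chained channels point at the first task's model output and state
    model_files = os.listdir(worker_config.channel_ex3_1_model)
    state_files = os.listdir(worker_config.channel_ex3_1_state)
    print(f"Got {len(model_files)} model files and {len(state_files)} state files from task 1")

    # Emit log lines that match the "Score=(.*?);" metric regular expression
    for step, _ in enumerate(model_files, start=1):
        print(f"Score={step * 10};")


if __name__ == "__main__":
    worker2(worker_lib.WorkerConfig())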

In addition, the score logs get captured by the "Score=(.*?);" regular expression in the ssm command above, and the metric graphs can then be viewed on the AWS console.

Summary

We saw how a single ssm command can run shell commands and Python scripts on the cloud, and how the same tool scales up to a fully featured pipeline: custom images, multiple instances, sharded input channels, chained tasks, custom dependencies and metrics.

The complete code

We can put the two workers in a single file, using the task_type hyperparameter to distinguish between the two types of execution:
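
A possible skeleton for that single file (treating the hps dictionary access as an assumption about WorkerConfig, with worker1/worker2 standing for the two sketches above):

# ssm_ex3_worker.py - one entry point for both tasks (illustrative skeleton)
from worker_toolkit import worker_lib  # assumed worker-side helper


def worker1(worker_config):
    ...  # first task logic (see the first sketch above)


def worker2(worker_config):
    ...  # second task logic (see the second sketch above)


if __name__ == "__main__":
    worker_config = worker_lib.WorkerConfig()
    # Assumption: hyperparameters are exposed as a dict-like `hps` attribute
    if int(worker_config.hps["task_type"]) == 1:
        worker1(worker_config)
    else:
        worker2(worker_config)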

Conclusions

I hope I managed to convey the simplicity message, and to convince you to try simple-sagemaker the next time you need stronger hardware for your processing script.

Let me know if you liked it by applauding below / starring the GitHub project.