Running Experiments with Azure ML, Batch AI and VS Code — Tools for AI



During the last two weeks I spent some time getting a general setup going for running machine learning experiments using Azure ML, Batch AI clusters and the VS Code plugin Tools for AI. Since this was at times a rather unnerving experience, and since significant bits are missing from the relevant documentation (at least as far as I can see), I want to share some of the things I learned so that other people do not have to go through the same process.


First things first: some general information about the three components working together here. Note that this is only a rough overview and not meant as an exhaustive list of features.

Azure ML is Azure’s machine learning framework and offers a broad set of functionalities covering the whole machine learning workflow from experiment to deployment. In particular, it offers an automated way to keep track of all of your trial runs together with the relevant metrics in a nice fashion.

Batch AI is another Azure feature that provides on-demand remote compute power for your experiments. In particular this means you can send a training script to a Batch AI cluster which will then automatically start a Docker container equipped with specified hardware, mount a filestore, run your script, log all the relevant metrics, store the finished model on the filestore, and then terminate the container. Thus you only pay for the compute you actually need.

Finally Tools for AI is a VS Code plugin that allows you to manage your ML experiments from the comfort of your familiar IDE.

While all of this sounds very nice and useful (and it absolutely is) it is not exactly trivial to actually get your experiments running. For that reason here are some tips on how to set everything up for running Azure ML experiments using a Batch AI cluster and Tools for AI.

Pretty much all of what I am going to describe I found out by trying to fix the various problems I ran into when I tried to get my own experiments running. Some parts of it can be found in the Azure ML documentation. However, in my opinion some significant bits and pieces are missing so I will try to fill them in here. If there is an easier, more robust, or more Pythonic way to do these things, please let me know; I am always happy to enhance my workflow.

Finally, I am going to assume that whoever reads this has managed to set up an Azure ML workspace and has installed the VS Code plugin as well as the corresponding Python environment.


Creating and running experiments

Actually creating and running Azure ML experiments from the plugin is pretty straightforward. Select the Azure plugin in VS Code, under Machine Learning select your workspace, right-click on Experiments and simply follow the instructions.

If you want to run a script as part of your experiment, open the folder containing your script (say train.py) in VS Code and use the Azure plugin to attach the folder to the experiment. If you do this the plugin automatically creates a bunch of helper files (described below) and once you have edited these appropriately starting a run of your experiment is as easy as selecting the corresponding runconfig, right-clicking and selecting Run Experiment.

The plugin then opens a separate tab where you can find the log files of the current run as well as plots of any metrics you logged, updated in real time. Sadly, you have to click on a log file each time you want to see whether something new was logged, as there is currently no functionality to stream the information as it comes in (this would be a nice feature for a future update).

Docker

Running experiments on a Batch AI cluster means you will have to use Docker. Since it significantly cuts down the time each run of your experiment takes, I highly recommend preparing a Docker image in which you pre-install most, if not all, of the tools and packages you usually need. You can then upload the image to the container registry that comes with your Azure ML workspace. In this case the short description you can find in the Azure portal is actually very helpful.

If you just use a standard base image from Docker Hub instead, you are pretty much forced to install the full conda environment you need for your experiment each time you run it, and since some Python packages are not exactly small (PyTorch, for example, comes to about 500 MB) this might take some time.

Compute Targets

Compute targets are the various (usually) remote resources on which you can run your experiments. Here I want to focus on Batch AI clusters.

Adding a Batch AI cluster as a remote compute

From my experience, here is the best way to do this:

  1. Create a Batch AI workspace (either through the Azure CLI or through the portal).
  2. Within this workspace create a cluster with the desired properties and remember to provide either an SSH key or an admin password.
  3. Add this cluster as a remote compute to the Azure ML workspace using the Azure portal.

In principle it is possible to create a new Batch AI cluster directly from the VS Code plugin or from the ML workspace in the portal and add it as a remote compute in one step. While this is very convenient, there is a significant caveat: there is no way to add an SSH key or admin password to a cluster created this way, and consequently you are barred from later connecting to the worker nodes to (e.g.) check on GPU usage.

Other compute targets

Just a few words on compute targets other than Batch AI clusters. You can add existing Azure VMs as compute targets (note that in this case you need to take care of mounting your data yourself, since the method described below is not applicable) or you can run your experiments locally (optionally using Docker). The latter is particularly useful for debugging your training script, or if you have strong on-premises hardware and simply want to take advantage of the nice logging functionalities of Azure ML.

The runconfig file(s)

The runconfig file is (other than your actual training script) the key component of running an experiment. It is also the piece of this whole process that took me by far the longest to get a grasp of.

When you first attach the folder containing your training script to an experiment, the plugin creates a folder called aml_config for you containing a conda environment file, a project.json file (which you can safely ignore), and two sample runconfigs for running your experiment on your local machine (with or without Docker). The runconfigs follow the YAML format, and I will spend most of the rest of this post describing the different keys and how their values work.

script

This is simply the path to the training script you want to run (relative to the folder you attached to the experiment). So for example:

script: train.py

arguments

This is a list of command line arguments that will be passed to your script at runtime and which you usually want to handle using an argument parser. Example:

arguments: ["--n_epochs=10", "--n_classes=3", "--verbose"]

target

This is the compute target on which you want to run your experiment (you probably want at least one runconfig per target), referenced by the name you provided when you added the target to your workspace. Example:

target: my_cluster

framework

I will only describe how to run plain Python scripts here (other possibilities include, for example, PySpark), so put

framework: Python

environment: environmentVariables

Here you can provide environment variables to be set on your compute target before anything else happens. In particular, you can reference these variables in other places in the runconfig, e.g. in the argument list. You might use this, for example, to specify a data path that occurs multiple times so that you only have to change it in one place when your setup evolves. Example:

environment:
  environmentVariables:
    MY_MODEL_DIRECTORY: /models/go/here/
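Since these variables are set in the environment of the compute target, your training script can also read them directly. A minimal sketch, assuming the MY_MODEL_DIRECTORY variable from the example above:

import os

# Read the directory configured in the runconfig, with a local fallback
# so the script also runs outside of Azure ML.
model_dir = os.environ.get("MY_MODEL_DIRECTORY", "./models/")
os.makedirs(model_dir, exist_ok=True)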

environment: python

This is used to specify the Python environment in which your experiment is run. You can either set userManagedDependencies to false (in which case you need to specify your environment using conda_dependencies.yml) or set userManagedDependencies to true, in which case you should point to the right environment via the interpreterPath key. If you use the conda base environment of your container it should look like this:

environment:
  python:
    userManagedDependencies: true
    interpreterPath: python

Unless you have only a few reasonably small dependencies, I highly recommend the latter. In principle this means that you are limited to the packages that come pre-installed in the Docker image you chose; since it is not exactly feasible to pre-install every package you might ever need, there is a (slightly hacky) workaround which I will describe later.

docker

Since we want to use a Batch AI cluster, set enabled to true, set shareVolumes to true and (if you need GPU support) set gpuSupport to true. Fill in all other keys as needed (this is much more straightforward than I expected). If you are using Batch AI and the container registry that came with your Azure ML workspace, there is no need to fill in any of the fields under baseImageRegistry. If you want to run your script in a container on an Azure VM you do have to provide this information; however, all of the keys correspond in an obvious fashion to information you can obtain from the Azure portal or through the Azure CLI.

Example:

docker:
  enabled: true
  shareVolumes: true
  gpuSupport: true
  baseImage: myregistry01234:deeplearning_image

Resource allocation for the container (in particular the number of CPU cores and the amount of memory) is done further down in the runconfig; I have no idea why it is not part of the docker subsection.

dataReferences

One of the most useful options and the one that took me the longest to get right. Here you have the possibility to mount data on the remote compute, download data to the remote compute (both of these happen before the script is run), or upload data from the remote compute (after the script is finished). I will only describe here how to do this with respect to the datastore that comes with your Azure ML workspace.

The value of dataReferences must be a JSON object, each value of which is again a JSON object describing a single data reference.

Mounting data from the datastore works as follows:

dataReferences: {
  "myMount": {
    "mode": "mount",
    "data_store_name": "workspacefilestore",
    "path_on_data_store": "data/lies/here"
  }
}

There is no way to influence the mount point on the remote compute; it is chosen automatically and stored in the environment variable $AZUREML_DATAREFERENCE_myMount. Note that for some reason the plugin automatically translates snake case to (lower) camel case, so if I had called the data reference “my_mount” instead of “myMount” the corresponding environment variable would still be $AZUREML_DATAREFERENCE_myMount.
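Inside the training script the mount point can then be picked up from that environment variable. A minimal sketch, assuming the myMount reference from the example above:

import os

# Azure ML stores the automatically chosen mount point in this variable.
data_dir = os.environ["AZUREML_DATAREFERENCE_myMount"]
print("Mounted files:", os.listdir(data_dir))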

Downloading data is currently partially broken: you can only download folders that are located at the base of the datastore. However, in principle it works like this:

dataReferences: {
  "myDownload": {
    "mode": "download",
    "data_store_name": "workspacefilestore",
    "path_on_data_store": "data",
    "path_on_compute": "/put/data/here"
  }
}

As you can see, in this case it is possible to influence where the data ends up on the remote compute (another place where you can reference your environment variables).

Apart from mounting and downloading data you can also upload data (say a fully trained model or a file containing some predictions) after your script is finished using the following syntax:

dataReferences: {
  "myUpload": {
    "mode": "upload",
    "data_store_name": "workspacefilestore",
    "path_on_data_store": "upload/data/here",
    "path_on_compute": "/finished/artifacts/",
    "overwrite": true
  }
}

The upload silently fails if the path on the datastore does not already exist, so make sure to set up the correct folder structure beforehand. “overwrite” defaults to false. If “path_on_compute” is a file, that file is uploaded to the datastore; if it is a folder, its contents are uploaded.
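In the training script this simply means writing your artifacts to the configured path_on_compute. A minimal sketch, assuming the myUpload reference above (the file name is just an illustration):

import os

# Everything written to this folder is uploaded to the datastore
# once the script has finished.
output_dir = "/finished/artifacts/"
os.makedirs(output_dir, exist_ok=True)

with open(os.path.join(output_dir, "predictions.csv"), "w") as f:
    f.write("id,prediction\n")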

Installing additional packages

As I mentioned earlier, I highly suggest you build a Docker image with your default Python packages pre-installed (possibly in a couple of different environments to compartmentalize your different needs). However, it might still be the case that you need to install some additional packages at runtime, be it because you forgot to install them beforehand and do not want to restart the whole Docker build, or because it simply was not possible to build the Docker image with them on your system (looking at you, nvidia-docker).

Since the runconfig only gives you the possibility to build a whole new environment (in which case all your existing packages are downloaded again), we need a minor hack to make this possible. What we are going to do is write a pip requirements file describing our additional dependencies (nicely enough, this also allows us to install a new version of our own packages if we have them available on our datastore) and put it at aml_config/additional_dependencies.txt. Now we put the following snippet at the very top of our script:

import subprocess
import sys

# Install the additional dependencies into the currently running
# interpreter so that they can be imported right afterwards.
subprocess.call([
    sys.executable,
    "-m",
    "pip",
    "install",
    "-r",
    "aml_config/additional_dependencies.txt",
])

This takes care of installing the packages you need in a way that you can immediately import and use them afterwards. I would absolutely agree that this is not exactly a great way of installing packages. However, after quite a bit of googling this is still the only way I could find that actually achieved what I wanted. If you know a better one, please let me know.

Logging

Logging metrics and other things during a run is one of the nicer features of Azure ML and is pretty well described in the documentation, so I only want to briefly mention some technicalities. To actually do any logging we need access to the current run of the experiment, which we get as follows (assuming we installed and imported the azureml package):

run = azureml.core.Run.get_context()

Now we simply call run.log(key, value) to log values. If we call this command multiple times with the same key (say for example we log the validation accuracy once per epoch), Azure ML automatically keeps a list of all the values and displays a plot of them (in the order in which they were logged).
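As an illustration, a typical training loop might log a metric once per epoch like this (the accuracy value is a placeholder; your actual training and evaluation code goes inside the loop):

from azureml.core import Run

run = Run.get_context()

n_epochs = 10
for epoch in range(n_epochs):
    # ... your actual training and validation code goes here ...
    val_acc = 0.5 + 0.01 * epoch   # placeholder value, just for illustration
    run.log("validation_accuracy", val_acc)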

If you want a plot with one metric on the x-axis and another on the y-axis you can call run.log_row(key, name1=metric1, name2=metric2). However, it is currently not possible to automatically get a plot of two metrics on the same axis (say training accuracy and validation accuracy both on the y-axis, displayed over time). But since you can log matplotlib plots using run.log_image(key, plot), you can build this functionality yourself if you really need it.
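A minimal sketch of that workaround, assuming you have collected per-epoch accuracies in two plain Python lists (the placeholder values only serve to make the snippet runnable):

import matplotlib
matplotlib.use("Agg")   # no display available on the remote compute
import matplotlib.pyplot as plt

train_accuracies = [0.6, 0.7, 0.8]   # placeholder values
val_accuracies = [0.55, 0.65, 0.7]   # placeholder values

plt.figure()
plt.plot(train_accuracies, label="training accuracy")
plt.plot(val_accuracies, label="validation accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()

# run obtained via azureml.core.Run.get_context() as above
run.log_image("accuracy_curves", plot=plt)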

Miscellaneous

Stopping runs

For some reason the VS Code plugin currently offers no way to stop an ongoing run of an experiment (definitely a must-have feature for the future). If you do want to stop a run, use the plugin (or the Azure CLI) to get the run ID, locate the run in the view of your Batch AI cluster in the Azure portal, and stop it there.

Resizing clusters

Batch AI can automatically resize your cluster to fit your current needs (within the limits you set). Generally you want the minimum number of nodes the cluster keeps running to be 0 so that you really only pay for the compute you use. However, while debugging your runconfig in particular, I very much suggest you keep the minimum number of nodes at 1 so that you do not waste time starting and stopping machines unnecessarily.

Conclusion

Azure ML, its VS Code plugin and Batch AI clusters are not exactly trivial to get up and running. However, once you get past the initial hassle, the workflow and the functionalities on offer are actually pretty neat, and I look forward to using them more extensively in the future. I hope this post helps at least some people avoid some of the issues I faced in the beginning.
