Getting Started with Deep Learning as a Service (DLaaS) on the IBM Cloud

Source: Deep Learning on Medium

Author: Utkarsh Desai

Co-Author: Naveen Panwar

A typical machine learning pipeline would generally consist of the following stages:

  1. Data Collection, Pre-processing and Feature Extraction
  2. Training a Machine Learning model
  3. Deploying the model
  4. Invoking the model (for inference)

With the advent of deep learning, the first stage either disappears or is reduced to minimal data cleaning, since feature extraction is learned by the model itself. A deep learning model pipeline focuses more on training the model using different configurations of architectures and hyper-parameters, until a desirable performance is achieved. Model deployment usually comes later and is largely ignored or left for the ‘deployment’ team to handle.

Data scientists typically want to try out a large number of models quickly, using their favourite deep learning framework (Tensorflow, Torch, Caffe), and in order to get the best accuracy, these models are trained on huge amounts of data. On the other hand, business owners want to roll out new product features using cutting-edge AI algorithms. Neither of these parties, however, wants to manage clusters of machines, support multiple frameworks and handle problems such as distributed computing, load balancing, security, failures, etc.

The Deep Learning as a Service (DLaaS) offering by IBM provides a way to make the process of training and using deep learning models several times easier. Leveraging rapid innovations in hardware and across the full software and systems stack, DLaaS provides a serverless user experience and combines the flexibility, ease-of-use, and economics of a cloud service with the compute power of deep learning. With easy to use REST APIs, one can quickly train deep learning models with varying amounts of resources as per their requirements, or budget.

Not only does DLaaS support a huge variety of common Deep Learning frameworks, several tools and services are also available that make it possible to easily train and deploy deep learning models via high performance distribution over powerful accelerator hardware.

This article does not aim to dive deep into the design and architecture of DLaaS; rather, it provides a quick-start guide to training and using deep learning models on the cloud.

A Simple DLaaS Flow

A typical flow of operations using the DLaaS framework is as follows:

  1. Create your Deep Learning model using one of the supported frameworks (Caffe, Tensorflow, Torch)
  2. Upload training data to Cloud Object Storage
  3. Specify model metadata, compute configuration and training data location
  4. Start training
  5. Monitor training and fetch logs
  6. Obtain the trained model, deploy and use for inference

All of the above steps can be performed via a simple command line interface (some can be done via the browser as well) which we will see in the following sections. For simplicity, we will use the IBM Bluemix web interface for some operations in order to explain them better.

We will now work through a simple scenario of training a model on DLaaS. As an example, we will build a simple classifier for the MNIST dataset. To enable this, we will setup our environment, upload the data to be used for training, create a simple Convolutional Neural Network model and submit a job for training this network. Once this is done, we will monitor the status of the job and when the model is trained, we will deploy it for use.

1. Setting up the Environment

To get started, we need to set up our environment and install some packages. But before that, we need an IBM Bluemix account. Go to the IBM Bluemix site and create an account.

Next, install the ibmcloud command line interface:

curl -sL | bash

Verify everything went smoothly:

ibmcloud dev help

You should see a list of available commands. Then, install the Machine Learning plugin:

ibmcloud plugin install machine-learning -r Bluemix

Finally, login to IBM CLOUD

ibmcloud login

2. Creating an ML Instance

Next, we need to create a Machine Learning instance on Bluemix that will be responsible for running and managing our machine learning jobs on the cloud. We will be using the browser to create the ML instance, although we can do the same thing via the command line as well.

Once we have created an ML instance, move over to the Service Credentials tab and create a new set of Credentials. Fetch the credentials by clicking ‘View Credentials’ and set them in environment variables:

export ML_INSTANCE=your_instance_id
export ML_USERNAME=your_username
export ML_PASSWORD=your_password
export ML_ENV=

Introduction to IBM Cloud Object Storage

Now that we have created an ML environment for our jobs to run, we need to get our training data on the cloud. For this purpose, we will be using IBM’s Cloud Object Storage (COS), which allows upload and download of data on the cloud, along with several other features.

COS allows storage of data into buckets, which can be treated as individual storage partitions on the cloud, within our cloud storage instance. There is no limit on the number of buckets that can be created, only on the amount of data that can be stored in them and uploaded/downloaded, governed by your plan. Your Deep Learning code will read and write to these buckets. For our example, we will assume the training data along with the class labels is available in a pickle file, mnist.pkl, locally.
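For illustration, here is how such a pickle file might be created locally. This is a minimal sketch: the placeholder lists below stand in for the real MNIST images (28x28 pixels, flattened to 784 values) and labels.

```python
import pickle

# Placeholder data standing in for the real MNIST arrays:
# flattened 28x28 images plus integer class labels.
trainx = [[0.0] * 784 for _ in range(10)]   # training images
trainy = [i % 10 for i in range(10)]        # training labels
validx = [[0.0] * 784 for _ in range(2)]    # validation images
validy = [0, 1]                             # validation labels

# Write all four arrays as a single tuple, in the order the
# training script will later unpack them.
with open('mnist.pkl', 'wb') as f:
    pickle.dump((trainx, trainy, validx, validy), f)
```

The training script shown later unpacks the same four-element tuple, so the order here must match.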

Setting up IBM Cloud Object Storage:

  1. Go to and create a resource of type Object Storage
  2. Give it a name and select a Plan to start with
  3. Create service credentials to access the service with parameter {“HMAC”:true} in “Inline Configuration Parameters”

  4. Once created, make a note of the COS credentials by clicking on ‘View Credentials’. This is very important.
  5. To actually store the data, we need to create buckets. We will create two: one for our model’s input data and another for its outputs. Bucket names have to be unique across users, so a naming convention such as UsernameProjectnameTrainingData and UsernameProjectnameResultData works well. For our example, we create naveenmnisttrainingdata and naveenmnistresultdata.
  6. Upload mnist.pkl to the naveenmnisttrainingdata bucket.

3. Writing DLaaS compatible code

Before writing any code, if you want to be sure your favourite Deep Learning framework is available on the cloud, you can run the following command and get a list of available frameworks:

ibmcloud ml list frameworks

Or you can also check the list of supported frameworks here. You can have any deep learning code execute on the cloud as long as you take care of one specific requirement: the executor on the cloud uses two environment variables to determine where to read the training data and any other files from, and where to write intermediate results, system logs and the saved model.

During local development however, you need not have these system variables defined. Hence, it is recommended to use the following convention when writing code for DLaaS.

import os

data_dir = os.environ.get("DATA_DIR", "./my_local_data_dir")
result_dir = os.environ.get("RESULT_DIR", "./my_result_data_dir")

Using the data_dir and result_dir variables in the rest of the code for input and output directories will make sure your code can run seamlessly on the cloud as well as during local development.

Remember we uploaded the mnist.pkl file earlier? Here is how you can read it in your code after the above 2 lines:

import os
import pickle

# Pickle files must be opened in binary mode.
with open(os.path.join(data_dir, 'mnist.pkl'), 'rb') as f:
    trainx, trainy, validx, validy = pickle.load(f)

After this, you can create your next state-of-the-art deep learning model (or a simple CNN for that matter) as usual. If you wish to save some output, or save your model, you can do it in a similar fashion using the path in the result_dir variable, which eventually gets saved to COS.
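For instance, a training script might persist its metrics under result_dir like this (a sketch only; the file name and metric values are made up for illustration):

```python
import json
import os

# Same convention as before: fall back to a local directory
# when running outside the cloud.
result_dir = os.environ.get("RESULT_DIR", "./my_result_data_dir")
os.makedirs(result_dir, exist_ok=True)

# Anything written under result_dir eventually lands in the
# results bucket on COS.
metrics = {"epochs": 10, "val_accuracy": 0.97}  # illustrative values
with open(os.path.join(result_dir, "metrics.json"), "w") as f:
    json.dump(metrics, f)
```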

Now, you may ask — But I had uploaded the mnist.pkl into the bucket naveenmnisttrainingdata. How can it read the file without me specifying which bucket to look in? And how can I be sure the output is written to my result bucket, naveenmnistresultdata?

This is where we move on to the next step.

4. Creating the training definition/specification YAML

Before we can submit our training job, we need to do just one more thing. We need to tell the executor on the cloud what script to run, what version of Python we would like to use, where to read the training data from and where to write the results. This information will also be used to initialize the environment variables we used in our code in the previous step.

We use a specification file in the YAML format to define our training job. You can get a sample YAML by running the following command:

ibmcloud ml generate-manifest training-definitions

Now, we need to edit this sample file and fill in information corresponding to the job we want to submit. Following are some of the important properties that should be modified in the sample file:

  1. Model name and Description
  2. Frameworks and Runtimes
  3. Command to execute
  4. Compute Configuration
  5. Source data reference (COS Bucket name and Credentials) — For Input
  6. Target data reference (COS Bucket name and Credentials) — For Output
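Filled in for our MNIST example, the manifest might look roughly like the following. This is a sketch only: the exact keys, framework version and compute configuration name should be taken from the generated sample, and the COS endpoint and HMAC credentials are placeholders for your own values.

```yaml
name: mnist-cnn
description: Simple CNN trained on MNIST
framework:
  name: tensorflow
  version: "1.5"
execution:
  command: python3 train_mnist.py
  compute_configuration:
    name: k80
training_data_reference:
  name: training_data_ref
  connection:
    endpoint_url: <COS endpoint URL>
    access_key_id: <HMAC access key>
    secret_access_key: <HMAC secret key>
  source:
    bucket: naveenmnisttrainingdata
  type: s3
training_results_reference:
  name: training_results_ref
  connection:
    endpoint_url: <COS endpoint URL>
    access_key_id: <HMAC access key>
    secret_access_key: <HMAC secret key>
  target:
    bucket: naveenmnistresultdata
  type: s3
```

The two data references are what connect the DATA_DIR and RESULT_DIR variables in the code to the input and output buckets we created earlier.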

5. Submitting the Job

With the Python scripts and YAML specification, we are now ready to submit our code for execution on the cloud. We simply need to zip all of our code and invoke a console command specifying the zip file and the YAML configuration file to be used.

zip model.zip ./*
ibmcloud ml train model.zip dlaas_train.yml

The output of this command will be a Model ID which is a unique identifier for our job on the cloud. Make a note of the Model ID because this is what we will use to refer to and query our job.

Starting to train …
Model-ID is ‘model-asf123asd’

6. Monitoring the Job

We can find the list of all the training jobs we have submitted with the following command:

ibmcloud ml list training-definitions

To fetch the status of a particular job, we can use the Model ID obtained after submission in the previous step:

ibmcloud ml show training-runs model-asf123asd

The output might look something like this:

********************************************************************
Starting to fetch ‘status messages and metrics’ for model-id ‘model-asf123asd’
********************************************************************
[ — LOGS] training-NVo51RQig:
[ — LOGS] training-NVo51RQig: Training with training/test data at:

You can monitor the status of the job until it reports Completed. The results bucket you specified in the YAML file will then contain the execution info, training logs, the trained model (if you saved it, for example as a Keras .h5 file) and any other output you may have generated; the saved model may then be deployed, downloaded or used in another job.

7. Storing the model

Once the training is complete, we can store the model into a WML repository:

ibmcloud ml store training-runs model-asf123asd

The output might be something like this:

Starting to store the training-run ‘model-asf123asd’…
Model store successful. Model-ID is ‘asf123asd-1234-1234-1234-asf123asd’

Note the Model ID of the stored model.

8. Deploying model

Most projects typically end at this point; how the trained model is used from here varies.

Now that the model is saved, you can deploy it for use with the following command:

ibmcloud ml deploy <model-id>

where model-id is what was output while storing the model. The output contains some important information:

Deploying the model with MODEL-ID ‘asf123asd-1234-1234-1234-asf123asd’..
DeploymentId 123asd123-adf-1234-asdf-3123asd123
Scoring endpoint

You can use the Deployment Id when trying to score via the ibmcloud CLI or the scoring endpoint otherwise.

9. Invoking the model for Scoring/Inference

Given a new input data point/image/sample, in order to obtain the prediction/score of our Deep Learning model on the input, we first need to create a payload JSON:

"modelId": 'asf123asd-1234-1234-1234-asf123asd'.,
"deploymentId": "123asd123-adf-1234-asdf-3123asd123",
"payload": {
"values": [
<mnist image as numpy array>

To score via the CLI, run the following:

ibmcloud ml score scoring_payload.json
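The payload file can also be assembled programmatically. A minimal sketch follows; the IDs are the sample values from above, and the zero vector stands in for a real flattened 28x28 MNIST image.

```python
import json

# Sample IDs from the steps above; replace with your own.
payload = {
    "modelId": "asf123asd-1234-1234-1234-asf123asd",
    "deploymentId": "123asd123-adf-1234-asdf-3123asd123",
    "payload": {
        # One flattened 28x28 image; all zeros here as a stand-in.
        "values": [[0.0] * 784],
    },
}

with open("scoring_payload.json", "w") as f:
    json.dump(payload, f)
```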

For a Keras model, the output might look like this:

Fetching scoring results for the deployment ‘123asd123-adf-1234-asdf-3123asd123’ …
{'fields': ['prediction', 'prediction_classes', 'probability'],
 'values': [[[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1], 6,
             [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]]]}

If you need more information about deploying the model and using it for scoring new data, look at the official docs:


IBM Deep Learning as a Service provides a convenient tool for developers and business users alike to build and use deep learning models, without worrying about environment configurations, cluster and resource management or tools and packages. All you need to do is write your code and provide the data it uses and let the cloud environment handle the execution.