Distributed MXNet, Can It Be Scaled Linearly?

One of the main claims of MXNet is the speed! Especially in a distributed mode. So, I decided to do my own investigation in order to answer the question: can the MXNet scale up linearly in a distributed mode?


In order to be able to understand how exactly the measurements were done, one needs to be familiar with the material of my previous article: “MXNet Distributed Training Explained In Depth” or, at least, to have knowledge about how MXNet distributed training works.

Also, do not expect this to be an scientific research, since all this was done with limited budget of my personal wallet and therefore not verified with enough amount of executions to make the numbers statistically reliable. In other words, results obtained by measuring once without any information about the deviation, mean and other statistical data that can prove that the results are reliable. Though, it is kind of similar what almost anyone is doing this days, and this makes me sad 🙁 Anyway…

Testing Setup

  • AWS DeepLearning AMI (v 5.0) — based on Ubuntu 16.04;
  • MXnet 1.1.0 (cu90mkl) — comes pre-baked with the AMI;
  • two p3.16xlarge (8 Volta on each);
  • one p3.8xlarge (4 Volta);

The only reason to have third machine smaller — my budget limitation. Volta costs money! Overall I have tried to stay within 50$ budget for my weekend research (and I managed to do so 🙂 ).

What We Will Be Testing

MXNet provides an ImageNet training script that can be executed in a benchmark mode. This one. We are going to run it in the following way:

As can be seen:

  • we are training resnet-v1;
  • FP32 mode, since distributed training DOES NOT SUPPORT FP16!!!!!!!!!! (feature request already out);
  • Python 3.6 is used;

Everything in the command will be fixed for all executions, except 3 things:

  • GPU that we will be using to train — in the example case it is 8 of them;
  • Batch size for training — calculated as 120 * amount of GPUs; **there is no any magic here the number is just calculated based on empirical observations that allow us to maximize utilization of the Volta RAM.
  • KVstore — if you do not know what is KVStore, please read my other article first.


Our base measurement will be measurement of training on one machine. By increasing the amount of GPUs that we are using for training we can build a chart and prolongate the chart to the amount of GPUs beyond the capabilities of the instance. This will be our ideal case scenario.

Without further ado, here is the speed measurement with 1 to 8 Voltas:

In ideal world with unicorns, ponies, and butterflies we should see such a linear scale even if we are adding new machines to the cluster. Something like this:

There are several reason why in real life such scalability is not fully possible (which we will discuss later) but we still can test how close reality is to the fairy tale, right?

Distributed Training Mode

First of all, there are tons of setting in a distributed training mode. Several that are most impactfull:

  • how many servers do you have per worker?
  • what is the kvstore mode used, is it sync or async?
  • how many workers do you have?

We will run our measurements in different configurations. But before even go beyond one machine let’s answer to one particular question: how much slower the training will be, on one machine, if we run it in a training mode?

To be honest, I was surprised when I saw how close is the line to the base line. The thing is, now you have network. Yes everything stays on the localhost, but still, network is involved. Plus, if my memory is correct, 0.11 used to have ~10% speed drop in such setup.

Just compare the code that doing Push on distributed kvstore with the similar place on a local kvstore.

Now, I think it is time to make our hands really dirty. We are going to start the first distributed experiments with the following configuration:

  • one server (co-hosted with the worker),
  • two workers,
  • kvstore in a sync mode.

1 Server 2 Workers Sync Mode

First let me show you this nice graph and we can discuss it in details:

When you’re looking on the chart for the first time it might be tricky to make sense out of it, but let me explain what is going on here.

Everything on the left hand side from the black middle like is the same. Blue is the speed of training on the one machine and orange is the ideal speed of training. On the right hand side from the black line we now have the data from the second worker (since the only way how we can get 9 Voltas is to start using second worker). As can be seen speed on the first worker immideatly drops(to ~2400 from ~3000) when we’re adding second machine to the cluster. There might be several explanations to this:

  • first of all worker co-hosting the server, therefore server might consume some resource of the worker,
  • secondly, and this is probably the most important, we are using sync mode in our kvstore, therefore now our worker’s going to be blocked each time when it is trying to pull data from the server. However, this should not impact speed of trianing (images/sec) and should happen only during sync with the kvstore (but this is yet to be determined).

The next is the green line. Green line is the speed of training on the second worker. Ideally this line should be equal to the blue line on the left hand side. However this does not happen. For example with 1 GPU speed on the second worker is ~315 samples/sec while on the first worker it used to be ~390–400. This happens, probably, due to the same reason as speed drop for the first worker.

Yellow line is here to demonstrate combined speed of the cluster. It is equal to the sum of the green and blue lines. As can be seen you need 2 Voltas on the second worker, just to get the same speed as you used to have on the single machine. Only after adding third GPU (having 11 GPUs combined) speed is higher than with 8 GPUs!

Let’s treat it as our base line and let’s try to beat it! Now, I’m going to try different configuration.

2 Servers 2 Workers Sync Mode

With this configuration we are going to have 2 servers. Each server will host half of the model, therefore, now each of the worker should have ~equal speed of pulling/pushing the model. Our current results are following:

These results so unbelievably beautiful that I had to do measurements twice! MXNet shows almost ideal linear scaling results up 14 GPUs!

First worker shows almost no speed degradation, up till 15 GPUs are used. It is hard to tell why. I guess amount of the network communications matters. Since now each server responsible only for half of the model, worker that is co-hosted with the server is capable to update half of the model instantly. Anyway, it is hard to answer why this is happening. Now the question: can we make it even better by using async mode? Let’s give it a try:

2 Servers 2 Workers Async Mode

And the only difference on the following char that I was able to find is the word “async” in the title:

Ok, so sync/async indeed has no impact on the speed of training (though probably has some impact on the model synchronization time). Now, what if we introduce the third worker? With all current information I guess it should not be surprising that we are going to use third server as well.

3 Servers 3 Workers Async Mode

Now it is getting interesting:

As can be seen bottleneck erupts. Though it is still way-way better from what we saw in a single server mode. For example there is no cases when you need to add GPU just to reach previous speed, in current configuration speed way is always increasing with each sun-sequent GPU. Surprisingly speed goes up slightly with the 20th Voltas (on first 2 workers).

I’m afraid I do not have any numbers for you to desmonstrain how perfomance differs when we are using stand alone servers and not co-hosting them with the workers. I guess I will do such investigations next time.


MXNet is showing a great potential of scaling, almost linearly. If you are planing to build the cluster amount of servers is a key variable and looks like you should have 1 server per 1 worker.

What Is Yet To Be Solved

There are several things that I would really like to see:

  • fp16 support for the distributed training 0_0 IMHO this is critical blocker to double speed in the distributed mode, I have created the feature request, please feel free to +1 to it,
  • ability to use NCCL with the distributed training (feature request here).

Distributed MXNet, Can It Be Scaled Linearly? was originally published in Deep Learning as I See It on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: Deep Learning on Medium