Model Parallelism in Deep Learning is NOT What You Think

Source: Deep Learning on Medium


Distribution Axes in Deep Learning Networks

I have some layers on GPU-0 and others on GPU-1. Is this model parallelism?

No, unfortunately, it’s not.

As with any parallel program, data parallelism is not the only way to parallelize a deep network. A second approach is to parallelize the model itself. This is where the confusion arises, because the layers in a neural network have a data dependency on the layers before them. Placing some of your layers on a different device therefore does not mean they can be evaluated in parallel. Instead, one device sits idle while it waits for data from the other device.
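To make the dependency concrete, here is a minimal sketch in plain Python, with two functions standing in for layers placed on two hypothetical devices. The forward pass is strictly sequential, so the second "device" cannot do anything until the first one has finished:

```python
# Two toy layers standing in for layers placed on "GPU-0" and "GPU-1".
# The trace records which device is working at each moment.
trace = []

def layer1(x):  # lives on "GPU-0"
    trace.append("device0: layer1")
    return x * 2

def layer2(x):  # lives on "GPU-1"
    trace.append("device1: layer2")
    return x + 1

# Forward pass: layer2 consumes layer1's output, so device1 idles
# until device0 is done. Nothing here runs concurrently.
out = layer2(layer1(3))
print(out)    # 7
print(trace)  # device1 only ever runs after device0 finishes
```

This is exactly the idle-device pattern described above: splitting the layer chain across devices partitions the workload without creating any concurrency.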

True model parallelism means your model is split in such a way that each part can be evaluated concurrently, i.e. the order does NOT matter. In the above figure, Machine 1 (M1) and Machine 3 (M3) show how two layers are split across devices so they can be evaluated in parallel. The same holds for Machine 2 (M2) and Machine 4 (M4). However, going from {M1, M3} to {M2, M4} is just splitting your workload, because {M2, M4} have to wait on data from {M1, M3} before they can do any forward pass, and vice versa during backpropagation.
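The {M1, M3} split can be sketched as splitting a single layer's weight matrix by output rows, so each half is evaluated independently and the partial outputs are concatenated. The machine names in the comments are the hypothetical M1/M3 from the figure; the point is that the two halves can run in either order, or at the same time, and the result is identical to the unsplit layer:

```python
# A sketch of intra-layer (true) model parallelism: one linear layer's
# weight matrix is split by output rows across two workers ("M1" and
# "M3"); each half runs concurrently and the results are concatenated.
from concurrent.futures import ThreadPoolExecutor

def matvec(W, x):
    """Plain-Python matrix-vector product: one output per row of W."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

x = [1.0, 2.0]
W = [[1.0, 0.0],   # full 4x2 weight matrix of one layer
     [0.0, 1.0],
     [2.0, 0.0],
     [0.0, 2.0]]

# Split the layer's output rows across two "devices".
W_top, W_bottom = W[:2], W[2:]
with ThreadPoolExecutor(max_workers=2) as pool:
    top = pool.submit(matvec, W_top, x)        # evaluated on "M1"
    bottom = pool.submit(matvec, W_bottom, x)  # evaluated on "M3"
    y = top.result() + bottom.result()

# Identical to evaluating the whole layer on one device.
assert y == matvec(W, x)
print(y)  # [1.0, 2.0, 2.0, 4.0]
```

Because neither half depends on the other's output, this split genuinely parallelizes the layer, unlike the vertical layer-by-layer split.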

Is it Pipeline Parallelism?

Well…

Again, for something to be called parallel, it should have elements that can be evaluated concurrently. Pipeline parallelism, as the name suggests, means there is a stream of work items, so each worker always has something to do without waiting for its predecessor or successor to finish.

When you partition your network vertically, as shown, it is technically possible to achieve pipeline parallelism. How? You can stream the items in your minibatch, so that one item is in the forward pass through layer X while another item is in the forward pass through layer 1. Of course, your framework has to support this kind of parallelism, or you have to write it yourself from scratch. So it is possible, but I am currently not aware of any framework that actually streams work like this when you partition the model this way.
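The streaming idea above can be sketched as a schedule. Assuming two pipeline stages and a minibatch of four items, the loop below lists, for each time step, which (stage, item) pairs run at the same moment. From the second step onward both stages are busy simultaneously, which is what makes it a pipeline rather than plain workload partitioning:

```python
# Schedule for a 2-stage pipeline over a 4-item minibatch: item i
# enters stage s at time step t = s + i, so stage 0 works on item
# t while stage 1 works on item t-1.
n_stages, n_items = 2, 4
schedule = []
for t in range(n_stages + n_items - 1):
    step = [(s, t - s) for s in range(n_stages) if 0 <= t - s < n_items]
    schedule.append(step)

for t, step in enumerate(schedule):
    print(t, step)
# Steps 1..3 each contain two (stage, item) pairs: both workers busy.
# Sequential execution would take 2 * 4 = 8 steps; the pipeline takes 5.
```

Only the first and last steps have an idle stage (the pipeline "fill" and "drain"); everywhere else the vertical partition is doing concurrent work.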

But don't many Horovod and TensorFlow answers call this model parallelism?

It's confusing, isn't it?

Horovod has put up a gist and an answer showing how to do model parallelism, but if you look closely, it is doing workload partitioning, not model parallelism as discussed here. The dropout, dense, and softmax layers on GPU-1 in that example will not be evaluated before the layers on GPU-0, so there is no concurrent execution.

TensorFlow, too, has many StackOverflow answers and user blog posts [1, 2, 3, 4] that present workload partitioning as model parallelism.

Where can I read more about these?

Two papers

Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis by Tal Ben-Nun and Torsten Hoefler provides a clear description of these different models of parallelism. It also explains why it is OK to think of workload partitioning as a form of model parallelism, while reminding the reader that nothing happens concurrently with this approach.

Integrated Model, Batch and Domain Parallelism in Training Neural Networks by Gholami et al. dives into the many things that can be evaluated concurrently in a deep learning network. It also presents an analytical performance evaluation of these methods.