Learn PyTorch Multi-GPU properly

Source: Deep Learning on Medium

I’m Matthew, a machine learning engineer at Carrot Market who loves PyTorch. In this post, we have organized what we learned about multi-GPU training with PyTorch.

The post goes like this:

  • Deep Learning and Multi-GPU
  • Using PyTorch’s DataParallel
  • Using a Custom DataParallel
  • Using Distributed Packages in PyTorch
  • Learning with Nvidia Apex
  • Choosing a Multi-GPU Learning Method

After reading this article, you’ll be able to keep all four GPUs fully utilized, as shown in the following nvidia-smi screenshot. 😀

Deep Learning and Multi-GPU 🥕

Deep learning models are generally trained on GPUs. Deep neural networks consist mostly of matrix operations, which GPUs speed up greatly. As deep learning has evolved, networks have grown in size. Since very deep neural networks proved successful in vision, most of the deep learning research that followed has used large models. In the following figure, you can see that ResNet has 152 layers. Since ResNet, the vision field has kept improving performance with large datasets and large models. The NLP field, which used to rely on relatively light models, has likewise focused on improving performance with large models trained on large datasets, starting with BERT in 2018.

Image Source: https://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf

Most of the time we use Nvidia GPUs, and each GPU model has a different amount of memory. Individuals who train deep learning models at home or in a lab often use a gaming GPU such as the GTX 1080 Ti. These GPUs are more cost-effective than GPUs built for professional graphics or deep learning workloads. In most cases, a single GPU such as a GTX 1080 Ti or TITAN Xp is enough for deep learning. But if you are training a model on a large dataset in a company or lab, a single GPU becomes a limitation. In deep learning, batch size often affects performance, and one GPU, especially a gaming GPU, has limited memory. For example, with its 12 GB of memory, a TITAN Xp can run the BERT base model only up to a batch size of about 30. That is a significant difference from the batch size of 256 used for training in the BERT paper. This is where multi-GPU training comes in: literally, one model is trained on multiple GPUs.

Source of photo: http://www.macvidcards.com/store/p97/Nvidia_GTX_1080_Ti_11_GB.html

If you want to use multiple GPUs, you build a workstation like the one in the following figure. Carrot Market currently has workstations built with four TITAN Xps. Even once a multi-GPU workstation is set up, using it is not as easy as it sounds. With multiple GPUs, you may find that memory usage differs from GPU to GPU. In addition, training is often not as much faster than on a single GPU as you would hope. If you are not familiar with this environment, it can take a long time to get the most out of multi-GPU training. Training on deep learning hardware is not free. The burden is smaller if you build your own workstation and train in the office or lab, but training a model in the cloud is more expensive, because costs keep accruing while you debug your multi-GPU training code. Therefore, this article introduces some of the problems we encountered during multi-GPU training with PyTorch, along with their solutions.

Photo Source: https://gathering.tweakers.net/forum/list_messages/1870709

Using PyTorch’s DataParallel 🥕

PyTorch provides a feature called DataParallel for multi-GPU training out of the box. The following figure shows how DataParallel works.

Image Source: https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255

To train on multiple GPUs, you must first copy the model to each GPU. Then, at every iteration, the batch is divided by the number of GPUs. This division is called ‘scatter’, and DataParallel actually performs it with a scatter function. After the input is split, each GPU runs the forward pass. Each replica produces outputs for its share of the input, and these outputs are then collected on one GPU. Collecting tensors onto a single device is called ‘gather’.
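To make scatter and gather concrete, here is a minimal CPU-only sketch. It mimics the two steps with torch.chunk and torch.cat rather than calling DataParallel’s own scatter and gather functions, and the batch shape and the number of GPUs are assumed values chosen for illustration.

```python
import torch

num_gpus = 4                   # assumed number of GPUs
batch = torch.randn(32, 163)   # assumed batch: 32 samples, sequence length 163

# "scatter": split the batch along dimension 0, one chunk per GPU
chunks = torch.chunk(batch, num_gpus, dim=0)
print([tuple(c.shape) for c in chunks])  # [(8, 163), (8, 163), (8, 163), (8, 163)]

# Each replica would run its forward pass on its own chunk;
# doubling the values stands in for the model here.
outputs = [c * 2 for c in chunks]

# "gather": concatenate the per-GPU outputs on a single device
gathered = torch.cat(outputs, dim=0)
print(tuple(gathered.shape))  # (32, 163)
```

The gathered tensor has the full batch size again, which is why the device that receives it needs more memory than the others.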

In deep learning, there is usually a loss function that compares the output of the model with the correct answer. The loss computed by this function is then back-propagated. Back-propagation is performed on each GPU, resulting in a gradient of the model on each GPU. If you use four GPUs, there are four replicas of the model, one per GPU, and each replica has its own computed gradient. To update the model, the gradients from every GPU are gathered onto one GPU, where the update is applied. If you are using an optimizer such as Adam, it performs additional operations rather than updating the model directly with the gradient. Using DataParallel takes just a single line of code: wrapping the model with nn.DataParallel.

When you train with a model wrapped in nn.DataParallel, each forward pass proceeds as described above: replicate → scatter → parallel_apply → gather. Because gather collects the output of every replica onto a single GPU, that GPU has a higher memory footprint.

In general, training code that uses DataParallel looks the same as single-GPU training code, except that the model is wrapped.
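A minimal sketch of one training step with nn.DataParallel might look like the following. The toy model, tensor sizes, and learning rate are assumptions made for illustration and are not the BERT setup used in this post. The snippet falls back to the CPU when no GPU is available, in which case DataParallel simply calls the wrapped module.

```python
import torch
import torch.nn as nn

# Toy model; the layer sizes are arbitrary assumptions for illustration.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# The single line that enables multi-GPU training: wrap the model.
model = nn.DataParallel(model).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

inputs = torch.randn(64, 16).to(device)
targets = torch.randint(0, 4, (64,)).to(device)

optimizer.zero_grad()
outputs = model(inputs)              # replicate -> scatter -> parallel_apply -> gather
loss = criterion(outputs, targets)   # computed on the device that gathered the outputs
loss.backward()
optimizer.step()
print(loss.item())
```

Note that the loss is computed only after the outputs have been gathered, so it runs on a single device.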

The code I used to test PyTorch’s DataParallel is Kim’s BERT code (link: https://github.com/codertimo/BERT-pytorch ). We tested multi-GPU training with a model smaller than the one in the BERT paper: the input sequence length is 163, and the model has 8 layers, 8 attention heads, and 256 hidden units. I started multi-GPU training on GPUs 0, 1, 2, and 3, and then checked GPU usage with nvidia-smi. GPU 0 uses 6 GB more memory than GPUs 1, 2, and 3. When one GPU uses much more memory than the others, you cannot increase the batch size very far. In this experiment, I was able to increase the batch size up to 200. Because batch size often affects training performance in deep learning, the imbalance in memory usage has to be resolved. Another reason I use multiple GPUs is that I want training to finish sooner. When training takes a long time, a difference in batch size can mean as much as an extra week of training.

The simplest way to mitigate the memory imbalance is to gather the output on a different GPU. By default, the outputs are gathered on the default GPU, and the gradients are gathered on that same GPU as well, so it uses significantly more memory than the others. Gathering the output on another GPU therefore reduces the difference in memory usage. Simply pass the index of the GPU that should collect the output as the output_device argument of nn.DataParallel.
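As a hedged sketch, setting output_device could look like this. The toy model and the choice of GPU 1 are assumptions for illustration; the snippet falls back to the default behaviour when fewer than two GPUs are present.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)  # toy model, assumed for illustration

if torch.cuda.device_count() >= 2:
    # Gather the outputs on GPU 1 instead of the default GPU 0.
    model = nn.DataParallel(model.cuda(), output_device=1)
else:
    # Fewer than two GPUs: nothing to rebalance, so use the defaults.
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
outputs = model(torch.randn(8, 16, device=device))
print(outputs.device)  # the device on which the outputs were gathered
```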

If you set output_device and start training again, you will notice that the GPU usage has changed: GPU 0 uses less memory and GPU 1 uses more. But the usage is still not well balanced. The size of the model output depends on the batch size, so as you increase the batch size, the memory usage of GPU 1 keeps growing. This may work as a temporary fix, but it is not a proper solution. In addition, if you look at GPU-Util, you can see that the GPUs are not being utilized properly.

Using a Custom DataParallel 🥕

Hints on how to solve the memory imbalance problem while using DataParallel can be found in a package called PyTorch-Encoding (package link: https://github.com/zhanghang1989/PyTorch-Encoding ). The memory usage of one GPU grows because the output of the model is collected on that GPU. Why collect the output of the model on one GPU? Because we need the output of the model to calculate the loss function. DataParallel makes the model run in parallel, but the loss function is left as it is, so the loss is calculated on one GPU. Therefore, if you make the loss function run in parallel as well, you can resolve much of the memory imbalance.

Within PyTorch-Encoding, the parallel.py file contains the code that makes the loss function parallel (DataParallelCriterion).

Making a loss function parallel works the same way as making a model parallel. In PyTorch, the loss function is also a module, so this module is replicated to each GPU. The tensors that hold the correct answers of the data are scattered to each GPU as well. The model output, the correct answer, and the loss function needed to calculate the loss then all reside on each GPU. Therefore, the loss value can be calculated on each GPU, and each GPU can use its own loss to perform the backward operation.
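The following CPU-only sketch illustrates why the loss can be split this way. It is not the PyTorch-Encoding implementation: it only checks that, with equally sized shards and a mean-reduced loss, averaging the per-shard losses gives the same value as computing the loss on the gathered output. The tensor sizes and the number of shards are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
criterion = nn.CrossEntropyLoss()     # the loss function is itself an nn.Module

logits = torch.randn(8, 5)            # assumed model outputs: 8 samples, 5 classes
targets = torch.randint(0, 5, (8,))   # assumed labels

# Usual DataParallel: the outputs are gathered and the loss is computed once.
loss_gathered = criterion(logits, targets)

# Parallel criterion idea: scatter the targets as well and compute one loss
# per shard (each shard would live on its own GPU), then average the results.
num_shards = 4
shard_losses = [
    criterion(logit_shard, target_shard)
    for logit_shard, target_shard in zip(
        torch.chunk(logits, num_shards), torch.chunk(targets, num_shards)
    )
]
loss_parallel = torch.stack(shard_losses).mean()

print(torch.allclose(loss_gathered, loss_parallel, atol=1e-6))  # True
```

Because each shard only needs its own outputs and targets, no single device has to hold the outputs of the whole batch.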

Picture Source: https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255

The parallel loss function works as follows. The target that holds the correct answers of the data is scattered, and the loss is calculated in each replicated module. The computed outputs are then passed through Reduce.apply so that the backward operation runs on each GPU.

If you use DataParallelCriterion, you should not wrap your model with the regular DataParallel, because DataParallel gathers the output on one GPU by default. Instead, we use DataParallelModel, a custom DataParallel class. Training with DataParallelModel and DataParallelCriterion is quite simple: take the parallel.py file from the PyTorch-Encoding package and import the two classes from it in your training code.

If you do this, the nvidia-smi output looks as follows, with the batch size set to 200. The difference in memory usage between GPU 1 and GPU 2 has been significantly reduced compared to using DataParallel alone. Because the batch size can now be increased, the training time has been reduced by about one third. But as you can see from the GPU-Util numbers, we still aren’t getting the most out of the GPUs. How can you raise GPU utilization to 100%?

Using Distributed Packages in PyTorch 🥕

If you work in deep learning, you have probably heard of distributed training. When DeepMind announced AlphaGo or AlphaStar, it explained how the models were trained. Large models like these are usually trained in a distributed fashion.

Photo Source: https://www.quantamagazine.org/is-alphago-really-such-a-big-deal-20160329/

Distributed training is designed for training across multiple computers rather than on a single one. However, you can also use it for multi-GPU training on one machine. You can implement distributed training yourself, but you can also use the features provided by PyTorch.

In addition to DataParallel, PyTorch provides features for distributed training. If you’re curious about how distributed training works in PyTorch, I recommend following the PyTorch tutorial.

If you simply want to do multi-GPU training using distributed training, take a look at the example provided by PyTorch. One of the big datasets in vision is ImageNet, and the official PyTorch ImageNet example trains deep learning models on it. This example shows how to do distributed training on multiple machines, as well as how to train on multiple GPUs on one machine.

The main multi-GPU parts of main.py in the ImageNet example can be summarized as follows. When you run main.py, main runs. main in turn launches main_worker in multiple processes. The four GPUs are treated as one node, and world_size is set accordingly. The mp.spawn function then runs main_worker separately for each of the four GPUs.
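The bookkeeping behind world_size can be sketched in plain Python. The helper names below are hypothetical and are not taken from the ImageNet example; they only illustrate how a launcher of this kind might derive the total number of processes and each process’s rank, assuming one process per GPU.

```python
def total_world_size(ngpus_per_node: int, num_nodes: int) -> int:
    """Total number of processes when one process drives one GPU."""
    return ngpus_per_node * num_nodes


def global_rank(node_rank: int, ngpus_per_node: int, local_gpu: int) -> int:
    """Unique rank of the process that drives `local_gpu` on node `node_rank`."""
    return node_rank * ngpus_per_node + local_gpu


# One machine (node 0) with four GPUs, as in this post.
world_size = total_world_size(ngpus_per_node=4, num_nodes=1)
ranks = [global_rank(0, 4, gpu) for gpu in range(4)]
print(world_size, ranks)  # 4 [0, 1, 2, 3]
```

mp.spawn passes the process index as the first argument of the function it launches, so with nprocs set to 4 each copy of main_worker receives a GPU index from 0 to 3.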

main_worker initializes distributed training on each GPU through dist.init_process_group. PyTorch’s docs tell you to use nccl as the backend for multi-GPU training. In init_method, replace FREEPORT with a port that is free to use. Once this initialization is done, distributed training is possible. In the example, the model is wrapped with DistributedDataParallel instead of DataParallel. As described for DataParallel, it distributes the input, performs the forward operation, and then performs the backward operation.
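A hedged sketch of such a worker is shown below. It is not the code of the ImageNet example: make_init_method is a hypothetical helper, the port 12345 is a placeholder that must be replaced by a port that is actually free, and the toy nn.Linear model stands in for the real one. The worker itself needs GPUs and is only defined here, not called.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


def make_init_method(port: int, host: str = "127.0.0.1") -> str:
    """Build the TCP init_method string; `port` must be free on `host`."""
    return f"tcp://{host}:{port}"


def main_worker(gpu: int, world_size: int, port: int) -> None:
    """Runs once per GPU; on a single machine `gpu` doubles as the rank."""
    dist.init_process_group(
        backend="nccl",                      # backend recommended for GPU training
        init_method=make_init_method(port),
        world_size=world_size,
        rank=gpu,
    )
    torch.cuda.set_device(gpu)
    model = nn.Linear(16, 4).cuda(gpu)       # toy model standing in for the real one
    model = DistributedDataParallel(model, device_ids=[gpu])
    # The DataLoader and the usual training loop would go here.
    dist.destroy_process_group()


print(make_init_method(12345))  # tcp://127.0.0.1:12345
```

Each process owns exactly one GPU and one model replica, so there is no single device that collects all outputs.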

The DataLoader uses DistributedSampler to pass input to each process. DistributedSampler is meant to be used together with DistributedDataParallel. To use it, simply wrap the dataset in a DistributedSampler and pass that sampler as the sampler argument of the DataLoader. After that, you can use the DataLoader just as you normally would.
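As a small sketch, the sampler can be tried out without any GPUs by passing num_replicas and rank explicitly. The toy dataset of 20 items and the batch size are assumptions; inside a real worker, where init_process_group has been called, both arguments can be left at their defaults.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(20).float().unsqueeze(1))  # toy dataset

# One sampler per (simulated) process; each sees a quarter of the data.
samplers = [
    DistributedSampler(dataset, num_replicas=4, rank=rank, shuffle=True)
    for rank in range(4)
]
# The sampler is passed to the DataLoader through the sampler argument.
train_loader = DataLoader(dataset, batch_size=5, sampler=samplers[0])

for sampler in samplers:
    sampler.set_epoch(0)  # same epoch everywhere, so the shards do not overlap

(first_batch,) = next(iter(train_loader))
print(tuple(first_batch.shape))  # (5, 1)

all_indices = sorted(i for sampler in samplers for i in sampler)
print(all_indices == list(range(20)))  # True
```

Together the four samplers cover the whole dataset exactly once per epoch, while each process only loads its own share.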

Inside DistributedSampler, each sampler draws data only from its own part of the dataset, which is the whole dataset divided by the total number of GPUs. To create these parts, the index list of the entire dataset is randomly shuffled, and the shuffled list is then split and assigned to the sampler of each GPU. The list of indexes assigned to each GPU’s sampler is random again in every epoch. For this to work, you need to call train_sampler.set_epoch(epoch) before every epoch.
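The splitting logic can be sketched in plain Python. This is a simplified illustration and not the PyTorch source: shard_indices is a hypothetical function, it uses Python’s random module instead of torch.randperm, and the padding rule is an assumption about how an uneven dataset is made divisible.

```python
import math
import random


def shard_indices(dataset_len, num_replicas, rank, epoch, seed=0):
    """Return the dataset indices that one replica reads in one epoch."""
    indices = list(range(dataset_len))
    # Every replica shuffles with the same seed, so all see the same order.
    random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating the head so the list divides evenly among replicas.
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices += indices[: total_size - len(indices)]
    # Replica `rank` takes every num_replicas-th index, starting at `rank`.
    return indices[rank:total_size:num_replicas]


shards = [shard_indices(10, num_replicas=4, rank=r, epoch=0) for r in range(4)]
print(shards)
```

Because the shuffle is seeded with the epoch number, forgetting to call set_epoch would give every epoch the same order.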

We used the PyTorch distributed package to train a small BERT model. In the nvidia-smi output, you can see that the GPU memory usage is exactly the same on every GPU. In addition, GPU-Util is quite high, at 99%. If you have made it this far, you are ready for multi-GPU training.

However, with DistributedDataParallel, you may occasionally run into problems when trying to start training. A GitHub issue describes one of several such problems. There are reports that, even when running BERT code, DistributedDataParallel can cause problems if the model contains parameters that are not used for training. Looking for a way to train without worrying about these problems, I found a package from Nvidia called Apex.

Learning with Nvidia Apex 🥕

Nvidia has created a package for mixed precision operations called Apex. Deep learning normally uses 32-bit operations; mixed precision aims to use 16-bit operations as well in order to save memory and speed up training. In addition to mixed precision math, Apex includes features for distributed training. This post does not cover mixed precision.

DDP is Apex’s DistributedDataParallel. Apex provides related examples written for training on ImageNet, and its usage is well documented in the Apex docs.

In the Apex example, DistributedDataParallel is imported from apex and used to wrap the model. Unlike the official PyTorch example above, the code does not launch multiple processes itself. Otherwise, it is the same as with PyTorch’s DistributedDataParallel.

To run this code, we launch main.py via torch.distributed.launch with a command of the form python -m torch.distributed.launch --nproc_per_node=4 main.py, which starts four processes on the node. Each process trains on one GPU. If you have 2 GPUs, change nproc_per_node to 2. The batch_size and num_worker values set in main.py are the batch size and number of workers for each GPU. If the batch size is 60 and the number of workers is 2, then the overall batch size is 240 and the overall number of workers is 8.

I did multi-GPU training using Nvidia Apex, and the GPU usage is as follows. GPU memory usage is constant across all GPUs (GPU 3 differs because other jobs are also allocated to it). Looking at GPU-Util, you can see that it is 99% or 100%.
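The per-GPU versus overall numbers can be checked with a few lines of Python. The function name global_settings is hypothetical; it simply multiplies the per-process values by the number of processes, assuming one process per GPU.

```python
def global_settings(batch_size_per_gpu: int, workers_per_gpu: int, nproc_per_node: int):
    """Effective batch size and DataLoader worker count across all processes."""
    return batch_size_per_gpu * nproc_per_node, workers_per_gpu * nproc_per_node


print(global_settings(60, 2, 4))  # (240, 8)
print(global_settings(60, 2, 2))  # (120, 4)
```

When moving from four GPUs to two, the per-GPU batch size has to be doubled if the overall batch size should stay the same.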

Choosing a Multi-GPU Learning Method 🥕

There are four ways to do multi-GPU training with PyTorch.

  • DataParallel
  • Custom DataParallel
  • Distributed DataParallel
  • Nvidia Apex

DataParallel is the most basic method provided by PyTorch, but it suffers from GPU memory imbalance. Custom DataParallel solves the memory problem to some extent, but it still does not utilize the GPUs properly. DistributedDataParallel is a PyTorch feature that was originally created for distributed training, but it can also be used for multi-GPU training, without the memory imbalance or the poor GPU utilization. However, because it occasionally runs into problems, we also looked at multi-GPU training with Nvidia’s Apex.

So is it always a good idea to use Apex? The problems I have described do not always occur when you train deep learning models. If you are training an image classifier, DataParallel may be enough. The reason for the GPU memory imbalance with BERT is that the model output is quite large: at each step, the model outputs as many values as there are words. For image classification, however, the model output is not that large, even though the model itself can be large. So there is very little GPU memory imbalance.

To verify this, I trained PyramidNet on CIFAR-10. CIFAR-10 is a dataset of 32×32 images in 10 categories. PyramidNet was, until recently, the best-performing model on CIFAR-10, and its size can be scaled. To compare training performance on multiple GPUs, it is better to use a large model. Therefore, we used a model with 24,253,410 parameters, which corresponds to PyramidNet (alpha = 270) in the following table. We used four K80s for training.

Image Source: https://github.com/dyhan0920/PyramidNet-PyTorch

PyramidNet Single GPU (batch size: 240)

First, I trained PyramidNet on a single GPU. With only one GPU, a batch size of around 240 is the upper limit.

With a batch size of 240, it takes about 6–7 seconds to process one batch. The total training time (the time it took to train 1 epoch) was about 22 minutes.

PyramidNet DataParallel (batch size: 768)

Next, I trained PyramidNet using PyTorch’s basic DataParallel module. The batch size can be raised up to 768, which is significantly larger than with only one GPU. In the screenshot below, you can see that all GPUs use almost the same amount of memory even with only the DataParallel module. In addition, whereas BERT uses Adam, PyramidNet uses plain SGD, so there is no memory imbalance issue.

The training time dropped from the original 22 minutes to 5 minutes. Also, even though the batch size is 768, the time per batch dropped from about 6 seconds to 5 seconds. So for image classification, you will find that DataParallel alone is sufficient. If you want to train with even larger batch sizes or train faster (datasets like ImageNet take much longer), you can use distributed training. In that case, however, training is done across multiple computers rather than on the GPUs of a single machine.

Therefore, you must choose a multi-GPU training method depending on the model you are training and also on the optimizer. This is more of a problem in the field of natural language processing than in the field of vision.

Wrapping Up

Reading and implementing deep learning papers is hard, and making the best use of resources takes a lot of additional effort. Although there are many reviews and implementations of deep learning papers, there is not much material on how to use resources properly for training. I hope this helps people who do deep learning with PyTorch.