How to Break GPU Memory Boundaries Even with Large Batch Sizes

Source: Deep Learning on Medium

The problem: batch size being limited by available GPU memory

When building deep learning models, we have to choose batch size — along with other hyperparameters. Batch size plays a major role in the training of deep learning models. It has an impact on the resulting accuracy of models, as well as on the performance of the training process.

The range of possible values for the batch size is limited today by the available GPU memory. As the neural network gets larger, the maximum batch size that can be run on a single GPU gets smaller. Today, as we find ourselves running larger models than ever before, the possible values for the batch size become smaller and might be far from the optimal values.

Gradient accumulation is a simple way to run batch sizes that would not otherwise fit into GPU memory.

What is batch size?

The batch size is the number of samples (e.g. images) used to train a model before updating its trainable model variables — the weights and biases. That is, in every single training step, a batch of samples is propagated through the model and then backward propagated to calculate gradients for every sample. The gradients of all samples will then be averaged or summed up and this value will be used as an input to a formula (depending on the chosen optimizer) that calculates the updates for the trainable model variables. Only after updating the parameters will the next batch of samples go through the same process.
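The training step described above can be sketched in a few lines. This is an illustrative example assuming PyTorch; the model, sizes, optimizer and loss are arbitrary toy choices, not part of the original text:

```python
import torch
import torch.nn as nn

# Toy model, optimizer and loss; all sizes here are illustrative only.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

batch_size = 32
inputs = torch.randn(batch_size, 10)          # one batch of samples
targets = torch.randint(0, 2, (batch_size,))  # matching labels

# One training step: forward-propagate the whole batch, compute the loss
# (which averages over the samples), backward-propagate to get gradients,
# then apply a single update to the trainable variables.
optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
optimizer.step()
```

Only after `optimizer.step()` completes does the next batch go through the same forward/backward/update cycle.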

Determining optimal batch size

Batch size has a critical impact on the convergence of the training process as well as on the resulting accuracy of the trained model. Typically, there is an optimal value or range of values for batch size for every neural network and dataset. Different neural networks and different datasets may have different optimal batch sizes.

There might be critical consequences when using different batch sizes that should be taken into consideration when choosing one. Let’s cover two of the main potential consequences of using small or large batch sizes:

  • Generalization: Large batch sizes may cause bad generalization (or even get the optimization stuck in a local minimum). Generalization means that the neural network performs well on samples outside of the training set. Bad generalization, which is closely related to overfitting, means that the network performs poorly on samples it has not seen during training.
  • Convergence speed: Small batch sizes may lead to slow convergence of the learning algorithm. The variable updates applied in every step, calculated from a batch of samples, determine the starting point for the next batch. Because training samples are drawn randomly from the training set at each step, the resulting gradients are noisy estimates based on partial data. The fewer samples in a batch, the noisier and less accurate these estimates are; that is, the smaller the batch, the bigger the impact a single sample has on the applied updates. Smaller batch sizes can therefore make the learning process noisy and fluctuating, extending the time it takes the algorithm to converge.

With all that in mind, we have to choose a batch size that is neither too small nor too large but somewhere in between. The main idea is to experiment with different batch sizes until we find one that works well for the specific neural network and dataset we are using.

Different batch sizes have different consequences. A too-small batch size may result in slow convergence.

Impact of batch size on the required GPU memory

While traditional computers have access to plenty of RAM, GPUs have much less, and although GPU memory capacities keep growing, they are often not enough. The training batch size has a huge impact on the GPU memory required for training a neural network. To understand why, let's first examine what is stored in GPU memory during training:

  1. Parameters — The weights and biases of the network.
  2. Optimizer’s variables — Per-algorithm intermediate variables (e.g. momentums).
  3. Intermediate calculations — Values from the forward pass that are temporarily stored in GPU memory and then used in the backward pass (e.g. the activation outputs of every layer, which are needed in the backward pass to calculate the gradients).
  4. Workspace — Temporary memory for local variables of kernel implementations.

NOTE: While (1) and (4) are always required, (2) and (3) are required only in training mode.
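As a back-of-the-envelope illustration of items (1) and (2), the hypothetical helper below estimates memory for parameters, their gradients, and optimizer state, assuming 32-bit floats and an Adam-style optimizer that keeps two extra state tensors per parameter. It deliberately ignores intermediate calculations (3) and workspace (4), which depend on the batch size and the implementation:

```python
def training_memory_estimate_mb(num_params, optimizer_states=2, bytes_per_value=4):
    """Rough lower bound on training memory, ignoring activations and
    workspace. optimizer_states=2 models Adam's two moment tensors."""
    params = num_params * bytes_per_value                   # (1) weights and biases
    grads = num_params * bytes_per_value                    # gradients of (1)
    opt = num_params * optimizer_states * bytes_per_value   # (2) optimizer variables
    return (params + grads + opt) / 1024**2

# e.g. a 25M-parameter network trained with Adam:
print(round(training_memory_estimate_mb(25_000_000)))  # ~381 MB before any activations
```

Even this batch-independent baseline can be substantial; the batch-dependent activations come on top of it.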

So, the larger the batch size, the more samples are being propagated through the neural network in the forward pass. This results in larger intermediate calculations (e.g. layer activation outputs) that need to be stored in GPU memory. Technically speaking, the size of the activations is linearly dependent on the batch size.

It is now clearly noticeable that increasing the batch size will directly result in increasing the required GPU memory. In many cases, not having enough GPU memory prevents us from increasing the batch size. Let’s now see how we could break the GPU memory boundaries and still use larger batch sizes.
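One way to see the linear dependence directly is to count the elements of the stored forward activations for different batch sizes. The sketch below uses PyTorch forward hooks on a toy model (layer sizes are arbitrary) as a simplified proxy for activation memory:

```python
import torch
import torch.nn as nn

# A toy network; the layer sizes are illustrative only.
model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 10))

def activation_elements(batch_size):
    """Count the elements of every layer's forward output for one batch,
    captured with forward hooks; a simplified proxy for activation memory."""
    sizes = []
    hooks = [m.register_forward_hook(lambda mod, inp, out: sizes.append(out.numel()))
             for m in model]
    model(torch.randn(batch_size, 256))
    for h in hooks:
        h.remove()
    return sum(sizes)

# Doubling the batch size doubles the activation footprint.
assert activation_elements(64) == 2 * activation_elements(32)
```

The same linear relationship is what makes the batch size the natural knob to turn when GPU memory runs out.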

Larger batch sizes require more GPU memory

Using larger batch sizes

One way to overcome the GPU memory limitation and run large batch sizes is to split the batch of samples into smaller mini-batches, where each mini-batch is small enough to fit in the available GPU memory. These mini-batches can run independently, and their gradients are averaged or summed before calculating the model variable updates. There are two main ways to implement this:

  1. Data-parallelism — use multiple GPUs to train all mini-batches in parallel, each on a single GPU. The gradients from all mini-batches are accumulated and the result is used to update the model variables at the end of every step.
  2. Gradient accumulation — run the mini-batches sequentially while accumulating the gradients. The accumulated result is used to update the model variables at the end of the last mini-batch.
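The second option can be sketched in a short loop. This is a minimal illustration assuming PyTorch; the model, sizes and hyperparameters are arbitrary, and the loss is divided by the number of accumulation steps so the accumulated gradient matches the average over the full global batch:

```python
import torch
import torch.nn as nn

# Toy model and hyperparameters; all values here are illustrative.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

accumulation_steps = 4   # global batch = 4 mini-batches
mini_batch_size = 8      # each small enough to fit in GPU memory

optimizer.zero_grad()
for step in range(accumulation_steps):
    inputs = torch.randn(mini_batch_size, 10)
    targets = torch.randint(0, 2, (mini_batch_size,))
    loss = loss_fn(model(inputs), targets)
    # Scale before backward so the sum of mini-batch gradients equals
    # the mean gradient over the whole global batch.
    (loss / accumulation_steps).backward()

# A single variable update for the whole global batch of 32 samples.
optimizer.step()
optimizer.zero_grad()
```

Only the mini-batch currently being processed occupies activation memory, which is what lets the global batch size exceed what a single forward pass could hold.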

Similarities between data-parallelism and gradient accumulation

Data-parallelism and gradient accumulation have many characteristics and constraints in common:

  • Neither of them enables running models that require more GPU memory than available (even with a single sample).
  • Batch normalization is done separately on every mini-batch rather than on the global batch, so neither technique is exactly equivalent to running the same model with the global batch size. (NOTE: Batch normalization over the global batch can be implemented in data-parallelism, e.g. via synchronized batch norm, but this is usually not the case and it is done per mini-batch.)
  • They both allow us to increase the global batch size beyond what the GPU memory limit permits for a single forward pass.

While the two options are quite similar, gradient accumulation can be done sequentially on a single GPU, making it more attractive for users who cannot access more than one GPU, or who want to minimize resource usage.

Additionally, the two can be used together: use several GPUs, accumulate gradients over a few mini-batches on each GPU, and reduce the accumulated results across all GPUs at the end of the step.
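One way to express this combination in PyTorch is DistributedDataParallel's `no_sync()` context, which skips the cross-GPU gradient reduction on all but the last mini-batch. This is a sketch, not the article's implementation; the single-process "gloo" initialization below exists only to make the snippet self-contained, and a real multi-GPU run would use a proper launcher with per-rank devices:

```python
import contextlib
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process setup purely for illustration; real runs use torchrun etc.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(10, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
accumulation_steps = 4

optimizer.zero_grad()
for step in range(accumulation_steps):
    inputs = torch.randn(8, 10)
    targets = torch.randint(0, 2, (8,))
    last = step == accumulation_steps - 1
    # Accumulate locally on each GPU; all-reduce the gradients only once,
    # on the final mini-batch of the step.
    ctx = contextlib.nullcontext() if last else model.no_sync()
    with ctx:
        loss = loss_fn(model(inputs), targets)
        (loss / accumulation_steps).backward()

optimizer.step()
dist.destroy_process_group()
```

Reducing gradients once per step instead of once per mini-batch also cuts communication overhead, which is part of what makes the combination attractive.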

We have run some experiments this way and refer to this as elasticity. The Run:AI product utilizes this feature to increase the utilization of GPU clusters and improve the productivity of data science teams. We will share some more details on these concepts in future posts.

Although there are similarities between gradient accumulation and data-parallelism, their implementations are completely different. We will be focusing on gradient accumulation in the next posts.