Divide Hugging Face Transformers training time by 2 or more

Original article was published on Deep Learning on Medium



  • Dynamic padding used alone provides a significant training time reduction, which can be reinforced by using uniform size batching and mixed precision;
  • On some setups (small mini batch size + short sequences), mixed precision can produce a longer training time; in other cases, in particular with large mini batch sizes / long sequences, it is a game changer.


  • Across the 14 runs, 11 obtained in a single epoch a score above 81.18% (the score reported in the CamemBERT paper for 10 epochs with early stopping);
  • When we compare pairs of runs (same settings, with truncation at 128 vs. truncation at 493), it appears, unsurprisingly, that truncation has on average a (small) cost in accuracy, even though only 3% of the dataset is affected by the 128-token truncation.

By using both optimizations together with mixed precision, a 16 min training beats the score of a 4h38 training!

Optimization opportunities

Avoid computations whose results you are going to throw away

As explained above, the pad token signal is canceled by the application of the attention mask. The more pad tokens you put at the end of a sequence, the more unused computations you will perform.

In the Trainer class, you define a (fixed) sequence length, and all sequences of the train set are padded / truncated to reach this length, without any exception. On X-NLI, the shortest sequences are 10 tokens long. If you set a length of 128 tokens, you will add 118 pad tokens to those 10-token sequences, and then perform computations over those 118 pad tokens.

Worse, as written in the original BERT repo README, “…attention is quadratic to the sequence length. In other words, a batch of 64 sequences of length 512 is much more expensive than a batch of 256 sequences of length 128.”
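To make the quoted comparison concrete, here is a small arithmetic sketch. It assumes a deliberately simplified cost model (one attention score per pair of tokens in a sequence) and ignores the other layers of the model, whose cost is linear in the sequence length. The function name attention_cost is illustrative only.

```python
def attention_cost(batch_size: int, seq_len: int) -> int:
    """Simplified relative cost: one attention score per token pair, per sequence."""
    return batch_size * seq_len ** 2


# Both batches contain exactly the same number of tokens...
tokens_long = 64 * 512
tokens_short = 256 * 128

# ...but the attention part is 4 times more expensive for the long sequences.
ratio = attention_cost(64, 512) / attention_cost(256, 128)
print(tokens_long, tokens_short, ratio)
```

Under this simplified model, the two batches hold the same number of tokens, yet the attention computation of the batch of long sequences costs 4 times more.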

A mini batch is made of a small selection of sequences sampled from the dataset. Even when they are selected randomly in X-NLI, chances are that the longest sequence in a mini batch is shorter than the maximum sequence length set for the whole train set. Because the learning / gradient descent is performed at the mini batch level, we have the opportunity to limit the padding effect: we can first search for the longest sequence in the mini batch, and then pad the other sequences to that length only.
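As a rough illustration of the saving, the sketch below counts the pad tokens added to one mini batch under both strategies. The sequence lengths are hypothetical values chosen for the example, not measurements from X-NLI, and pad_tokens_added is an illustrative helper.

```python
def pad_tokens_added(lengths, target_len):
    """Number of pad tokens needed to bring every sequence up to target_len."""
    return sum(max(0, target_len - n) for n in lengths)


# Hypothetical token counts of the sequences in one mini batch.
batch_lengths = [10, 14, 23, 31]

fixed = pad_tokens_added(batch_lengths, 128)                   # pad to the dataset-level length
dynamic = pad_tokens_added(batch_lengths, max(batch_lengths))  # pad to the longest in the batch
print(fixed, dynamic)
```

With these made-up lengths, padding to 128 adds 434 pad tokens, while padding to the longest sequence of the batch adds only 46.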

Those operations can be performed in the collate_fn function. The purpose of this function is described in the PyTorch documentation. Basically, it takes the individual examples returned by the Dataset and merges them to build the tensor matrix to send to the model.

Dynamic padding

As explained above, the idea is to adjust the sequence length at the mini batch level instead of the dataset level. That way, we can limit unused computation. The work is performed inside the PyTorch DataLoader. Let’s recall how it works:

Inside a Pytorch Dataloader (missing my office board 🙁 )

The components:

  • Dataset() is the component that has access to the original text data, which can be a simple list of strings or something else, like a database connector;
  • Sampler() generates indexes to target a data point in the Dataset. It follows a strategy, for instance sequential generation (for a test set) or random generation (for a train set);
  • collate_fn(): for each mini batch, it receives the data points (from the Dataset) selected by the Sampler and groups them in a Tensor (theoretically it can be something else, but usually that’s what you expect as DataLoader output / model input).

collate_fn is the perfect place to perform the dynamic padding. Fortunately, the PyTorch DataLoader has a parameter in its constructor to provide our own implementation, so there is no need to override anything. The Trainer class from the Transformers library has a similar parameter in its constructor, and we will use it. Instead of a function, it expects an instance of a “Collator” (a Transformers specific class), which has a single purpose: to wrap the collate method.

Find below a possible implementation of the Collator class.
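What follows is a minimal sketch of the dynamic padding logic, not the author's original code. It is written in pure Python so that it runs without PyTorch. It assumes that each example is a dict with an "input_ids" list and an integer "label", and that the pad token id is 1 (check tokenizer.pad_token_id for your model). The function name dynamic_padding_collate and these field names are illustrative.

```python
from typing import Dict, List

PAD_TOKEN_ID = 1  # assumption: replace with tokenizer.pad_token_id for your model


def dynamic_padding_collate(examples: List[Dict], pad_token_id: int = PAD_TOKEN_ID) -> Dict[str, List]:
    """Pad every sequence to the longest one of this mini batch only."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    input_ids, attention_mask, labels = [], [], []
    for ex in examples:
        ids = ex["input_ids"]
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 for real tokens, 0 for pad tokens, so that padding is ignored by attention.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        labels.append(ex["label"])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Hypothetical mini batch of two already tokenized examples.
batch = dynamic_padding_collate([
    {"input_ids": [5, 8, 13, 21, 6], "label": 0},
    {"input_ids": [5, 9, 6], "label": 2},
])
print(batch["input_ids"])
```

In a real setup, each list would be converted with torch.tensor before being returned, and the function (or an object wrapping it) would be given to the DataLoader through its collate_fn parameter.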

Does dynamic padding help in decreasing training time?

We ran 4 experiments, grouped by batch size; for each group, we compare the runs with and without dynamic padding. When it is enabled for:

  • batches of 16 not truncated sequences, timing decreased from 4h39 to 0h59 (-79%) ;
  • batches of 64 sequences truncated to 128 tokens, timing decreased from 0h56 to 0h48 (-15%).

The timing decrease is significant in both cases, and is about 5X stronger for long sequences (-79% vs. -15%). It makes sense: in the train set, 97% of examples are shorter than 128 tokens, so for most of them we pay a tax for having a 493 max sequence size. By using the optimization, we pay only for the useful computation.

With truncation at 128 tokens, there is still a gain, as most sequences are still much shorter than 128 tokens. However, because BERT complexity is quadratic in its input length, the computation avoided is much smaller, and training time decreases by “only” 15%.

Does it impact accuracy?

We ran 4 experiments, grouped by batch size; for each group, we compare the runs with and without dynamic padding. When it is enabled for:

  • batches of 16 not truncated sequences, accuracy rose from 81.42% to 82.0%;
  • batches of 64 sequences truncated to 128 tokens, accuracy rose from 81.0% to 82.0%.

It appears that accuracy improves with dynamic padding in both cases.

Uniform size batching

Uniform size batching simply consists of building batches made of sequences of similar length. The purpose is to minimize padding when it is combined with dynamic padding.

There are many ways to implement it, the one we followed was to:

  • order examples by length in a simple Python list,
  • randomly select an index,
  • extract that example and the ones following it, n examples in total (n being the batch/step size),
  • delete the extracted examples from the list,
  • do it again until there are no more examples in the list.

That way, each batch is made of sequences of similar length, but consecutive batches have different lengths.

A naive (simple to understand / not clean) implementation may look something like this:
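The sketch below follows the steps listed above. It is not the author's original code. It works on example indexes rather than on the examples themselves, and the function name uniform_size_batches and the seed parameter are illustrative choices.

```python
import random
from typing import List


def uniform_size_batches(lengths: List[int], batch_size: int, seed: int = 0) -> List[List[int]]:
    """Group example indexes into batches of similar length, drawn in random order."""
    rng = random.Random(seed)
    # Order example indexes by sequence length.
    remaining = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = []
    while remaining:
        # Randomly select a position, then take that example and its neighbours.
        start = rng.randrange(len(remaining))
        batches.append(remaining[start:start + batch_size])
        # Delete the extracted examples so they are never drawn twice.
        del remaining[start:start + batch_size]
    return batches


# Hypothetical token counts for 10 examples.
lengths = [12, 48, 7, 33, 9, 51, 30, 8, 45, 11]
batches = uniform_size_batches(lengths, batch_size=3)
print([[lengths[i] for i in batch] for batch in batches])
```

Note that a batch drawn near the end of the list can contain fewer than batch_size examples; a cleaner implementation would handle this case explicitly.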

Does uniform size batching really reduce training time?

We showed previously that dynamic padding brings a large training time reduction. Let’s now compare training time with dynamic padding and no uniform size batching, and with both optimizations enabled. For:

  • batch of 16 not truncated sequences, training time decreases from 1h01 to 0h52 (-15%) ;
  • batch of 64 sequences truncated to 128 tokens, training time decreases from 0h48 to 0h30 (-38%).

So in both situations, our naive idea seems to bring another significant training time decrease.

Does uniform size batching impact accuracy in any way?

Usually neural networks are trained on randomly ordered data points. Uniform size batching limits this randomness, hence introduces a kind of bias which may, in theory, impact accuracy.

We will compare the setups with and without the uniform size batching only:

  • For a batch of 16 examples, when uniform size batching is activated, accuracy increases from 81.4% to 81.6%;
  • For a batch of 64 examples, when uniform size batching is activated, accuracy increases from 81.0% to 81.7%.

In both cases, there is an improvement, and we may conclude that there is no negative impact on accuracy.

However, we ran many experiments combining several options, and according to the Weights & Biases dashboard, the use of uniform size batching is negatively correlated with accuracy. After manually checking pairs of experiments (with/without the option), this effect is not obvious.

(if you want to go deeper, do not hesitate to check the report)

Mixed precision

Mixed precision is possible on PyTorch through the NVIDIA apex library. In short, in its most common mode, mixed precision consists of performing most operations in half precision and accumulating results in full precision (more info in the apex documentation).
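To give an intuition of why results are accumulated in full precision, here is a small numerical sketch. It does not use apex or a GPU: it only emulates IEEE 754 half precision rounding with the struct module of the Python standard library, and "full precision" here is simply a Python float (double precision), which is an assumption made to keep the example dependency free. The helper to_half is illustrative.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


step = to_half(0.01)  # a small value stored in half precision
half_sum = 0.0
full_sum = 0.0
for _ in range(10_000):
    half_sum = to_half(half_sum + step)  # accumulator kept in half precision
    full_sum += step                     # accumulator kept in full precision

# The exact result is close to 100.
print(half_sum, full_sum)
```

The half precision accumulator stops growing once the value added becomes smaller than half the gap between two representable numbers, whereas the full precision accumulator stays close to the expected total of about 100.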

Apex is known for bringing improvements in some scenarios. Sometimes it also brings some instability (e.g., the loss amplitude during training is bigger than without mixed precision), and quite rarely it prevents the model from converging. Said otherwise, it’s not a silver bullet, but an interesting tool to test on your own case.

The good news is that the Trainer class implements it out of the box. To leverage it, you just need to add the right flag to your command line (“--fp16”).

Regarding training time for mini batches of 16 long sequences, the situation is unusual:

  • used alone, mixed precision makes things better by reducing training time from 4h38 to 2h50;
  • combined with dynamic padding and uniform size batching, mixed precision makes training slower, from 0h52 to 1h01!

The reason is probably that in the second case, it adds overhead and doesn’t help that much as most batches are only made of short sequences. Mixed precision helps the most with big matrix operations.

When applied to mini batches of 64 short sequences, things are as expected:

  • Used alone, training time decreases from 0h56 to 0h26;
  • Combined with the 2 other options, training time decreases from 0h30 to 0h17.

This time, even when a step is made of short sequences, it contains 64 of them, which makes the matrix big enough to benefit from mixed precision.

Regarding accuracy, there is no clear pattern. You can form your own opinion by checking the Weights & Biases report.

Reproducible results

All experiments have been run using the same seed. We may simply have been lucky: our approach could hurt accuracy in general, just not with this seed and on this dataset.

We reran the 16 min training (the setting with all optimizations enabled) 5 times with different seeds, and both accuracy and timing are reproduced.

(again, if you want interactive graphs, check the report)

A conclusion?

We have shown that both techniques consistently provide a significant time reduction without reducing accuracy. Moreover, we learned that on a dataset with small batches, one should be careful with mixed precision, because it can lead to unexpectedly slower training if there is not enough computation to perform.

We are convinced that both techniques are low-hanging fruit that should be widely used by Transformers users.

To finish on a more general thought, we are pleasantly surprised by the results obtained with such simple ideas. Just for the story, in another unrelated experiment, we noticed that the French train set of X-NLI (which is a machine translation of an English dataset) was of low quality (many examples are absolute nonsense in French), and we were wondering whether a better translation would improve the accuracy on the test set (which is a manual translation). It represented an important opportunity for us, because if it worked, it would mean having plenty of datasets in French to play with. We spent a few bucks on DeepL, the translation was much better… and the accuracy didn’t change (we even thought there was a bug in our measures). Not all simple ideas are created equal!