Reducing BERT Pre-Training Time from 3 Days to 76 Minutes

Source: Deep Learning on Medium

Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art deep learning language model that achieves strong results on several NLP tasks. However, like many deep learning models, BERT is time-consuming to train.

Large-batch training can reduce wall-clock training time, but it is challenging. Existing large-batch techniques do not perform well when the batch size becomes extremely large, and large-batch training suffers from a generalization gap, which makes it hard to maintain accuracy. Recent research has now made it possible to reduce the time it takes to train a BERT model.

Layer-wise Adaptive Moments optimizer for Batch training (LAMB)

Building on the existing Adam and LARS optimizers, researchers have introduced a new optimizer, LAMB, that reduces BERT training time from days to minutes.

LAMB works for both small and large batches, and it supports adaptive element-wise updating as well as accurate layer-wise correction. Users only need to tune the learning rate, with no other hyperparameters to adjust.
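
To make the two ingredients concrete, here is a minimal NumPy sketch of a single LAMB-style step for one layer. It is an illustration under assumptions, not the researchers' implementation: the function name `lamb_step`, the argument names, and the default values are all hypothetical. The element-wise part is an Adam-style moment estimate, and the layer-wise part rescales the whole update by the ratio of the weight norm to the update norm.

```python
import numpy as np

def lamb_step(weights, grad, m, v, step, lr=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-6, weight_decay=0.01):
    """One LAMB-style update for a single layer (illustrative sketch only)."""
    # Element-wise adaptive part: Adam-style first and second moments.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** step)  # bias correction, step starts at 1
    v_hat = v / (1 - beta2 ** step)
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * weights

    # Layer-wise correction: scale the update by ||weights|| / ||update||.
    w_norm = np.linalg.norm(weights)
    u_norm = np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

    new_weights = weights - lr * trust_ratio * update
    return new_weights, m, v

# Usage with made-up data for one layer (shapes and values are arbitrary).
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 4))
g = rng.normal(size=(3, 4))
w_next, m, v = lamb_step(w, g, np.zeros_like(w), np.zeros_like(w), step=1)
print(np.linalg.norm(w - w_next) / np.linalg.norm(w))  # relative step size
```

In this sketch the relative size of each layer's step is set by the learning rate alone, whatever the scale of the gradient, which is one way to read the claim that only the learning rate needs tuning.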

Using LAMB, the researchers scaled the batch size of BERT pre-training to 64K without losing accuracy, which reduced training time significantly. The baseline needs about 1 million iterations to finish BERT pre-training, whereas the LAMB optimizer requires only about 8,599 iterations, which cuts BERT pre-training time from 3 days to approximately 76 minutes.
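
As a quick back-of-the-envelope check, the figures quoted above can be turned into ratios. The variable names below are just for illustration, and the inputs are the approximate numbers from this summary, so the results are rough.

```python
# Approximate figures quoted in this summary.
baseline_iterations = 1_000_000
lamb_iterations = 8_599
baseline_minutes = 3 * 24 * 60  # 3 days expressed in minutes
lamb_minutes = 76

iteration_reduction = baseline_iterations / lamb_iterations
time_speedup = baseline_minutes / lamb_minutes

print(f"Iterations reduced by about {iteration_reduction:.0f}x")  # about 116x
print(f"Wall-clock time reduced by about {time_speedup:.0f}x")    # about 57x
```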

Potential Uses and Effects

What a milestone for deep learning model training! Large-batch training techniques are key to speeding up deep neural network training. There is no doubt they can help the deep learning community reduce training time for faster model evaluation while still reaching the best accuracy possible on NLP tasks.
