How To Make The Most Out Of BERT Finetuning

Original article can be found here (source): Deep Learning on Medium

The graphs above demonstrate that it is possible to identify, early in training, bad initializations that will lead to bad models at the end of training. This is especially true for the smaller datasets. For the larger SST dataset the effect is less obvious from the graph, but there is still a strong correlation between validation performance after two epochs and final validation performance.

The question then becomes when to stop training a model, and how many models to train. For this purpose, the authors use an algorithm inspired by an early stopping criterion for hyperparameter search³. The algorithm takes the following three parameters:

  • t: the number of models we start training
  • f: when to evaluate models, as a fraction of the total number of epochs
  • p: the number of top performing models to continue training

Running this algorithm takes (tf + p(1−f))·s epochs of training in total, where s is the number of epochs each model would need to be trained fully (in this case s = 3). The authors obtain the best results with f in the region of 20–30%. They also run experiments to find the best parameters for different computational budgets. Common trends are:

  • t should be substantially larger than p; and
  • p should be roughly half the number of models our computational budget would allow us to train fully (for s epochs).
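The procedure above can be sketched in a few lines of Python. This is a toy reconstruction, not the authors' code: `train_one_epoch` and `evaluate` are hypothetical stand-ins, and each model's validation score is simulated from a random latent quality.

```python
import random

def early_stopping_search(t, f, p, s, train_one_epoch, evaluate):
    """Start t models, keep the top p after a fraction f of s total epochs.

    Total training cost is roughly (t*f + p*(1 - f)) * s epochs,
    versus t*s epochs for training every model fully.
    """
    checkpoint = max(1, round(f * s))  # epochs before the selection point
    models = [{"id": i, "epochs": 0} for i in range(t)]

    # Phase 1: train all t models for the first f*s epochs.
    for m in models:
        for _ in range(checkpoint):
            train_one_epoch(m)
            m["epochs"] += 1

    # Keep only the p models with the best validation score so far.
    survivors = sorted(models, key=evaluate, reverse=True)[:p]

    # Phase 2: train the survivors for the remaining (1 - f)*s epochs.
    for m in survivors:
        while m["epochs"] < s:
            train_one_epoch(m)
            m["epochs"] += 1

    # Return the best fully trained model.
    return max(survivors, key=evaluate)

# Toy usage: each "model" has a hidden quality; its validation score
# grows with training, scaled by that quality.
random.seed(0)
qualities = [random.random() for _ in range(20)]
train_one_epoch = lambda m: None  # real code would run one epoch of finetuning
evaluate = lambda m: qualities[m["id"]] * m["epochs"]

best = early_stopping_search(t=20, f=0.3, p=5, s=3,
                             train_one_epoch=train_one_epoch,
                             evaluate=evaluate)
```

With t = 20, f = 0.3, p = 5 and s = 3, this costs about (20·0.3 + 5·0.7)·3 ≈ 29 epochs of training, instead of the 60 epochs needed to train all 20 models fully.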

The results are summarised in the graph below. It shows the relative error reduction when using the above algorithm for each of the 4 tasks. Error reduction is relative to not using the above algorithm — that is, just training a certain number of models fully (x-axis) and selecting the best one. As we can see, for any computational budget, the early stopping algorithm leads to a sizeable performance increase.

Relative error reduction when finetuning BERT with the above early stopping algorithm, compared to just training t number of models (x-axis) fully [Source]

The Key Takeaway:

Starting many models, stopping the bad ones early, and proceeding with only a few can lead to better performance overall when finetuning BERT on a constrained budget.


In a resource-constrained setting (i.e., with a fixed computational budget), use these two tips to make the most out of BERT finetuning:

  1. Evaluate your model multiple times during an epoch; and
  2. Identify bad initializations early and stop them.
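The first tip amounts to checkpointing the validation score inside the training loop rather than only at epoch boundaries. A minimal sketch, assuming a batch-level training loop (the function and its arguments are our own illustration, not from the paper):

```python
def train_epoch_with_eval(batches, train_step, evaluate, evals_per_epoch=4):
    """Run one epoch, evaluating at evenly spaced points within it."""
    eval_every = max(1, len(batches) // evals_per_epoch)
    best_score = float("-inf")
    for i, batch in enumerate(batches, start=1):
        train_step(batch)
        if i % eval_every == 0:
            # In real code you would also save a checkpoint here
            # whenever the score improves.
            best_score = max(best_score, evaluate())
    return best_score

# Toy usage: 8 batches, 4 in-epoch evaluations with canned scores.
scores = iter([0.2, 0.5, 0.4, 0.6])
best = train_epoch_with_eval(list(range(8)),
                             train_step=lambda b: None,
                             evaluate=lambda: next(scores))
```

Evaluating only at the end of the epoch would miss the intermediate checkpoints; evaluating several times lets you keep the best model seen so far.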

Resource Constrained Pretraining

If you found this interesting, you might also want to check out the article below, which discusses ways to improve the pretraining of Transformer models like BERT.