Original article was published on Artificial Intelligence on Medium
The recipe for success in the NLP field seems to have become: create a larger-than-ever-before model and train it for longer on a consolidation of hundreds of gigabytes of raw text data. Even when this approach falls short of reaching that coveted state-of-the-art spot on the GLUE leaderboard, it is possible to throw more compute at the problem. Training multiple models from different initialisations to create an ensemble usually does the trick.
“We trained an ensemble of eight 10-BILLION-parameter models for weeks on 1024 TPU-pods-you-could-not-even-afford-a-single-one-of, utilising every single written word found on the Internet”
This is usually how it goes.
But there is a good reason why the story goes like this: it works!
These training runs, the ones that create the model used for the submission, are only the tip of the compute-iceberg. Beneath lie numerous hyperparameter evaluation trials, ablation studies and benchmark models for comparison, each of which needs more than one run to account for statistical variation due to initialisation and other random factors.
All this is to say that a lot of compute goes into every new state-of-the-art result. And as anyone who has bought compute resources from AWS or GCP knows, it can be expensive. Really expensive.
So what is the price of training state-of-the-art models?
Training a single submission-worthy model, not an ensemble, the size of BERT-large is estimated to come in at $200,000! Looking at the top of the leaderboards only two years after the birth of BERT, it is not outlandish to consider 340 million parameters small. Training GPT-2 or T5, with their 1.5 billion and 11 billion parameters respectively, is estimated to cost in the ballpark of one million dollars! Considering all the auxiliary training runs and experiments, it is reasonable to believe that the final price tag for one of these papers reaches tens of millions of dollars.
All this for science, of course, and a set of carefully tuned matrices.
The million-dollar matrices.
Brute force has its merits, as should be apparent by now. But approaching the problem from new angles, applying brains instead of brawn, is something that will definitely become more mainstream. This is an area of research within NLP that I've found increasingly interesting to follow: figuring out how we can make do with smaller models, less data and/or more efficient training procedures. To this end, papers have recently been published addressing how to create more efficient models, both in training and in real-world usage. Below, you'll find three such examples that I find worth highlighting.
ELECTRA presents a novel pretraining procedure for Transformer architectures that is more parameter-efficient than the previously used masked language modelling (MLM). It reaches performance comparable to that of MLM-trained models for the same compute, while also enabling even higher performance to be achieved.
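To make the efficiency argument concrete, here is a toy sketch in plain Python (the helper name and tiny vocabulary are mine, not from the ELECTRA codebase) contrasting how many token positions contribute a training signal: MLM only learns from the ~15% of masked positions, while ELECTRA's replaced-token-detection discriminator labels every position as original or replaced.

```python
import random

def mlm_vs_rtd_signal(tokens, mask_prob=0.15, seed=0):
    """Toy comparison of training-signal density for MLM vs
    ELECTRA-style replaced token detection (RTD). Not a trainer,
    just a count of positions that contribute to the loss."""
    rng = random.Random(seed)
    vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

    # MLM: only the masked positions contribute to the loss.
    masked = [i for i in range(len(tokens)) if rng.random() < mask_prob]
    mlm_signal = len(masked)

    # RTD: a small "generator" fills each masked slot with a sampled
    # token; the discriminator then labels EVERY position as original
    # (0) or replaced (1). If the generator happens to sample the
    # original token, the position counts as original, as in ELECTRA.
    corrupted = list(tokens)
    for i in masked:
        corrupted[i] = rng.choice(vocab)
    labels = [int(corrupted[i] != tokens[i]) for i in range(len(tokens))]
    rtd_signal = len(labels)  # every position gives a signal

    return mlm_signal, rtd_signal, corrupted, labels
```

For a 24-token sequence, `rtd_signal` is always 24, while `mlm_signal` hovers around 3 to 4 positions, which is one intuition for why ELECTRA extracts more learning per example.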
The attention heads of the Transformer require calculations of quadratic complexity with respect to sequence length. This makes it prohibitively expensive to process longer documents, which is why BERT and many of its relatives are limited to 512 tokens. Longformer introduces a windowed attention mechanism that exhibits linear complexity, allowing us to process much longer sequences for the same compute.
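A rough way to see the quadratic-versus-linear difference is to count query-key score computations. The sketch below is a back-of-the-envelope estimate (the function and its symmetric window handling are my own simplification, not Longformer's actual implementation, which also adds global attention on selected tokens):

```python
def attention_ops(seq_len, window=None):
    """Count query-key score computations for one attention head.

    Full self-attention: every token attends to every token,
    so the cost is seq_len**2, i.e. O(n^2).
    Windowed (local) attention: each token attends only to tokens
    within +/- window//2 of itself, i.e. O(n * window)."""
    if window is None:
        return seq_len * seq_len
    ops = 0
    for i in range(seq_len):
        lo = max(0, i - window // 2)        # clip at sequence start
        hi = min(seq_len, i + window // 2 + 1)  # clip at sequence end
        ops += hi - lo
    return ops
```

Doubling the sequence length quadruples the cost of full attention but only roughly doubles the cost of windowed attention, which is what makes long documents tractable.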
Despite the training improvements brought by the papers above, it is still possible to end up with a massive model. And since we probably want to use our models for more than fighting for leaderboard positions, size matters: it directly impacts both inference time and, again, cost.
This is where knowledge distillation can prove super useful. It enables larger models to be distilled into smaller ones without losing much of the original's performance. TinyBERT is an example of this: 7x smaller and 9x faster than BERT-base with 96% of its performance!
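The core of distillation can be sketched with the classic soft-target loss from Hinton et al.: the student is trained to match the teacher's temperature-softened output distribution. This is a minimal illustration of that idea only; TinyBERT's actual objective additionally distils intermediate layers and attention matrices.

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """Cross-entropy of the student's softened distribution against
    the teacher's softened distribution. A high temperature exposes
    the teacher's 'dark knowledge' hidden in small probabilities."""
    p = softmax(teacher_logits, T)  # teacher soft targets
    q = softmax(student_logits, T)  # student predictions
    # Scaled by T^2, as in Hinton et al., so gradient magnitudes
    # stay comparable across temperatures.
    return -T * T * sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

The loss is smallest when the student's logits reproduce the teacher's, and grows as the two distributions diverge; in practice it is combined with the ordinary hard-label loss on the training data.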