When are contextual embeddings worth using?

Original article was published by Viktor Karlsson on Artificial Intelligence on Medium

Contextual embeddings from BERT are expensive, and might not bring value in all situations. Let’s figure out when that’s the case!

Working with state-of-the-art models like BERT, or any of its descendants, is not for the resource-limited or budget-constrained researcher or practitioner. Pre-training BERT-base alone, a model that could almost be considered small by today’s standards, took more than 4 days on 16 TPU chips, which would cost several thousand dollars. This does not even take further fine-tuning or eventual serving of the model into account, both of which only add to the total cost.
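As a rough back-of-the-envelope check on that figure, the sketch below converts the training run into chip-hours and multiplies by an hourly price. The price per chip-hour is a placeholder assumption, not a quoted cloud rate, so substitute your provider's current price before drawing conclusions.

```python
def pretraining_cost(days: float, chips: int, usd_per_chip_hour: float) -> float:
    """Estimate the compute cost of a training run in US dollars."""
    chip_hours = days * 24 * chips
    return chip_hours * usd_per_chip_hour


# Figures from the text: 4 days on 16 TPU chips. This is a lower bound,
# since training took "more than" 4 days.
chip_hours = 4 * 24 * 16  # 1536 chip-hours

# ASSUMPTION: hypothetical price, for illustration only.
ASSUMED_USD_PER_CHIP_HOUR = 2.0

estimate = pretraining_cost(4, 16, ASSUMED_USD_PER_CHIP_HOUR)
print(f"{chip_hours} chip-hours, roughly ${estimate:,.0f} at the assumed rate")
```

Even at a modest assumed rate the run lands in the thousands of dollars, and that is before any hyperparameter search, fine-tuning, or serving.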

Instead of trying to figure out ways of creating smaller Transformer models, which I’ve explored in previous articles, it would be valuable to take a step back and ask: when are contextual embeddings from Transformer-based models actually worth using? In what cases would it be possible to reach similar performance with less computationally expensive, non-contextual embeddings like GloVe or maybe even random embeddings? Are there characteristics of the datasets that could indicate when this would be the case?
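To make the distinction concrete, here is a minimal sketch of a non-contextual embedding with randomly initialized vectors. It is an illustration only, not the setup used by Arora et al.; the `RandomEmbedding` class, its dimension, and the seed are hypothetical choices. The point is that each word maps to one fixed vector, so "bank" gets the same representation in both sentences, whereas a contextual model like BERT would produce a different vector for each occurrence.

```python
import random


class RandomEmbedding:
    """Non-contextual embedding: one fixed random vector per word.

    Hypothetical helper for illustration; dim and seed are arbitrary.
    """

    def __init__(self, dim: int = 8, seed: int = 0):
        self.dim = dim
        self._rng = random.Random(seed)
        self._table = {}  # word -> vector, filled lazily

    def embed_word(self, word: str) -> list:
        # Draw a vector the first time a word is seen, then reuse it.
        if word not in self._table:
            self._table[word] = [self._rng.gauss(0.0, 1.0) for _ in range(self.dim)]
        return self._table[word]

    def embed_sentence(self, sentence: str) -> list:
        # Naive whitespace tokenization, enough for this sketch.
        return [self.embed_word(w) for w in sentence.lower().split()]


emb = RandomEmbedding()
river = emb.embed_sentence("she sat on the river bank")
money = emb.embed_sentence("he went to the bank")

# "bank" is the last token in both sentences.
same_vector = river[-1] == money[-1]
print(same_vector)  # True: the vector ignores the surrounding words
```

A GloVe lookup behaves the same way at inference time, except that its table is pre-trained on co-occurrence statistics instead of being drawn at random.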

These are some of the questions that Arora et al. answer in their paper, “Contextual Embeddings: When Are They Worth It?” This article provides an overview of their study and highlights their main findings.


The authors’ study is divided into two parts: the first examines the effect of training data volume, and the second the linguistic characteristics of the datasets.

Training data volume

Arora et al. find that training data volume plays a key role in determining the relative performance of GloVe and random embeddings when compared to BERT. The non-contextual embeddings improve quickly with more training data and are often able to perform within 5–10% of BERT when all available data is used.