BLEURT: Failures

Original article was published on Artificial Intelligence on Medium

A new metric to measure textual similarity, or should I say a 21st-century metric that could perhaps be used forever.

Any task or architecture we build needs to be evaluated in order to set up a benchmark. This benchmark may use simple metrics like Jaccard or some fancy statistical-learning formulation, but the same question always arises: are these metrics enough?

Every NLP researcher asks "How can I evaluate my model?", and when the field is natural language generation the question becomes even harder. Consider a simple metric like precision, and try to give a similarity score for the example below:

Here Reference 1 is the human output, and the machine translation output is: the the the the the the

How to calculate Precision? (Source: Wikipedia)
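For reference, the standard information-retrieval definition of precision, which is assumed to be the formula this caption points to, can be written as:

```latex
\text{precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}
```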

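Replace "documents" with "words" in the above formula and calculate the score.

Every word in the machine output also appears in the reference, so precision becomes 1 even though the output is meaningless. BLEU (bilingual evaluation understudy) was introduced to address this.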
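To make the problem concrete, here is a minimal Python sketch of word-level precision. The reference sentence "the cat is on the mat" is an assumption for illustration, and `unigram_precision` is a hypothetical helper name, not something from the BLEU or BLEURT papers.

```python
def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words that also appear in the reference."""
    candidate_words = candidate.lower().split()
    reference_words = set(reference.lower().split())
    if not candidate_words:
        return 0.0
    # A candidate word counts as a match every time it occurs,
    # no matter how rarely it occurs in the reference.
    matches = sum(1 for word in candidate_words if word in reference_words)
    return matches / len(candidate_words)


# Assumed reference sentence, used only for illustration.
reference = "the cat is on the mat"
candidate = "the the the the the the"
print(unigram_precision(candidate, reference))  # 1.0
```

All six candidate words are "the", and "the" is in the reference, so the score is a perfect 1.0 for a useless translation.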

You can read more about BLEU elsewhere, but in short, BLEU computes a form of n-gram overlap between the machine output and the reference in order to calculate the score.
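The sketch below shows one ingredient of BLEU, clipped (modified) n-gram precision, in Python. It is a simplification under stated assumptions: a single reference, no brevity penalty, and no averaging across n-gram orders. The function names and the reference sentence are hypothetical and chosen for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision against a single reference."""
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each n-gram is credited at most as many times as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


# Assumed reference sentence, used only for illustration.
reference = "the cat is on the mat"
candidate = "the the the the the the"
print(clipped_precision(candidate, reference, n=1))  # only 2 of the 6 "the" are credited
print(clipped_precision(candidate, reference, n=2))  # no bigram "the the" in the reference
```

With clipping, the degenerate output no longer gets a perfect score, which is the behaviour plain precision lacked.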

Similarly, there is the ROUGE metric, which works on the same basis as BLEU. These metrics can work fine when there is only one correct answer, but what about the case where more than one output is valid? In the example above, we could also have "The cat sat on the mat". Because they only count surface overlap, these two methods cannot evaluate semantic similarity. Now, why do we need these scores at all when we have human evaluators? As you may have guessed, human evaluation is quite an expensive and tedious process.

So researchers thought: why not train a machine learning model that learns to give the scores? That is, train a neural network where the loss is calculated between your model's output and a human gold standard.

Example loss of a neural net, where ŷ is the machine-produced output and y is the human gold standard.
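As an illustration of such a loss, here is a minimal Python sketch that assumes a mean squared error between predicted scores and human ratings. The choice of squared error and the name `mean_squared_error` are assumptions for this example, and the numbers are made up.

```python
def mean_squared_error(predicted, gold):
    """Average squared difference between predicted scores and human scores."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not predicted:
        raise ValueError("at least one score is required")
    return sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(predicted)


# Made-up scores: y_hat from the model, y from human raters.
y_hat = [0.9, 0.2, 0.5]
y = [1.0, 0.0, 0.5]
print(mean_squared_error(y_hat, y))  # roughly 0.0167
```

The closer the predicted scores are to the human ratings, the smaller the loss, which is what training tries to minimise.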

After trying various techniques like BEER, RUSE, and ESIM, we fell into the same pit: a lack of training data, since it is quite difficult to train a new system for metric evaluation every year. This is where transfer learning came in.

BLEURT is also an example of a learned metric: it uses transfer learning to train a model that evaluates NLG tasks.

BLEURT, as the name suggests, is based on the BERT architecture. The paper also talks about taking a pre-trained BERT and using the classification token to predict a score directly, but this would not be an optimal way of evaluating. The authors wanted a metric that can be universal, and a model that is able to output scores even when trained on a very small dataset.

The authors follow three steps to create this metric model:
1. BERT training (using BERT, which has been trained on a huge corpus)
2. Synthetic data training
3. Fine-tuning

The first step is simply training BERT as described in the original paper.

The second step is pretty interesting and this is what makes this paper unique.

For synthetic data training, they set up a BERT model that learns to predict different types of scores, such as BLEU or ROUGE, over millions of synthetic sentence pairs (z, z̃).

In simple words, you give the BERT model a sentence pair (z, z̃) and expect the classification token to output a score. The gold standard for this output is the actual BLEU, ROUGE, or other metric score for that pair.

To create these synthetic pairs, they take sentences from Wikipedia as z and perturb each z to obtain z̃.

The perturbation techniques include mask-filling with BERT, back-translation, and dropping words. Once the dataset is ready, they feed it to BERT, as mentioned, and train it to predict scores for six different evaluation metrics.
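Of the perturbation techniques, word dropping is the easiest to sketch. The Python below is a rough illustration, not the authors' implementation: the drop probability, the function names, and the "fraction of words kept" target (a stand-in for a real signal such as BLEU or ROUGE) are all assumptions.

```python
import random


def drop_words(sentence: str, drop_prob: float = 0.3, seed: int = 0) -> str:
    """Randomly drop words from a sentence, always keeping at least one."""
    rng = random.Random(seed)
    words = sentence.split()
    kept = [word for word in words if rng.random() >= drop_prob]
    if not kept and words:
        kept = [rng.choice(words)]
    return " ".join(kept)


def make_synthetic_example(z: str, drop_prob: float = 0.3, seed: int = 0) -> dict:
    """Build one (z, z_tilde) pair with a toy pre-training target."""
    z_tilde = drop_words(z, drop_prob=drop_prob, seed=seed)
    # Hypothetical stand-in target: share of the original words that survived.
    target = len(z_tilde.split()) / len(z.split())
    return {"z": z, "z_tilde": z_tilde, "target": target}


example = make_synthetic_example("the quick brown fox jumps over the lazy dog", seed=1)
print(example["z_tilde"], example["target"])
```

In the real setup the target would be a set of automatic scores computed on the pair, and the model would be trained to predict them from (z, z̃).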

The third and last step is fine-tuning, where you provide some gold-standard examples and the model learns to predict scores for your task.

Hi, I am Priyanshu, an active contributor to an open-source platform where AI developers share their experiences of designing and building complex algorithms, along with the exciting, rarely discussed things we run into while designing AI models. It doesn't matter whether the problem is a simple one or too complicated. We have written on topics that most developers and researchers are actively looking for, ranging from machine learning to programming languages. We write about what we addressed and, more importantly, how we approached the problem.

If you are reading this, you must be wondering what the failure in this metric is. Well, not much, but there is one that I found pretty interesting. The problem lies in the synthetic data training phase, where we generate synthetic pairs through different approaches that are all based on the transformer architecture. Suppose that at ACL'21 a new type of architecture is introduced that makes very different mistakes from those a transformer makes. Would we then have to re-train this metric?

Neural nets are becoming SOTA everywhere and every other paper uses transformers, but don't you think the explainability they provide is ZERO? Let's hope that we keep working on transformers and that no other type of architecture comes along, because this SOTA BLEURT would surely fail on it and hence would not be so robust.

One important aspect of BLEURT is its results, but they need special attention, so I would rather discuss them in my second article.

If you like the idea of this platform, do follow me for more such articles.