BLEU — Bilingual Evaluation Understudy

Source: Deep Learning on Medium

A step-by-step approach to understanding BLEU, a metric for measuring the effectiveness of machine translation (MT).

What will you learn in this post?

  • How do we measure the effectiveness of translating one language to another?
  • What is BLEU, and how do we calculate the BLEU score for a machine translation?
  • The formulae for BLEU: what modified precision, Count clip, and the brevity penalty (BP) are
  • A step-by-step calculation of BLEU using an example
  • Calculating the BLEU score using the Python NLTK library

You are watching a very popular movie in a language that you do not understand, so you read the captions in a language that you know.

How do we know that the translations are good enough to convey the right meaning?

We look at the adequacy, fluency, and fidelity of the translations to judge their effectiveness.

Adequacy measures whether all of the meaning was carried over from the source language to the target language.

Fidelity is the extent to which a translation accurately renders the meaning of the source text.

Fluency measures how grammatically well-formed the sentences are along with ease of interpretation.

Another challenge is that the same sentence can be translated correctly with different word choices and a different word order. Below are a few examples.

Different word choices but conveying the same meaning

I enjoyed the concert

I liked the show

I relished the musical

Different word order conveying the same message

I was late for office due to traffic jam

The traffic jam was responsible for my delay to office

Traffic jam delayed me to office

With all these complexities, how can we measure the effectiveness of a machine translation?

We will use the main idea described by Kishore Papineni et al.

We will measure how close the machine-generated translation is to the reference human translations, while allowing for legitimate differences in word choice and word order.

A few terms used in the context of BLEU

A reference translation is a human translation.

A candidate translation is a machine translation.

To measure the effectiveness of machine translation, we will evaluate how close the machine translation is to the human reference translations using a metric known as BLEU (Bilingual Evaluation Understudy).

Let’s take an example where we have the following reference translations.

  1. I always do.
  2. I invariably do.
  3. I perpetually do.

We have two different candidates from machine translation:

  1. I always invariably perpetually do.
  2. I always do.

Candidate 2, “I always do”, shares the most words and phrases with these three reference translations. We come to this conclusion by comparing the n-gram matches between each candidate translation and the reference translations.

What do we mean by n-gram?

An n-gram is a sequence of n consecutive words, where n is the window size.

Let’s take the sentence, “Once you stop learning, you start dying” to understand n-grams.

[Figure: unigrams, bigrams, and trigrams for the sentence “Once you stop learning, you start dying”]
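As a rough illustration, the n-grams of this sentence can be listed with a few lines of Python. This is a minimal sketch: the helper names `tokenize` and `ngrams` are made up for this example, and the tokenizer simply lowercases the text and drops punctuation, which is an assumption rather than a rule of BLEU.

```python
import re

def tokenize(sentence):
    """Lowercase the sentence and keep runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Once you stop learning, you start dying")
for n in (1, 2, 3):
    # n = 1 gives unigrams, n = 2 bigrams, n = 3 trigrams
    print(n, ngrams(tokens, n))
```

A sentence with 7 words yields 7 unigrams, 6 bigrams, and 5 trigrams, since each extra word in the window removes one possible starting position.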

BLEU compares the n-grams of the candidate translation with the n-grams of the reference translations and counts the number of matches. These matches are independent of the positions where they occur.

The more matches there are between the candidate and the reference translations, the better the machine translation.

Let’s start with a familiar metric: Precision.

In terms of Machine Translation, we define Precision as ‘the count of the number of candidate translation words which occur in any reference translation’ divided by the ‘total number of words in the candidate translation.’

Let’s take an example and calculate the precision for each candidate translation:

  • The precision for candidate 1 is 2/7 (about 28.6%).
  • The precision for candidate 2 is 1 (100%).

Plain precision can be unreasonably high even for translations that we know are not good.
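To see how plain precision can overrate a candidate, here is a minimal sketch that applies the definition above to the earlier “I always do” example. The function name `plain_precision` is illustrative and does not come from any library; lowercasing and whitespace tokenization are assumptions made for simplicity.

```python
def plain_precision(candidate, references):
    """Fraction of candidate words that occur in at least one reference."""
    reference_words = {word for ref in references for word in ref}
    matched = sum(1 for word in candidate if word in reference_words)
    return matched / len(candidate)

references = [s.lower().split() for s in
              ("I always do", "I invariably do", "I perpetually do")]
candidate_1 = "I always invariably perpetually do".lower().split()
candidate_2 = "I always do".lower().split()

print(plain_precision(candidate_1, references))  # 1.0
print(plain_precision(candidate_2, references))  # 1.0
```

Both candidates score 1.0, although candidate 1 stacks three adverbs that no single reference uses together. Plain precision on words alone cannot tell the two apart.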

To solve the issue, we will use modified n-gram precision. It is computed in multiple steps for each n-gram.

Let’s take an example to understand how the modified precision score is calculated. We have three human reference translations and one machine-translated candidate.

We first calculate Count clip for each n-gram using the following steps:

  • Step 1: Count the number of times the n-gram occurs in the candidate translation; this is referred to as Count.
  • Step 2: For each reference sentence, count the number of times the candidate n-gram occurs. As we have three reference translations, we calculate Ref 1 count, Ref 2 count, and Ref 3 count.
  • Step 3: Take the maximum of these per-reference counts. This is known as Max Ref Count.
  • Step 4: Take the minimum of Count and Max Ref Count. This is known as Count clip, as it clips the total count of each candidate n-gram by its maximum reference count.
  • Step 5: Add up the clipped counts of all candidate n-grams.

Below we have the clip counts for unigrams and bigrams.

[Figure: Clip count for unigrams]

[Figure: Clip count for bigrams]

  • Step 6: Finally, divide the sum of the clipped counts by the total (unclipped) number of candidate n-grams to get the modified precision score.

Pₙ = (sum of Count clip over all candidate n-grams) / (total number of candidate n-grams), where Pₙ is the modified precision score for n-grams of order n.

  • The modified precision score for unigrams is 17/18.
  • The modified precision score for bigrams is 10/17.

Summarizing modified precision score

Modified precision Pₙ: the sum of the clipped n-gram counts for all the candidate sentences in the corpus, divided by the total number of candidate n-grams.
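The clipping steps can be sketched in Python as follows. The original figure with the example sentences is not reproduced here, so this sketch assumes the sentences are Example 1 from the Papineni et al. paper listed in the references, which gives the same 17/18 and 10/17 fractions. The helper names are illustrative, lowercasing and stripping the final period are simplifying assumptions, and the candidate is assumed to contain at least n tokens.

```python
from collections import Counter
from fractions import Fraction

def ngram_counts(tokens, n):
    """Count each n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Sum of clipped n-gram counts divided by the number of candidate n-grams."""
    counts = ngram_counts(candidate, n)                      # Count
    ref_counts = [ngram_counts(ref, n) for ref in references]
    clipped_total = 0
    for gram, count in counts.items():
        max_ref_count = max(rc[gram] for rc in ref_counts)  # Max Ref Count
        clipped_total += min(count, max_ref_count)          # Count clip
    return Fraction(clipped_total, sum(counts.values()))

def tokenize(sentence):
    # Assumption: lowercase, drop the final period, split on whitespace.
    return sentence.lower().rstrip(".").split()

references = [tokenize(s) for s in (
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
)]
candidate = tokenize(
    "It is a guide to action which ensures that the military always obeys the commands of the party."
)

print(modified_precision(candidate, references, 1))  # 17/18
print(modified_precision(candidate, references, 2))  # 10/17
```

Only “obeys” fails to match at the unigram level. The word “the” appears three times in the candidate and up to four times in a single reference, so clipping leaves its count at three.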

How does this modified precision score help?

Modified n-gram precision score captures two aspects of translation: adequacy and fluency.

  • A translation using the same words as in the references tends to satisfy adequacy.
  • Longer n-gram matches between the candidate and the reference translations account for fluency.

What happens if the translations are too short or too long?

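We add a brevity penalty to handle translations that are too short. Translations that are too long are already penalized by the modified n-gram precision, so they need no separate penalty.

The brevity penalty (BP) is 1.0 when the candidate translation’s length is the same as the length of any reference translation. The reference length closest to the candidate length is called the “best match length.”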

With the brevity penalty, we see that a high-scoring candidate translation will match the reference translations in length, in word choice, and word order.

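BP decays exponentially as the candidate gets shorter than the reference, and is calculated as follows:

BP = 1 if c > r
BP = e^(1 − r/c) if c ≤ r

r: the effective reference length, i.e. the count of words in the reference translation closest in length to the candidate

c: the count of words in the candidate translation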

Note: Neither the brevity penalty nor the modified n-gram precision directly considers the source length; instead, they only consider the range of reference translation lengths in the target language.
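A minimal sketch of the brevity penalty for a single sentence, assuming r is the best match length and the usual form from Papineni et al. (1 when c > r, otherwise e^(1 − r/c)). The function names, the tie-breaking rule (prefer the shorter reference), the handling of an empty candidate, and the lengths in the usage lines are assumptions made for illustration.

```python
import math

def closest_ref_length(candidate_len, reference_lens):
    """Reference length nearest to the candidate length; ties go to the shorter one."""
    return min(reference_lens, key=lambda r: (abs(r - candidate_len), r))

def brevity_penalty(candidate_len, reference_lens):
    """Return 1.0 if the candidate is longer than r, otherwise exp(1 - r / c)."""
    if candidate_len == 0:
        return 0.0  # assumption: an empty candidate earns no credit
    r = closest_ref_length(candidate_len, reference_lens)
    if candidate_len > r:
        return 1.0
    return math.exp(1 - r / candidate_len)

# Candidate length equals one reference length, so there is no penalty.
print(brevity_penalty(18, [16, 18, 16]))  # 1.0
# Candidate is half as long as the closest reference, so the score is scaled down.
print(brevity_penalty(3, [6, 7]))
```

The penalty never exceeds 1, so it can only lower a score: it leaves adequately long candidates untouched and shrinks the score of candidates shorter than their best match length.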

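Finally, we calculate BLEU:

BLEU = BP × exp( Σ wₙ log Pₙ ), with the sum running over n = 1, …, N

BP: the brevity penalty

N: the maximum n-gram order; we usually use unigrams, bigrams, 3-grams, and 4-grams (N = 4)

wₙ: the weight for each modified precision; by default N is 4 and wₙ is 1/N = 1/4 = 0.25

Pₙ: the modified precision for n-grams of order n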

The BLEU metric ranges from 0 to 1. Few translations attain a score of 1 unless they are identical to one of the reference translations. For this reason, even a human translator will not necessarily score 1.
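Putting the pieces together, BLEU is the brevity penalty times the weighted geometric mean of the modified precisions. The sketch below assumes the precisions and the brevity penalty have already been computed; the function name `bleu_from_parts` is hypothetical, and returning 0 when any precision is 0 is one simple, unsmoothed convention rather than the only option.

```python
import math

def bleu_from_parts(precisions, bp, weights=None):
    """Combine modified precisions P_1..P_N and a brevity penalty into BLEU."""
    n = len(precisions)
    if weights is None:
        weights = [1.0 / n] * n  # uniform weights, w_n = 1/N
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; this unsmoothed sketch returns 0
    log_mean = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return bp * math.exp(log_mean)

# A candidate identical to a reference: every precision is 1 and BP is 1.
print(bleu_from_parts([1.0, 1.0, 1.0, 1.0], bp=1.0))  # 1.0

# Illustration with N = 2, using the unigram and bigram precisions quoted
# earlier (17/18 and 10/17) and assuming BP = 1.
print(bleu_from_parts([17 / 18, 10 / 17], bp=1.0))
```

Because the precisions are combined as a geometric mean, a single zero precision drives the whole score to zero, which is why sentence-level BLEU is often computed with some form of smoothing.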

I hope you now have a good understanding of BLEU.

The BLEU metric is used for

  • Machine Translation
  • Image captioning
  • Text summarization
  • Speech recognition

How can I calculate BLEU in Python?

The NLTK library provides an implementation for calculating the BLEU score.

Importing the required library

import nltk.translate.bleu_score as bleu

Setting up the two candidate translations that we will compare with two reference translations

# Each sentence is split on whitespace, so the final token keeps its period (e.g. 'mat.').
reference_translation = [
    'The cat is on the mat.'.split(),
    'There is a cat on the mat.'.split(),
]
candidate_translation_1 = 'the the the mat on the the.'.split()
candidate_translation_2 = 'The cat is on the mat.'.split()

Calculating the BLEU score for candidate translation 1. This candidate shares no 3-gram or 4-gram with the references, so NLTK may warn about zero higher-order n-gram overlaps.

print("BLEU Score: ", bleu.sentence_bleu(reference_translation, candidate_translation_1))

Calculating the BLEU score for candidate translation 2, where the candidate translation matches one of the reference translations

print("BLEU Score: ", bleu.sentence_bleu(reference_translation, candidate_translation_2))

We can also create our own methods in Python for calculating BLEU, based on the NLTK implementation available on GitHub.

References:

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation.

https://www.statmt.org/book/slides/08-evaluation.pdf

http://www.nltk.org/_modules/nltk/translate/bleu_score.html