Multilingual Transformers

Image obtained by translating “multilingual transformers” with https://translatr.varunmalhotra.xyz/ and generating a word cloud with https://www.wordclouds.com/

Last year, we saw rapid improvements in transformer architectures. With the GLUE benchmark being the main reference point for the state of the art in language understanding tasks, most research efforts focused on English data. BERT, RoBERTa, DistilBERT, XLNet — which one to use? provides an overview of recent transformer architectures and their pros and cons.

It is challenging to keep track of the GLUE leaderboard because progress on language understanding tasks is so fast-paced: every month a different team takes the top position.

Snapshot of the GLUE leaderboard (beginning of January 2020), https://gluebenchmark.com/leaderboard/

At the same time, transformer architectures have been applied to multilingual tasks. To evaluate these tasks, the approaches discussed here use the cross-lingual Natural Language Inference (XNLI) corpus, which consists of labelled sentences in 15 languages. Each data point consists of a Premise and a Hypothesis, and each pair is labelled for textual entailment, i.e. how the Hypothesis is related to the Premise.

Examples from XNLI paper

Other possible labels are “contradictory” and “neutral”.

From the Cross-Lingual NLI Corpus (XNLI)
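
As a concrete illustration of the data format, the sketch below loads a few French XNLI examples with the Hugging Face datasets package; the dataset name “xnli”, the field names and the label order are assumptions based on the public dataset card rather than details from the original paper.

```python
# Minimal sketch: inspect a few XNLI examples with the Hugging Face
# `datasets` package (dataset name, field names and label order are
# assumptions based on the public "xnli" dataset card).
from datasets import load_dataset

# Each language has its own configuration, e.g. "fr" for French.
xnli_fr = load_dataset("xnli", "fr", split="validation")

label_names = ["entailment", "neutral", "contradiction"]  # assumed label order

for example in xnli_fr.select(range(3)):
    print("Premise:   ", example["premise"])
    print("Hypothesis:", example["hypothesis"])
    print("Label:     ", label_names[example["label"]])
```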

Currently, there is no agreed-upon benchmark for multilingual understanding tasks, and the XNLI data set seems to be the main reference for tracking the evolution of multilingual models. This note presents a brief overview of the evolution of multilingual transformers for multilingual language understanding.

M-BERT (Multilingual BERT)

Very soon after proposing BERT, Google Research introduced a multilingual version of BERT capable of working with more than 100 languages.

References:

Highlights:

  • 110k shared WordPiece vocabulary across all 104 languages. Low-resource languages are up-sampled (see the tokenizer sketch after this list).
  • It provides some sort of shared representation across different languages.
This plot is a proxy for estimating the similarity between representations in two different languages: e.g. 100% on EN-DE would mean that English and German are mapped to the same representation. More details in How multilingual is Multilingual BERT?
  • However, the model is not explicitly trained to have shared representations across languages, so the result above is somewhat surprising.
From How multilingual is Multilingual BERT?
  • Later results suggest that lexical overlap between languages plays little role in cross-lingual performance.
  • Instead, a deeper network provides better cross-lingual performance.
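
To see the shared WordPiece vocabulary at work, the short sketch below tokenizes an English and a German sentence with the released bert-base-multilingual-cased checkpoint; the example sentences are purely illustrative.

```python
# Minimal sketch: a single shared WordPiece vocabulary covers all 104
# languages, so one tokenizer handles English and German alike.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

print(tokenizer.vocab_size)  # size of the shared WordPiece vocabulary
print(tokenizer.tokenize("The cat sits on the mat."))
print(tokenizer.tokenize("Die Katze sitzt auf der Matte."))
```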

Resources needed:

Pre-trained on 4 to 16 Cloud TPUs.

License:

Apache License 2.0

XLM (Cross-lingual Language Model)

This model was proposed by researchers at Facebook at the beginning of 2019.

References:

Highlights:

  • Trained on parallel corpora using an approach called Translation Language Modelling (TLM); see the sketch after this list.
Example of a parallel sentence pair in English and French. From Cross-lingual Language Model Pretraining
  • 80k BPE (Byte Pair Encoding) tokens trained on a 95k vocabulary. BPE is very similar to the WordPiece tokenization approach, link. BPE works by merging characters together in a hierarchical fashion and seems to be more commonly applied to cross-lingual models.
  • On the XNLI benchmark, it achieves very good zero-shot performance, and even better performance when translated data is used during training.
From Cross-lingual Language Model Pretraining
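
TLM concatenates a sentence with its translation and masks tokens on both sides, so the model can rely on context from either language to recover them. The snippet below is a rough, simplified sketch of how such a training example could be built; the separator handling, masking rate and helper names are assumptions, not the official XLM implementation.

```python
# Rough sketch of a TLM training example: concatenate a parallel sentence
# pair and mask random tokens on both sides, so the model can use the other
# language as context. Simplified; not the official XLM code.
import random

MASK = "<mask>"

def make_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    tokens = src_tokens + ["</s>"] + tgt_tokens       # concatenated pair
    # In XLM, positions restart at 0 for the translated sentence, and
    # language embeddings mark which half each token belongs to.
    positions = list(range(len(src_tokens) + 1)) + list(range(len(tgt_tokens)))
    langs = ["en"] * (len(src_tokens) + 1) + ["fr"] * len(tgt_tokens)

    inputs, targets = [], []
    for tok in tokens:
        if tok != "</s>" and rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)       # token the model must predict
        else:
            inputs.append(tok)
            targets.append(None)      # excluded from the loss
    return inputs, targets, positions, langs

en = "the cat sits on the mat".split()
fr = "le chat est assis sur le tapis".split()
inputs, targets, positions, langs = make_tlm_example(en, fr)
print(inputs)
print(targets)
```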

Resources needed:

License:

Attribution-NonCommercial 4.0 International

XLM-R (XLM-RoBERTa)

Researchers at Facebook proposed this model at the end of 2019, following in the footsteps of RoBERTa. As with RoBERTa, the main contributions are about choosing a better training setup. (As a side note, even though RoBERTa pushed performance up, it was not considered to provide enough of a technical contribution to be accepted at ICLR 2020.)

References:

Highlights:

  • More data and more computational power!
XLM was trained only on Wikipedia data, whereas XLM-R was trained on CommonCrawl data. From Unsupervised Cross-lingual Representation Learning at Scale
  • SentencePiece tokenization with a 250k vocabulary. They also use a unigram language model for SentencePiece rather than BPE.
  • No language embeddings, to better deal with code-switching (i.e. text that alternates between languages); see the sketch after this list.
  • Achieves state-of-the-art results (as of the end of 2019).
From Unsupervised Cross-lingual Representation Learning at Scale
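
Since XLM-R drops language embeddings, the same model can be fed mixed-language (code-switched) text without specifying which language it is reading. A minimal sketch using the publicly released xlm-roberta-base checkpoint (the code-switched sentence is only an illustrative example):

```python
# Minimal sketch: XLM-R uses one SentencePiece vocabulary (~250k tokens) and
# no language embeddings, so code-switched text needs no language id.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

# Code-switched sentence mixing English and Spanish (illustrative example).
text = "I really enjoyed la película, pero el final was confusing."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```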

Resources needed:

License:

Attribution-NonCommercial 4.0 International

Summary and Outlook

XLM-R seems to be the best solution to date. It is very possible that the TLM (Translation Language Modelling) approach to training multilingual transformers will be combined with other technologies; in particular, it is easy to foresee TLM being combined with the techniques at the top of the GLUE leaderboard. In the machine learning community, there is still much interest in transformers: for example, ALBERT and ALICE have recently been accepted at ICLR 2020.

The multilingual transformers discussed here can be found pre-trained in Google’s and Facebook’s repositories, respectively:

  • M-BERT from Google, link.
  • XLM and XLM-R from Facebook, link.

All the models can also be very easily tested using the HuggingFace Transformers library, which is written in PyTorch and released under the Apache License 2.0.
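
For instance, a quick way to try one of these checkpoints is the fill-mask pipeline; a minimal sketch with xlm-roberta-base (the French example sentence is only for illustration):

```python
# Minimal sketch: quick test of a multilingual checkpoint with the
# HuggingFace Transformers fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

# XLM-R's mask token is "<mask>".
for prediction in fill_mask("Le chat est assis sur le <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```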