XLM-RoBERTa: Unsupervised Cross-lingual Representation Learning at Scale

The original article was published by Rohan Jagtap in Artificial Intelligence on Medium.

The authors have additionally built a clean CommonCrawl corpus covering 100 languages. For filtering, they used an internal language identification model in addition to the fastText one.

Building such a huge corpus dramatically increases the amount of available training data, especially for low-resource languages like Burmese and Swahili.
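The core of such corpus cleaning is keeping only the lines confidently identified as the target language. Below is a minimal, self-contained sketch of that filtering step; the real pipeline would plug in fastText's pretrained LID model (shown in a comment), while a dummy predictor stands in here so the sketch runs on its own:

```python
# Sketch of per-line language filtering for corpus building.
# In practice the predictor would wrap fastText's pretrained LID model, e.g.:
#   import fasttext
#   lid = fasttext.load_model("lid.176.bin")
#   def predict(line):
#       labels, probs = lid.predict(line)
#       return labels[0].replace("__label__", ""), probs[0]

def filter_corpus(lines, predict, target_lang, min_conf=0.5):
    """Keep lines whose predicted language matches target_lang
    with confidence at least min_conf."""
    kept = []
    for line in lines:
        lang, conf = predict(line)
        if lang == target_lang and conf >= min_conf:
            kept.append(line)
    return kept

# Dummy predictor for illustration only: "detects" Swahili by a marker word.
def dummy_predict(line):
    return ("sw", 0.9) if "habari" in line else ("en", 0.9)

lines = ["habari ya asubuhi", "good morning", "habari gani"]
print(filter_corpus(lines, dummy_predict, "sw"))  # → ['habari ya asubuhi', 'habari gani']
```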

Fine Tuning

XNLI (Cross-lingual Natural Language Inference)

The model is evaluated on the XNLI dataset for cross-lingual transfer from English to the other languages. It is also evaluated in the following translation-based settings:

  1. translate-test: Dev and test sets are translated to English.
  2. translate-train: The English training set is machine-translated to each language.
  3. translate-train-all: The multilingual model is finetuned on a concatenation of all the training sets from translate-train.
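The three settings differ only in which data gets machine-translated and what the model is fine-tuned on. A toy sketch of assembling the training sets, with a hypothetical `translate` function standing in for a real MT system (an assumption for illustration, not the paper's system):

```python
# Hypothetical stand-in for a machine translation system (illustration only).
def translate(example, src, tgt):
    return {"text": f"[{src}->{tgt}] " + example["text"], "label": example["label"]}

english_train = [{"text": "a premise / hypothesis pair", "label": 0}]
languages = ["fr", "sw", "ur"]

# translate-train: machine-translate the English training set into each language,
# then fine-tune one model per language on its translated set.
translate_train = {
    lang: [translate(ex, "en", lang) for ex in english_train] for lang in languages
}

# translate-train-all: fine-tune a single multilingual model on the
# concatenation of all the translated training sets.
translate_train_all = english_train + [
    ex for lang in languages for ex in translate_train[lang]
]

# translate-test: the model is fine-tuned on English only; at evaluation time
# the foreign dev/test examples are translated *into* English instead.
print(len(translate_train_all))  # 1 English set + 3 translated copies = 4 examples
```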

NER (Named Entity Recognition)

The CoNLL-2002 and CoNLL-2003 datasets are used for English, Dutch, Spanish, and German. The model is finetuned in the following ways:

  1. Trained on the English set to evaluate cross-lingual transfer.
  2. On each set to evaluate per-language performance.
  3. On all sets to evaluate multi-lingual learning.
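One practical detail of fine-tuning a subword-based model like XLM-RoBERTa on CoNLL-style NER is aligning the word-level tags to subword tokens. A common recipe (an assumption here, not spelled out in the article) labels the first piece of each word and masks continuation pieces with -100 so the loss ignores them. A minimal sketch with a toy tokenizer:

```python
# Toy subword tokenizer (illustration only): split words longer than
# 4 characters into two pieces.
def toy_subword_tokenize(word):
    return [word] if len(word) <= 4 else [word[:4], "##" + word[4:]]

def align_labels(words, tags, ignore_index=-100):
    """Label the first subword of each word; mask the rest with ignore_index
    so they are skipped by the loss."""
    tokens, labels = [], []
    for word, tag in zip(words, tags):
        pieces = toy_subword_tokenize(word)
        tokens.extend(pieces)
        labels.extend([tag] + [ignore_index] * (len(pieces) - 1))
    return tokens, labels

words = ["George", "lives", "in", "Amsterdam"]
tags = ["B-PER", "O", "O", "B-LOC"]
tokens, labels = align_labels(words, tags)
print(tokens)  # ['Geor', '##ge', 'live', '##s', 'in', 'Amst', '##erdam']
print(labels)  # ['B-PER', -100, 'O', -100, 'O', 'B-LOC', -100]
```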

Cross-lingual Question Answering

The MLQA benchmark is used; it extends the standard SQuAD benchmark to Spanish, German, Arabic, Hindi, Vietnamese, and Chinese.
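Extractive QA models of this kind score every token as a possible answer start and end, and the predicted answer is the highest-scoring valid span. A self-contained sketch of that decoding step (the scores below are made up for illustration):

```python
# Pick the best answer span from per-token start/end scores,
# requiring start <= end and capping the span length.
def best_span(start_scores, end_scores, max_len=15):
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best

# Toy context with made-up scores peaking on "CommonCrawl data".
context = "XLM-R was trained on CommonCrawl data in 100 languages".split()
start = [0.1, 0.0, 0.0, 0.2, 5.0, 0.1, 0.3, 0.0, 0.1]
end   = [0.0, 0.1, 0.0, 0.1, 0.2, 4.0, 0.1, 0.2, 0.0]
i, j = best_span(start, end)
print(" ".join(context[i:j + 1]))  # → "CommonCrawl data"
```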

GLUE Benchmark

And finally, to check that cross-lingual pretraining does not come at the cost of strong English performance, XLM-RoBERTa is evaluated on the standard GLUE benchmark.


Curse of Multilinguality (from the XLM-RoBERTa Paper)

Model capacity refers to the number of parameters in the model. For a fixed capacity, the idea is that increasing the number of languages improves cross-lingual transfer to similar low-resource languages. But because the fixed capacity must now be shared across more languages, this ‘dilution’ eventually degrades overall performance on downstream tasks. The dilution therefore trades off against model capacity.

The experiments show that for a model of a given capacity, increasing the number of languages improves performance on cross-lingual tasks up to a point, beyond which performance degrades. This trade-off is termed the ‘curse of multilinguality’, and it can be alleviated by increasing the model capacity.
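The shape of this trade-off can be illustrated with a purely toy model (the functional forms below are assumptions for illustration, not from the paper): transfer benefit grows sub-linearly with the number of languages, while per-language capacity shrinks as capacity is split across them, so their sum peaks at an intermediate language count, and a larger capacity pushes the peak out:

```python
import math

def toy_score(n_languages, capacity):
    # Assumed, purely illustrative functional forms:
    # transfer benefit grows sub-linearly with the number of languages...
    transfer = math.log(1 + n_languages)
    # ...while each language's share of the fixed capacity shrinks (dilution).
    dilution = n_languages / capacity
    return transfer - dilution

for capacity in (50, 200):
    best_n = max(range(1, 201), key=lambda n: toy_score(n, capacity))
    print(f"capacity={capacity}: score peaks at {best_n} languages")
```

In this toy setting the curve for the smaller capacity peaks at fewer languages than the curve for the larger one, mirroring the paper's observation that adding capacity alleviates the curse.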


XLM-RoBERTa achieved state-of-the-art results on the cross-lingual tasks described above.

Results on XNLI: