The authors have additionally built a clean CommonCrawl corpus in 100 languages, using an internal language identification model in addition to fastText for filtering. Building such a large corpus greatly increases the amount of available training data, especially for low-resource languages like Burmese and Swahili.
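To make the filtering step concrete, here is a minimal sketch of line-level language identification with fastText. It uses the publicly available lid.176.bin model as a stand-in; the paper's internal LID model and exact pipeline are not public, so the threshold and filtering logic below are assumptions for illustration.

```python
# Hypothetical corpus-filtering step: keep a line only if fastText's public
# LID model (lid.176.bin) predicts the target language with enough confidence.
import fasttext

lid = fasttext.load_model("lid.176.bin")  # downloadable from fasttext.cc

def keep_line(line: str, target_lang: str, min_prob: float = 0.5) -> bool:
    """Return True if the line is predicted to be in target_lang."""
    labels, probs = lid.predict(line.replace("\n", " "))
    lang = labels[0].replace("__label__", "")
    return lang == target_lang and probs[0] >= min_prob

# Example: extract Swahili lines from a raw text shard
raw = ["Habari ya asubuhi, dunia.", "Good morning, world."]
swahili_lines = [l for l in raw if keep_line(l, "sw")]
print(swahili_lines)
```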
XNLI (Cross-lingual Natural Language Inference)
The XNLI dataset is used. The model is evaluated on cross-lingual transfer from English to other languages; a fine-tuning sketch of this setting follows the list below. It is also evaluated in the following machine-translation settings:
- translate-test: Dev and test sets are translated to English.
- translate-train: The English training set is machine-translated to each language.
- translate-train-all: The multilingual model is finetuned on a concatenation of all the training sets from translate-train.
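As a concrete illustration of the cross-lingual transfer setting, the sketch below fine-tunes XLM-R on English XNLI data and evaluates it zero-shot on Swahili with Hugging Face Transformers and Datasets. This is not the paper's original training code; the checkpoint, dataset names, and hyperparameters are illustrative assumptions.

```python
# Hedged sketch of zero-shot cross-lingual transfer on XNLI:
# fine-tune on English, evaluate on Swahili. Hyperparameters are illustrative.
import numpy as np
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3)  # entailment / neutral / contradiction

train_en = load_dataset("xnli", "en", split="train")  # English training data
test_sw = load_dataset("xnli", "sw", split="test")    # zero-shot target language

def encode(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, padding="max_length", max_length=128)

train_en = train_en.map(encode, batched=True)
test_sw = test_sw.map(encode, batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlmr-xnli-en",
                           per_device_train_batch_size=32,
                           num_train_epochs=2),
    train_dataset=train_en,
    eval_dataset=test_sw,
    compute_metrics=accuracy,
)
trainer.train()
print(trainer.evaluate())  # zero-shot accuracy on Swahili
```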
NER (Named Entity Recognition)
The CoNLL-2002 and CoNLL-2003 datasets are used for English, Dutch, Spanish, and German. The model is finetuned in the following ways (see the sketch after this list):
- Trained on the English set to evaluate cross-lingual transfer.
- On each set to evaluate per-language performance.
- On all sets to evaluate multi-lingual learning.
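A rough sketch of the token-classification setup is shown below, using the Hugging Face conll2003 dataset and XLM-R. The label-alignment helper is the standard subword-alignment trick, not the paper's exact preprocessing; dataset and checkpoint names here are assumptions.

```python
# Hedged sketch of the NER setup: XLM-R as a token classifier on CoNLL-2003
# (English). The same recipe applies to the Dutch/Spanish/German CoNLL-2002 sets.
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification

conll_en = load_dataset("conll2003")
label_names = conll_en["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(label_names))

def align_labels(example):
    """Tokenize pre-split words and copy each word's tag to its subwords."""
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = [example["ner_tags"][w] if w is not None else -100
                     for w in enc.word_ids()]
    return enc

tokenized = conll_en.map(align_labels)
# Fine-tune on English only (cross-lingual transfer), on each language's own
# set (per-language performance), or on all sets concatenated (multilingual).
```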
Cross-lingual Question Answering
The MLQA benchmark is used, which extends the standard SQuAD benchmark to Spanish, German, Arabic, Hindi, Vietnamese, and Chinese.
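To make this setup concrete, here is a minimal sketch assuming the Hugging Face squad and mlqa datasets: fine-tune an XLM-R question-answering head on English SQuAD, then evaluate zero-shot on an MLQA split in another language. The dataset and config names, and the omitted training loop, are assumptions for illustration, not the paper's pipeline.

```python
# Hedged sketch of the cross-lingual QA setup: train an extractive-QA head on
# English SQuAD, then evaluate zero-shot on an MLQA split (German shown here).
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForQuestionAnswering.from_pretrained("xlm-roberta-base")

squad_en = load_dataset("squad", split="train")             # English training data
mlqa_de = load_dataset("mlqa", "mlqa.de.de", split="test")  # German eval split

# After fine-tuning on squad_en (standard extractive-QA loop, omitted here),
# score the model on mlqa_de to measure cross-lingual transfer.
sample = mlqa_de[0]
inputs = tokenizer(sample["question"], sample["context"],
                   truncation=True, max_length=384, return_tensors="pt")
outputs = model(**inputs)  # start/end logits; meaningful only after fine-tuning
```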
Finally, to check that multilingual training does not sacrifice monolingual performance, XLM-RoBERTa is also evaluated on the standard English GLUE benchmark.
Model capacity refers to the number of parameters in the model. For a fixed capacity, adding more languages initially improves cross-lingual transfer, since low-resource languages benefit from similar higher-resource ones. Beyond a point, however, the per-language capacity becomes 'diluted' and overall downstream performance degrades.
The experiments show exactly this: for a model of a given capacity, increasing the number of languages improves performance on cross-lingual tasks only up to a point, after which it declines. This trade-off is termed the 'curse of multilinguality', and it can be alleviated by increasing the model capacity.
XLM-RoBERTa achieves state-of-the-art results on all of the aforementioned cross-lingual tasks.
Results on XNLI: