Source: Deep Learning on Medium
Last year, we saw rapid improvements in transformer architectures. Since the GLUE benchmark is the main reference point for the state of the art in language understanding tasks, most research efforts have focused on English data. BERT, RoBERTa, DistilBERT, XLNet — which one to use? provides an overview of recent transformer architectures and their pros and cons.
It is challenging to keep track of the GLUE leaderboard because progress on language understanding tasks is so fast-paced: every month a different team takes the top position.
At the same time, transformer architectures have been applied to multilingual tasks. To evaluate them, the approaches discussed here use the Cross-lingual Natural Language Inference (XNLI) corpus, which consists of labelled sentence pairs in 15 languages. Each data point consists of a premise and a hypothesis, labelled for textual entailment, i.e. how the hypothesis relates to the premise. The possible labels are entailment, contradiction, and neutral.
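To make the task concrete, here is a tiny, hand-made batch of XNLI-style premise/hypothesis pairs. The field names and sentences are invented for illustration; the real corpus uses its own schema and files:

```python
# Illustrative XNLI-style data points (invented examples, not from the corpus).
xnli_examples = [
    {"premise": "The cat is sleeping on the sofa.",
     "hypothesis": "An animal is resting.",
     "label": "entailment"},      # hypothesis follows from the premise
    {"premise": "The cat is sleeping on the sofa.",
     "hypothesis": "The cat is chasing a mouse.",
     "label": "contradiction"},   # hypothesis conflicts with the premise
    {"premise": "The cat is sleeping on the sofa.",
     "hypothesis": "The sofa is blue.",
     "label": "neutral"},         # premise neither confirms nor denies it
]

LABELS = {"entailment", "contradiction", "neutral"}
assert all(ex["label"] in LABELS for ex in xnli_examples)
```

A model is scored on how often it predicts the correct one of these three labels, typically after training on English NLI data and evaluating on the other 14 languages (the zero-shot setting).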
Currently, there is no agreed-upon benchmark for multilingual understanding tasks; the XNLI data set seems to be the main reference for tracking the evolution of multilingual models. This note presents a brief overview of the evolution of multilingual transformers for multilingual language understanding.
M-BERT (Multilingual BERT)
Very soon after proposing BERT, Google Research introduced a multilingual version of BERT capable of working with more than 100 languages.
- 110k-token shared WordPiece vocabulary across all 104 languages. Low-resource languages are up-sampled.
- It yields representations that are, to some extent, shared across languages.
- However, the model is not explicitly trained to produce shared representations across languages, so this result is somewhat surprising.
- Later results suggest that lexical overlap between languages plays little role in cross-lingual performance.
- Instead, a deeper network provides better cross-lingual performance.
M-BERT was pre-trained on 4 to 16 Cloud TPUs.
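The up-sampling of low-resource languages is commonly done by exponentiated smoothing of the sampling distribution (the BERT repository describes this scheme; the exponent and corpus sizes below are assumptions for illustration, not the released values). A minimal sketch:

```python
def smoothed_sampling_probs(corpus_sizes, s=0.7):
    """Exponentiated smoothing: raise each language's share of the data to
    the power s (< 1), then renormalise. Because x**s flattens differences,
    low-resource languages get sampled more often than their raw share."""
    total = sum(corpus_sizes.values())
    weights = {lang: (n / total) ** s for lang, n in corpus_sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Toy corpus sizes in sentences (assumption): English vs Swahili.
sizes = {"en": 1_000_000, "sw": 10_000}
probs = smoothed_sampling_probs(sizes)

# Swahili's sampling probability now exceeds its raw ~1% share of the data.
assert probs["sw"] > 10_000 / 1_010_000
```

With s = 1 the scheme reduces to sampling proportionally to corpus size; smaller s pushes the distribution toward uniform across languages.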
XLM (Cross-lingual Language Model)
This model was proposed by researchers at Facebook at the beginning of 2019.
- Trained on parallel corpora using an approach called Translation Language Modelling (TLM).
- 80k BPE (Byte Pair Encoding) splits trained on a 95k-token vocabulary. BPE is very similar to the WordPiece tokenization approach: it works by merging characters in a hierarchical fashion, and it seems to be more commonly applied to cross-lingual models.
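The hierarchical merging behind BPE can be sketched in a few lines: repeatedly count adjacent symbol pairs over the corpus and merge the most frequent one. The toy corpus and the number of merge steps below are invented for illustration:

```python
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent-symbol pair frequencies over a word-frequency table."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the chosen pair with its merged symbol."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq
            for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6}
for _ in range(3):                    # perform three merge operations
    stats = get_pair_stats(vocab)
    best = max(stats, key=stats.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
```

In a real tokenizer the learned merge operations (80k of them for XLM) are then replayed, in order, on unseen text to split it into subword units.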
- On the XNLI benchmark, it achieves very good zero-shot performance, and even better performance if translated data is used during training.
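The TLM objective can be sketched as follows: concatenate a parallel sentence pair and mask random positions in both languages, so that the model can attend to the translation when predicting a masked word. This is a simplified sketch (the function name is hypothetical, and the real XLM implementation adds language and position embeddings and BERT's 80-10-10 replacement scheme):

```python
import random

def tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, seed=0):
    """Build one Translation Language Modelling training example:
    concatenate a parallel pair and mask tokens in BOTH languages."""
    rng = random.Random(seed)
    stream = src_tokens + ["</s>"] + tgt_tokens  # separator between languages
    inputs, targets = [], []
    for tok in stream:
        if tok != "</s>" and rng.random() < mask_prob:
            inputs.append("[MASK]")
            targets.append(tok)   # loss is computed on the original token
        else:
            inputs.append(tok)
            targets.append(None)  # no loss at unmasked positions
    return inputs, targets

en = ["the", "cat", "sleeps"]
fr = ["le", "chat", "dort"]
inp, tgt = tlm_example(en, fr)
```

If "cat" is masked, the model can still recover it from the unmasked "chat" on the French side, which is what pushes the two languages toward a shared representation.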
XLM-R (XLM-RoBERTa)
Researchers at Facebook proposed this model at the end of 2019, following in the footsteps of RoBERTa. As with RoBERTa, the main contributions concern a better training setup. (As a side note, even though this pushed performance up, RoBERTa was not considered to provide enough technical contribution to be accepted at ICLR 2020.)
- More data and more computational power!
- SentencePiece tokenization with a 250k-token vocabulary. They also use a unigram language model for SentencePiece rather than BPE.
- No language embeddings, to better deal with code-switching (i.e. text that alternates between different languages).
- Achieves state-of-the-art results (as of the end of 2019).
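Unlike BPE, the unigram language model used by SentencePiece assigns each vocabulary piece a probability and picks the segmentation of a word that maximises the total log-probability, typically with a Viterbi search. A minimal sketch, using a toy vocabulary with made-up log-probabilities rather than a trained model:

```python
import math

def unigram_segment(text, logprob):
    """Viterbi search for the segmentation with the highest total
    unigram log-probability, given a piece -> log-probability table."""
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)  # (best score, backpointer) per prefix
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logprob:
                score = best[start][0] + logprob[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    pieces, pos = [], n                # walk backpointers to recover pieces
    while pos > 0:
        start = best[pos][1]
        pieces.append(text[start:pos])
        pos = start
    return list(reversed(pieces))

# Toy vocabulary with assumed log-probabilities (not a trained model).
vocab = {"un": -2.0, "able": -2.5, "u": -4.0, "n": -4.0,
         "a": -4.0, "b": -4.0, "l": -4.0, "e": -4.0}
assert unigram_segment("unable", vocab) == ["un", "able"]
```

Because larger pieces carry higher probability than spelling a word out character by character, the search prefers "un" + "able" over eight single characters.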
Summary and Outlook
XLM-R seems to be the best solution to date. It is quite possible that the TLM (Translation Language Modelling) approach to training multilingual transformers will be combined with other techniques; in particular, it is easy to foresee a combination of TLM with the technologies at the top of the GLUE leaderboard. In the machine learning community, there is still much interest in transformers: for example, ALBERT and ALICE have recently been accepted at ICLR 2020.
The multilingual transformers discussed here can be found pre-trained in Google’s and Facebook’s repositories, respectively: