Source: Deep Learning on Medium
Google Fights Language Intelligence Inequality with Massively Scalable, Multilingual Models
Google researchers designed a universal machine translation system that can work across a hundred languages.
Natural language systems have been at the center of the artificial intelligence (AI) renaissance of the last few years. However, the benefits of language intelligence programs have been limited to the most widely spoken languages in the world. It is trivial to get Alexa to understand almost anything in English, but try one of the Kru languages spoken in West Africa and the story is completely different. While there are over 7000 languages spoken in the world, just about 20 of them account for more than half of the world’s population. In the context of AI, designing linguistic intelligence models that can work seamlessly with data-scarce languages is a priority for keeping the space equitable. Recently, Google published several whitepapers that detail some novel efforts to design multilingual systems that can scale across hundreds of languages.
The challenge of designing language intelligence systems that work efficiently with data-limited languages is far from trivial. Areas such as speech recognition or machine translation are notorious for requiring large volumes of labeled data. How can we adapt those models to data-scarce languages without sacrificing quality? Could we extrapolate knowledge from data-rich languages onto other languages? Can we achieve decent levels of quality in language intelligence systems for data-scarce languages? Google has been slowly tackling those challenges by building a more effective approach to machine translation.
Building Massively Multilingual Systems
In “Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges”, Google researchers explore the idea of building a semi-universal neural machine translation (NMT) model. Specifically, they designed a single NMT model that could be trained on 25+ billion sentence pairs, covering 100+ languages to and from English, with 50+ billion parameters.
The goal of Google’s paper was not so much to propose a novel architecture for NMT systems but rather to understand their behavior. Despite the appeal of multi-language NMT models, most approaches have been developed under constrained settings; their efficacy is yet to be demonstrated in real-world scenarios.
Designing universal NMT models is an inductive bias problem. More specifically, in the setting of multilingual NMT, the underlying inductive bias is that the learning signal from one language should benefit the quality of other languages. From that perspective, the expectation is that as we increase the number of languages, the learner will generalize better due to the increased amount of information added by each language. Google’s research paper summarizes the challenges of building a universal NMT: multi-domain and multi-task learning across a very large number of domains/tasks, with wide data imbalance, and very heterogeneous intertask relationships arising from dataset noise and topic/style discrepancies, differing degrees of linguistic similarity, etc.
Despite the overwhelming nature of that description, maybe the most important aspect to realize is that those challenges are typically tackled by individual models, and we don’t know whether it is possible to address all of them in a single universal NMT. In general, a universal NMT approach should have the following characteristics:
• Maximum throughput in terms of the number of languages considered within a single model.
• Maximum inductive (positive) transfer towards low-resource languages.
• Minimum interference (negative transfer) for high-resource languages.
• Robust multilingual NMT models that perform well in realistic, open-domain settings.
For the experiments, Google used a dataset of 25+ billion examples from 103 languages. One of the key discoveries of the Google experiment is that NMT systems can learn shared representations across similar languages. This could be used to transfer knowledge from one language learner to another. The following figure shows a cluster analysis of the different languages involved in the experiment.
To achieve multilingual effectiveness, Google adapted the traditional Transformer architecture by adding an adaptation layer that allows the model to specialize on different tasks. Google also leveraged GPipe to train 128-layer Transformers with over 6 billion parameters.
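To make the idea of an adaptation layer more concrete, here is a minimal sketch of a small bottleneck adapter with a residual connection, written in NumPy. It is an illustration under stated assumptions, not Google's implementation: the `Adapter` class, the dimensions, the ReLU non-linearity, the zero initialization and the language-pair keys are all hypothetical choices made for this example.

```python
import numpy as np


class Adapter:
    """Hypothetical bottleneck adapter applied to the output of a shared layer."""

    def __init__(self, d_model, d_bottleneck, rng):
        # Down-projection into a small bottleneck.
        self.w_down = rng.normal(0.0, 0.02, size=(d_model, d_bottleneck))
        # Zero-initialized up-projection: the adapter starts as the identity,
        # so the shared model's behavior is unchanged before any training.
        self.w_up = np.zeros((d_bottleneck, d_model))

    def __call__(self, x):
        h = np.maximum(x @ self.w_down, 0.0)  # down-project, then ReLU
        return x + h @ self.w_up              # up-project, then residual add


rng = np.random.default_rng(0)

# One small adapter per task (here: per language pair); the large shared
# model would stay the same for every pair.
adapters = {pair: Adapter(d_model=8, d_bottleneck=2, rng=rng)
            for pair in ("en-yo", "en-fr")}

hidden = rng.normal(size=(5, 8))  # 5 token vectors from the shared model
out = adapters["en-yo"](hidden)
```

The appeal of this kind of design is that each task only adds a handful of parameters on top of the shared network, so specialization does not require a separate full model per language.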
After the model was trained, it showed a strong positive transfer from high-resource towards low-resource languages, dramatically improving the translation quality of 30+ languages at the tail of the distribution by an average of 5 BLEU points. This finding hints that massively multilingual models are effective at generalization, and capable of capturing the representational similarity across a large body of languages.
The initial generalization works well for a small number of low-resource languages, but it has some unexpected side effects as this number starts to increase. Specifically, the quality of high-resource language translations starts to decline. To counter this, Google used the aforementioned GPipe to increase the representational capacity of the neural networks, making them bigger by increasing the number of model parameters in order to improve the quality of translation for high-resource languages. Another interesting innovation was to substitute the basic feed-forward layers of the Transformer architecture with sparsely-gated mixture-of-experts layers. This drastically scaled up the model capacity, allowing the researchers to train the model with over 50 billion parameters, which further improved translation quality across the board.
Google’s approach to a universal NMT is an initial step towards expanding the capabilities of machine translation to thousands of languages. With over 7000 spoken languages in the world, it is computationally impractical to train individual neural translation models on every domain or task for those languages. From that perspective, a universal NMT is the only path forward to avoid inequality in machine translation. If we also factor in that over half of those 7000 languages won’t exist by the end of the century, a universal NMT is not only a path to better intelligence but also a way to preserve the history of different cultures.
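The sparsely-gated mixture-of-experts idea can be sketched in a few lines of NumPy. The version below is a simplified illustration, not the layer used in the paper: a gating matrix scores every expert for each token, only the top `k` experts are evaluated, and their outputs are combined with softmax weights. The `SparseMoE` class and all sizes are hypothetical, and extras used in real systems, such as noise in the gate and load-balancing losses, are left out.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class SparseMoE:
    """Hypothetical top-k mixture of feed-forward experts."""

    def __init__(self, d_model, d_hidden, n_experts, k, rng):
        self.k = k
        self.w_gate = rng.normal(0.0, 0.02, (d_model, n_experts))
        self.w_in = rng.normal(0.0, 0.02, (n_experts, d_model, d_hidden))
        self.w_out = rng.normal(0.0, 0.02, (n_experts, d_hidden, d_model))

    def route(self, x):
        logits = x @ self.w_gate                         # (tokens, n_experts)
        top = np.argsort(logits, axis=-1)[:, -self.k:]   # k highest-scoring experts
        weights = softmax(np.take_along_axis(logits, top, axis=-1))
        return top, weights

    def __call__(self, x):
        top, weights = self.route(x)
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            # Only the k selected experts run for this token.
            for e, w in zip(top[t], weights[t]):
                h = np.maximum(x[t] @ self.w_in[e], 0.0)  # expert feed-forward + ReLU
                out[t] += w * (h @ self.w_out[e])
        return out


rng = np.random.default_rng(0)
moe = SparseMoE(d_model=8, d_hidden=16, n_experts=4, k=2, rng=rng)
tokens = rng.normal(size=(4, 8))
top, weights = moe.route(tokens)
out = moe(tokens)
```

Because each token only activates `k` of the experts, the total parameter count can grow with the number of experts while the compute per token stays roughly constant, which is what makes very large models practical.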