One Language Model to Rule Them All

Source: Deep Learning on Medium

OpenAI Masters Many Natural Language Tasks with a Single Unsupervised Model

Natural language understanding (NLU) is one of the richest areas in deep learning, encompassing highly diverse tasks such as reading comprehension, question answering, and machine translation. Traditionally, NLU models focus on solving only one of those tasks and are useless when applied to other NLU domains. Also, NLU models have mostly evolved as supervised learning architectures that require expensive training exercises. Recently, researchers from OpenAI challenged both assumptions in a paper that introduces a single unsupervised NLU model able to achieve state-of-the-art performance on many NLU tasks.

The idea of using unsupervised learning for different NLU tasks has been gaining traction in the last few months. Google recently open sourced BERT, a new library for pre-training models on different NLU tasks. Facebook also ventured into the space with the release of PyText, a PyTorch-based framework for the implementation of simpler NLU workflows. Both frameworks are rooted in the idea that unsupervised models can master many NLU tasks. In the context of deep learning, those types of models are known as unsupervised multitask learners.

Multitask NLU Learning

The idea behind OpenAI’s NLU research is that you can achieve state-of-the-art performance across different NLU tasks without any task-specific training. To achieve that, OpenAI created an NLU model trained to master a single task: given a set of words, predict the next word. OpenAI’s model, called GPT-2, has 1.5 billion parameters and was trained on a dataset of 8 million web pages. The training dataset was also very diverse, which helped the model achieve proficiency across different NLU tasks.
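The single training task can be illustrated with a small sketch (not OpenAI's code): every prefix of a token sequence becomes a context, and the token that follows it becomes the prediction target.

```python
# Illustrative sketch of next-word prediction as a training signal.
# Each prefix of a token sequence is a context; the following token
# is the target the model learns to predict.

def next_word_examples(tokens):
    """Turn one token sequence into (context, next_token) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_word_examples(["the", "cat", "sat", "on", "the", "mat"])
# One such pair is (["the", "cat"], "sat"): given the context
# ["the", "cat"], the model is trained to predict "sat".
```

Trained at scale on diverse web text, this one objective implicitly exposes the model to translation, summarization, and question-answering patterns that appear naturally in the data.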

The idea of learning a master task and transferring that knowledge to specific NLU tasks is compelling but not exactly trivial to implement. For starters, it is unclear what type of optimization objectives are most effective at learning the specific NLU tasks. Secondly, there is no consensus on the most effective way to transfer the learned representations to the target task. Earlier work from OpenAI explored the idea of using a semi-supervised learning architecture for transferring knowledge between different NLU tasks. The GPT-2 model is a continuation of this work. To achieve its goals, GPT-2 leverages a very simple yet effective neural network architecture that can be adapted to many domains.

The Architecture

The GPT-2 architecture is a variation of the famous Transformer architecture proposed by the Google Brain team in their paper “Attention Is All You Need”. At its core, the Transformer architecture provides a generic encoder-decoder mechanism to detect dependencies between inputs and outputs. In the Transformer model, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
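The auto-regressive decoding loop described above can be sketched in a few lines of Python. Here a hypothetical `decode_step` function stands in for a trained Transformer decoder; the point is only the control flow, where each generated symbol is fed back as input for the next step.

```python
# Minimal sketch of auto-regressive decoding (decode_step is a stand-in
# for a trained Transformer decoder, not a real model).

def greedy_decode(decode_step, z, max_len, eos="<eos>"):
    """Generate symbols one at a time, conditioning on the encoder
    output z and all previously generated symbols."""
    ys = []
    for _ in range(max_len):
        y = decode_step(z, ys)  # next symbol given z plus the history so far
        if y == eos:
            break
        ys.append(y)            # feed the new symbol back in at the next step
    return ys

# Toy decode_step that simply echoes the encoded sequence back:
echo = lambda z, ys: z[len(ys)] if len(ys) < len(z) else "<eos>"
print(greedy_decode(echo, ["z1", "z2", "z3"], max_len=10))  # ['z1', 'z2', 'z3']
```

In a real decoder, `decode_step` would run attention over both z and the history and pick the highest-probability next token; the loop structure is the same.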

Building on the Transformer architecture, the OpenAI team created a variation optimized for multitask NLU learning. While the Transformer architecture detects long-term dependencies in textual data, it does nothing in terms of learning specific tasks. The GPT-2 architecture extends the core Transformer model by injecting optimizations for specific NLU tasks. Additionally, GPT-2 optimizes knowledge transfer between the different layers, becoming more robust across the entire spectrum of NLU tasks.

Obviously, GPT-2 is not a magic model and still requires modifications for specific NLU tasks. Some tasks, such as text classification, can be achieved by simply tuning a few layers of the model, while others, such as question answering, require more complex modifications. In general, GPT-2 uses a traversal-style approach, which converts structured inputs into an ordered sequence that the pre-trained model can process.
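A hedged sketch of what a traversal-style conversion might look like: a structured input (here, a passage/question pair) is flattened into one ordered token sequence with delimiter tokens, so the pre-trained language model can consume it unchanged. The delimiter names below are illustrative, not taken from the paper.

```python
# Hypothetical traversal-style conversion: flatten a structured
# (passage, question) input into one ordered token sequence.
# The <s>, <sep>, and <e> delimiter tokens are illustrative.

def to_sequence(passage, question, start="<s>", delim="<sep>", end="<e>"):
    """Convert a (passage, question) pair into a single token sequence."""
    return [start] + passage.split() + [delim] + question.split() + [end]

seq = to_sequence("the torch traveled to Athens", "Where did it go?")
# ['<s>', 'the', 'torch', 'traveled', 'to', 'Athens',
#  '<sep>', 'Where', 'did', 'it', 'go?', '<e>']
```

The same trick generalizes to other structured tasks: multiple-choice answers or sentence pairs are concatenated in a fixed order with delimiters, and the language model processes the result like any other text.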

In Action

The OpenAI team benchmarked GPT-2 against different task-specific NLU models. Without any task-specific training, GPT-2 was able to achieve state-of-the-art performance across many NLU tasks. Let’s look at a couple of impressive results.

Text Generation

In this task, GPT-2 was presented with a text input and was able to generate a synthetic text output that matched the written style and coherence of the input. The results were impressive, as shown in the following example:


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez.

Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them — they were so close they could touch their horns.

While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, “We can see, for example, that they have a common ‘language,’ something like a dialect or dialectic.”

Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals were believed to be descendants of a lost race of people who lived there before the arrival of humans in those parts of South America.

While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, “In South America, such incidents seem to be quite common.”

However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through DNA. “But they seem to be able to communicate in English quite well, which I believe is a sign of evolution, or at least a change in social organization,” said the scientist.


Question Answering

In this task, GPT-2 needed to answer arbitrary questions about a given passage. Let’s look at one example:


The 2008 Summer Olympics torch relay was run from March 24 until August 8, 2008, prior to the 2008 Summer Olympics, with the theme of “one world, one dream”. Plans for the relay were announced on April 26, 2007, in Beijing, China. The relay, also called by the organizers as the “Journey of Harmony”, lasted 129 days and carried the torch 137,000 km (85,000 mi) — the longest distance of any Olympic torch relay since the tradition was started ahead of the 1936 Summer Olympics.

After being lit at the birthplace of the Olympic Games in Olympia, Greece on March 24, the torch traveled to the Panathinaiko Stadium in Athens, and then to Beijing, arriving on March 31. From Beijing, the torch was following a route passing through six continents. The torch has visited cities along the Silk Road, symbolizing ancient links between China and the rest of the world. The relay also included an ascent with the flame to the top of Mount Everest on the border of Nepal and Tibet, China from the Chinese side, which was closed specially for the event.

Q: What was the theme?
 A: “one world, one dream”.

Q: What was the length of the race?
 A: 137,000 km

Q: Was it larger than previous ones?
 A: No

Q: Where did the race begin?
 A: Olympia, Greece

Q: Is there anything notable about that place?
 A: birthplace of Olympic Games

Q: Where did they go after?
 A: Athens

Q: How many days was the race?
 A: seven

Q: Did they visit any notable landmarks?
 A: Panathinaiko Stadium

Q: And did they climb any mountains?


The results shown by GPT-2 demonstrate that it is possible for a single NLU model to be adapted to different tasks. However, that fact also has important societal implications, as bad actors could use such models for malicious purposes such as fake news generation or spam generation. For this reason, OpenAI decided to open source only a small version of GPT-2 until more research can be done in this area. Regardless, the principles behind GPT-2 are an important milestone in NLU research.