When machines write Wikipedia

Deep learning paper explained simply.

If you have never liked writing summaries, then good for you — machines are on their way to take over this task.

To teach a model how to summarise articles, we can use a supervised learning framework. To obtain model inputs and outputs for training, we can (as always) turn to Wikipedia for help.

A Wikipedia article contains references and embedded links to other documents (called source documents), which elaborate on concepts discussed in the article. Therefore, we can treat a Wikipedia article as a summary of all these source documents. To create a summariser, we will feed our model a Wikipedia article along with all its source documents, and gradually train it to spit out Wikipedia articles.

A typical Wikipedia article

The above is a daunting task, as source documents can be very long. Modelling long-range dependencies has always been a nightmare in Natural Language Processing, but it is required in summarisation tasks. When we summarise an article, it is often important to connect pieces of information separated by many words.

“But I don’t want to go among mad people,” Alice remarked.
“Oh, you can’t help that,” said the Cat: “we’re all mad here. I’m mad. You’re mad.”
“How do you know I’m mad?” said Alice.
“You must be,” said the Cat, “or you wouldn’t have come here.”

Quote by Lewis Carroll from Alice’s Adventures in Wonderland. The word “mad” is not mentioned in the final line, but we still know the Cat concluded that Alice must be mad. This is because we remember information from previous lines.

We will be approaching this summarisation task by using a tweaked version of the Transformer model, which has been successful in language translation.

The Transformer

The Transformer model includes an encoder and a decoder that model the interactions between words in a sentence. For each word, the encoder “consults” all other words in the sentence to generate a new representation for the word. Next, the decoder combines the new representation of the word with its previous outputs to form an output representation of the word.

Encoder and decoder in action, using the Red Queen’s favourite quote. Dark purple circles are outputs of the decoder. They are formed from outputs of the encoder (light purple circles) and previous decoder outputs. Arrows for only the first two words are shown for simplicity.

The above figure shows one layer of an encoder and decoder. In the Transformer, both the encoder and decoder have six identical layers.

The underlying intuition of the encoder and decoder is that there exist relationships between words in a sentence. However, not all relationships between words are equal. For example, the encoder output of “heads” is obtained by considering other words as inputs. Is every input equally important in creating the encoder output of “heads”? This question prompts the idea of attention.

Pay attention!

Whenever we talk about attention, we can think of it as the interactions between three components: (1) query word, (2) key word and (3) value.

In our example, the query word is “heads”. The other words in the sentence are the key words. We compare each key word to the query word, so that each key word is assigned a weight for its value.

Three main components of attention: Query, Key and Value.

This weight quantifies how important a key is to a query. Since queries and keys are nothing but representations of words, they are vectors (stacked into matrices when we handle many words at once). The similarity between two vectors is measured using the dot product, hence:

Mathematically speaking, this is what attention is
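
For reference, the scaled dot-product attention from the original Transformer paper can be written as below. Here Q, K and V stack the query, key and value vectors as rows, and d_k is the length of a key vector. The division by the square root of d_k is a scaling detail that this article does not otherwise discuss.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```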

We say the output of attention is a weighted sum of values. Note that “heads” is not always the query word, as each word in the sentence will take a turn to be the query word.
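
To make the weighted sum concrete, here is a minimal NumPy sketch of dot-product attention. The word vectors are random stand-ins, and using the same matrix for queries, keys and values is a simplification: a real Transformer first passes the word vectors through learned projections. The function names are ours, not from any library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    # Dot products between every query and every key, scaled by sqrt(d_k).
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Each row of weights says how much one query attends to each key.
    weights = softmax(scores, axis=-1)
    # The output for each query is a weighted sum of the values.
    return weights @ V, weights

words = ["Off", "with", "their", "heads"]
rng = np.random.default_rng(0)
X = rng.normal(size=(len(words), 8))  # made-up 8-dimensional word vectors

# Every word takes a turn as the query: row i of the output belongs to words[i].
output, weights = attention(X, X, X)
print(weights.round(2))
```

Each row of `weights` sums to 1, so every output row is a blend of the value vectors in proportions set by how similar the keys are to that row's query.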


Now that we have explained how a Transformer works, we will try to adapt it to our summarisation task. The encoder-decoder model from the Transformer works well for language translation tasks where input and output words are of two different languages. Since our task is a monolingual summarisation problem, information contained in the encoder and decoder may be redundant. Therefore, we will drop the encoder module from our model.

We will instead combine our input and output sentence into a single sequence. We put the first L words from our source document (input) and words from our Wiki article (output) into an array.

For example, if we want to map the sentence “Off with their heads!” to “Red Queen”, we will have:

The model has to predict the next input given previous inputs. Compare this to the decoder in the original transformer model.
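
The single-sequence setup can be sketched in a few lines of plain Python. The `<sep>` and `<eos>` token names and both helper functions are hypothetical, chosen only for illustration; the idea is simply that the input and output are joined, and every position becomes a "predict the next word" training example.

```python
SEP = "<sep>"  # hypothetical marker between source document and article
EOS = "<eos>"  # hypothetical end-of-sequence marker

def build_sequence(source_words, article_words, L):
    # Keep only the first L source words, then append the target article.
    return source_words[:L] + [SEP] + article_words + [EOS]

def next_word_pairs(sequence):
    # For each position, the context is everything before it
    # and the target is the word at that position.
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]

source = ["Off", "with", "their", "heads", "!"]
article = ["Red", "Queen"]

seq = build_sequence(source, article, L=5)
pairs = next_word_pairs(seq)
for context, target in pairs:
    print(context, "->", target)
```

Once the model has read the source words and the separator, the words it predicts next are the summary.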

You might have noticed that we are predicting a word based on the words before it. This means when our query word is “their”, our keys will only be the words up to that point: “Off”, “with” and “their” itself. Therefore, masking is added to our attention architecture so that words after “their” cannot be keys. In other words, they will not receive any attention, which stops the model from peeking at the words it is supposed to predict.
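
Masking is commonly implemented by setting the scores of future positions to negative infinity before the softmax, so their weights come out as exactly zero. The sketch below assumes the usual convention in which a word may attend to itself and to earlier words; the vectors are random placeholders and the function name is ours.

```python
import numpy as np

def masked_attention_weights(Q, K):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True above the diagonal: key position j lies after query position i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Softmax row by row; exp(-inf) is 0, so future words get no attention.
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

words = ["Off", "with", "their", "heads"]
rng = np.random.default_rng(0)
X = rng.normal(size=(len(words), 8))  # made-up word vectors

W = masked_attention_weights(X, X)
print(W.round(2))  # upper triangle is all zeros
```

The row for “their” has non-zero weights only in its first three columns, so “heads” contributes nothing to it.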

Long, long sentences

Our model inputs are source documents, which can have many, many words. Just imagine performing attention over every word in a source document – our model will need to remember a key and a value for every single word. This can be dealt with by using memory compressed attention, which relies on a one-dimensional convolution.

How does the idea of convolution come in? While performing attention, we quantify the relationship between query and key words by finding the dot product between their vectors. We can think of the query vector as our kernel/filter sliding along the key vectors.

Memory compressed attention goes one step further and applies a strided convolution to the keys and values themselves: the filter moves several positions at a time, so neighbouring keys and values are merged into one. This reduces the number of key-value pairs our model has to remember.

Image from Stanford CS231n: 1-dimensional convolution with stride size = 3.
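
Here is a rough NumPy sketch of the idea: shrink the keys and values with a strided one-dimensional convolution, then attend over the shorter sequence. The stride of 3 matches the figure above; the kernel width of 3, the random (untrained) kernels and all function names are assumptions made for illustration. Masking is left out to keep the sketch short.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def strided_conv1d(X, kernel, stride):
    # X: (n, d) sequence of vectors; kernel: (k, d, d_out).
    # Each output vector summarises a window of k consecutive inputs.
    k = kernel.shape[0]
    n_out = (X.shape[0] - k) // stride + 1
    windows = [X[i * stride : i * stride + k] for i in range(n_out)]
    return np.stack([np.einsum("kd,kde->e", w, kernel) for w in windows])

def memory_compressed_attention(Q, K, V, kernel_k, kernel_v, stride=3):
    # Compress keys and values; the queries keep their full length.
    K_small = strided_conv1d(K, kernel_k, stride)
    V_small = strided_conv1d(V, kernel_v, stride)
    scores = Q @ K_small.T / np.sqrt(K_small.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ V_small, weights

rng = np.random.default_rng(0)
n_words, d = 12, 8
X = rng.normal(size=(n_words, d))            # made-up word vectors
kernel_k = rng.normal(size=(3, d, d)) * 0.1  # learned in a real model
kernel_v = rng.normal(size=(3, d, d)) * 0.1

output, weights = memory_compressed_attention(X, X, X, kernel_k, kernel_v)
print(weights.shape)  # (12, 4)
```

With 12 words and a stride of 3, each query compares itself against 4 compressed keys instead of 12, while the output still has one row per word.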

In a nutshell…

If you slept through the entire article, here are the main takeaways:

  1. We frame a summarisation problem as a Wikipedia-generation problem.
  2. The underlying intuition of the encoder and decoder is that there exist relationships between words in a sentence. However, not all these relationships are equal. Hence, we perform attention for each word.
  3. To create a Wikipedia-generating model, we use the Transformer model that has been successful in language translation tasks. We change the model by dropping the encoder module, and by adding masking and memory compressed attention.

Additional References

  1. The above article is based on this research paper.
  2. For a detailed explanation on the Transformer, and a good animation of the encoder and decoder in action, see here.

Rowen is a research fellow at Nurture.ai. She believes the barrier to understanding powerful knowledge is convoluted language and excessive use of jargon. Her aim is to break down difficult concepts so they are easily digestible.

When machines write Wikipedia was originally published in Nurture.AI on Medium.

Source: Deep Learning on Medium