Abstractive Text Summarization with NLP

Original article was published on Artificial Intelligence on Medium

Sequential Networks

New network architectures were discovered a few decades ago to deal with sequential data.


A recurrent neural network (RNN) is a type of network whose layers are used recurrently, or repeatedly: the same layer, with the same parameters, is applied at every time step. The network takes in one part of the sequence at each time step and performs some calculation on it. Specifically, at each time step, it uses the previous time step’s hidden layer and a new part of the input sequence to make a new output. This output is then passed to the next time step, along with the next part of the sequence.

For some time step, say 2, it takes in the vectors from the previous hidden layer (hidden 1) and the current input (input 2) to make a hidden result (hidden 2) and an output (output 2). The hidden result and output are the same vector. The operation inside the black box (hidden 2) is just a dot product followed by an activation function.

For each hidden layer, the weights and bias are the same. Hidden 1, 2, and 3 all use the same parameters, so we can train this for any sequence length and keep reusing the layers. The only difference between the hidden layers is that each receives different inputs, namely, the previous hidden layer and the current part of the input sequence. The first hidden layer usually receives a vector of zeros as the hidden layer input.
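A single time step can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the names `rnn_step`, `W_h`, `W_x`, and `b`, the sizes, and the weight values are all made up, and tanh is assumed as the activation function.

```python
import math

def rnn_step(prev_hidden, x, W_h, W_x, b):
    """One recurrent step: tanh(W_h . prev_hidden + W_x . x + b)."""
    new_hidden = []
    for row_h, row_x, bias in zip(W_h, W_x, b):
        total = bias
        total += sum(w * h for w, h in zip(row_h, prev_hidden))  # previous hidden layer
        total += sum(w * v for w, v in zip(row_x, x))            # current input
        new_hidden.append(math.tanh(total))                      # activation function
    return new_hidden

# Hypothetical parameters for a hidden size of 2 and an input size of 3.
W_h = [[0.5, -0.3], [0.1, 0.8]]
W_x = [[0.2, 0.0, -0.1], [0.4, 0.3, 0.0]]
b = [0.0, 0.1]

hidden_0 = [0.0, 0.0]  # the first step receives a vector of zeros
hidden_1 = rnn_step(hidden_0, [1.0, 0.0, 0.0], W_h, W_x, b)
hidden_2 = rnn_step(hidden_1, [0.0, 1.0, 0.0], W_h, W_x, b)  # same parameters reused
```

Note that `hidden_1` and `hidden_2` are computed by the same function with the same parameters; only the inputs differ.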

Many outputs are created, and in different applications, we can choose whether to use them or not. Note that the output for a given time step is exactly the same vector that is fed to the next time step as input. If we change the direction of the picture slightly, it is actually very similar to a normal neural network.

The difference between a normal neural network and a recurrent neural network is that new inputs are constantly being fed in, we sometimes use the outputs of the hidden layers, and the layers are all the same.
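To make the "sometimes use the outputs" point concrete, here is a minimal sketch that runs one recurrent layer over a whole sequence and then either keeps every output or only the last one. The function name `run_rnn`, the parameter values, and the tanh activation are assumptions made for illustration.

```python
import math

def run_rnn(inputs, W_h, W_x, b):
    """Apply the same recurrent layer to every element of a sequence."""
    hidden = [0.0] * len(b)  # the first step receives a vector of zeros
    outputs = []
    for x in inputs:
        # The right-hand side is computed from the previous `hidden`
        # before the name is rebound to the new vector.
        hidden = [
            math.tanh(
                bias
                + sum(w * h for w, h in zip(row_h, hidden))
                + sum(w * v for w, v in zip(row_x, x))
            )
            for row_h, row_x, bias in zip(W_h, W_x, b)
        ]
        outputs.append(hidden)
    return outputs

# Hypothetical parameters: hidden size 2, input size 2.
W_h = [[0.5, -0.3], [0.1, 0.8]]
W_x = [[0.2, -0.1], [0.4, 0.3]]
b = [0.0, 0.1]

short_sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
long_sequence = short_sequence * 4  # same parameters, longer input

every_output = run_rnn(short_sequence, W_h, W_x, b)    # one output per time step
last_output = run_rnn(long_sequence, W_h, W_x, b)[-1]  # only the final output
```

Keeping every output suits tasks that need one result per time step, while keeping only the last output suits tasks that need one result for the whole sequence.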


The issue with recurrent neural networks is that they have a hard time remembering information over a long period of time. The information a network may want to remember mixes with new information at each time step and becomes very diluted.

Since the previous hidden layer is only half of the input, the proportion of the previous information becomes exponentially smaller as time steps pass.

This animation, by Michael Phi, explains this concept very well:

The first input word “word” becomes exponentially smaller through time.
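The dilution can be illustrated with some simple arithmetic. The sketch below assumes a deliberately simplified model in which each new hidden state is an even blend of the previous hidden state and the new input; a real RNN uses learned weights and a nonlinearity, so the numbers are only indicative. The function name `first_input_share` and the `keep` parameter are hypothetical.

```python
def first_input_share(num_steps, keep=0.5):
    """Share of the first input left in the hidden state after num_steps steps.

    Assumes each step keeps a fraction `keep` of the previous hidden state
    and fills the rest with the new input.
    """
    share = 1.0 - keep  # step 1 mixes the first input into the hidden state
    for _ in range(num_steps - 1):
        share *= keep   # every later step keeps only part of what was there
    return share

for step in (1, 2, 5, 10):
    print(step, first_input_share(step))
```

Under this assumption the share halves at every step, so after ten steps less than 0.1% of the hidden state comes from the first input.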

The long short-term memory (LSTM) network is a type of recurrent neural network that has the added ability to choose what is important to remember and what it should forget. Therefore, it is useful for both long-term and short-term memory.

An LSTM hidden layer takes the previous memory cell (c_{t-1}), the previous hidden layer output (h_{t-1}), and the input (x_t), and outputs the hidden layer output (h_t) and a new memory cell (c_t). Source: Guillaume Chevalier, own work, CC BY 4.0, on Wikipedia

The difference between the RNN and the LSTM is the memory cell. This is where important memory is stored for a long period of time. The memory cell is a vector that has the same dimension as the hidden layer’s output.

Unlike the hidden layer, which changes at each time step, the memory cell changes only under very strict rules.

The left side of the LSTM layer represents changes to the memory cell.
  1. First, the previous hidden layer and the current input are passed to a layer with a sigmoid activation function, which determines how much of its existing value the memory cell should forget.
  2. Second, the previous hidden layer and the current input are passed to a layer with a hyperbolic tangent (tanh) activation function, which determines new candidates for the memory cell.
  3. Finally, the previous hidden layer and the current input are passed to a layer with a sigmoid activation function, which determines how much of each candidate is integrated into the memory cell.

Note that the layers that decide what to forget and what to add are sigmoid layers, which output a number between 0 and 1. Since sigmoid is capable of outputting numbers very close to 0 and 1, it is very possible that memory is completely replaced.

Also note that the candidates are decided using the tanh function, which outputs a number between -1 and 1.
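The memory cell update described above can be sketched element by element. In this illustration the values fed into the sigmoid and tanh functions are supplied directly as made-up numbers; in a real LSTM they would come from layers applied to the previous hidden layer and the current input. The function and argument names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def update_memory_cell(prev_cell, forget_pre, input_pre, candidate_pre):
    """Return the new memory cell: forget * old value + add * candidate."""
    new_cell = []
    for c, f_z, i_z, g_z in zip(prev_cell, forget_pre, input_pre, candidate_pre):
        forget = sigmoid(f_z)       # near 0 forgets the old value, near 1 keeps it
        add = sigmoid(i_z)          # how much of the candidate to integrate
        candidate = math.tanh(g_z)  # new candidate value between -1 and 1
        new_cell.append(forget * c + add * candidate)
    return new_cell

prev_cell = [0.9, 0.9]
new_cell = update_memory_cell(
    prev_cell,
    forget_pre=[10.0, -10.0],    # entry 0: keep the memory, entry 1: forget it
    input_pre=[-10.0, 10.0],     # entry 0: ignore the candidate, entry 1: accept it
    candidate_pre=[2.0, -2.0],
)
```

With these extreme values, the first entry of `new_cell` stays very close to its old value of 0.9, while the second entry is almost completely replaced by its candidate, `tanh(-2.0)`.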

The right side of the LSTM layer represents how the memory cell changes the output of the whole layer.

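After changes are made to the memory cell, the memory cell shapes the final hidden layer output: the previous hidden layer and the current input are passed to one more sigmoid layer, which decides how much of the updated memory cell, squashed by tanh, is passed on as the output.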
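Putting the pieces together, a complete LSTM step might look like the sketch below. It follows the standard LSTM formulation; the names `lstm_step`, `dense`, `constant_layer`, and the parameter keys are hypothetical, and the weights are arbitrary constants rather than trained values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(weights, bias, vector):
    """Weighted sum of `vector` for each row of `weights`, plus the bias."""
    return [b + sum(w * v for w, v in zip(row, vector)) for row, b in zip(weights, bias)]

def lstm_step(prev_hidden, prev_cell, x, params):
    combined = prev_hidden + x  # every layer sees the previous hidden layer and the input
    forget = [sigmoid(z) for z in dense(params["W_f"], params["b_f"], combined)]
    add = [sigmoid(z) for z in dense(params["W_i"], params["b_i"], combined)]
    candidate = [math.tanh(z) for z in dense(params["W_g"], params["b_g"], combined)]
    output = [sigmoid(z) for z in dense(params["W_o"], params["b_o"], combined)]
    # Update the memory cell, then let it shape the hidden layer output.
    cell = [f * c + i * g for f, c, i, g in zip(forget, prev_cell, add, candidate)]
    hidden = [o * math.tanh(c) for o, c in zip(output, cell)]
    return hidden, cell

def constant_layer(value, rows=2, cols=4):
    """Hypothetical helper: a weight matrix filled with one value and a zero bias."""
    return [[value] * cols for _ in range(rows)], [0.0] * rows

# Hidden size 2 and input size 2, so each layer sees 4 numbers.
params = {}
for name, value in (("f", 0.3), ("i", -0.2), ("g", 0.5), ("o", 0.1)):
    params["W_" + name], params["b_" + name] = constant_layer(value)

hidden, cell = lstm_step([0.0, 0.0], [0.0, 0.0], [1.0, -0.5], params)
hidden, cell = lstm_step(hidden, cell, [0.2, 0.4], params)  # same parameters reused
```

The function returns both the hidden layer output and the memory cell, because both are passed on to the next time step.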

The LSTM network is proficient at holding on to long term information, since it can decide when to remember information, and when to forget it.

Computers Suck at English: They Are Only Good at Math

For a normal neural network to function, we must pass in some vectors as inputs, and expect some vectors as outputs. How can we do that when dealing with sequences of English text?

The answer, created in 2013 by Google, was an approach called Word2vec that, unsurprisingly, maps words to vectors. It comes in two variants. Continuous bag of words (CBOW) predicts a word from the words around it, so two words are similar if they appear in the same contexts. Skip-gram predicts the surrounding words from a word, so two words are similar if they generate the same contexts.
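The two Word2vec variants differ mainly in how training examples are built from text: continuous bag of words predicts a word from its surrounding words, and skip-gram predicts the surrounding words from a word. The sketch below only builds those examples and does not train anything. The function names, the example sentence, and the window size are assumptions made for illustration.

```python
def cbow_examples(tokens, window=2):
    """(context words, target word) pairs: predict the word from its context."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples

def skipgram_examples(tokens, window=2):
    """(word, context word) pairs: predict each context word from the word."""
    pairs = []
    for context, target in cbow_examples(tokens, window):
        pairs.extend((target, word) for word in context)
    return pairs

tokens = "the poodle chased the ball".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)
```

For example, with a window of 1, the CBOW example for "poodle" is `(["the", "chased"], "poodle")`, and skip-gram turns the same position into the pairs `("poodle", "the")` and `("poodle", "chased")`.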

The vectors of similar words, like “poodles” and “beagles”, would be very close together, and the vectors of different words, like “of” and “math”, would be far apart.
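One common way to measure how close two word vectors are is cosine similarity. The sketch below uses tiny three-dimensional vectors that are invented purely for illustration; real Word2vec vectors are learned from text and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented vectors, not real Word2vec output.
vectors = {
    "poodles": [0.9, 0.8, 0.1],
    "beagles": [0.8, 0.9, 0.2],
    "of": [-0.5, 0.1, 0.7],
    "math": [0.6, -0.7, 0.2],
}

similar = cosine_similarity(vectors["poodles"], vectors["beagles"])
different = cosine_similarity(vectors["of"], vectors["math"])
```

With these made-up vectors, `similar` is close to 1 and `different` is below 0, matching the intuition that similar words point in nearly the same direction.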

Another study, by Stanford University in 2014, proposed a similar approach called GloVe. It builds the vectors from how frequently words co-occur across the whole corpus, so words with similar co-occurrence patterns end up close together and words with different patterns end up far apart.

The mapping of words to vectors is called a word embedding. Word embeddings let us perform numerical operations on all kinds of text, such as comparison and arithmetic.
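As an illustration of arithmetic on word vectors, the sketch below uses the well-known "king - man + woman" analogy. The two-dimensional vectors are invented so that the arithmetic works out exactly; learned embeddings only show this behaviour approximately. All names in the code are hypothetical.

```python
import math

# Invented vectors: first number ~ "royalty", second number ~ "maleness".
embeddings = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "poodle": [0.0, 0.5],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def subtract(a, b):
    return [x - y for x, y in zip(a, b)]

def nearest_word(vector, exclude=()):
    """Word whose vector is closest (Euclidean distance) to `vector`."""
    candidates = [word for word in embeddings if word not in exclude]
    return min(candidates, key=lambda word: math.dist(embeddings[word], vector))

result = add(subtract(embeddings["king"], embeddings["man"]), embeddings["woman"])
answer = nearest_word(result, exclude=("king", "man", "woman"))
```

Here `answer` is `"queen"`: removing the "man" direction from "king" and adding the "woman" direction lands on the vector for "queen".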

Abstractive Text Summarizer

Combining the power of word embeddings and RNNs or LSTMs, we can transform a sequence of text just like a neural network transforms a vector.

  • To build a text summarizer, we first use word embeddings to map our input sequence words to a sequence of vectors.
  • Then, we use an autoencoder-like structure to capture the meaning of the passage. Two separate RNNs or LSTMs are trained: an encoder, which compresses the sequence into a single fixed-size representation, and a decoder, which expands that representation into a transformed sequence of vectors.
  • Lastly, we convert the sequence of vectors output by the decoder back into words using the word embeddings.
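The three steps above can be sketched as a toy pipeline. This is a sketch of the data flow only: the weights and embeddings are made up and nothing is trained, so the words it produces are meaningless. It also assumes plain RNNs rather than LSTMs, an encoded representation that is just the encoder's final hidden state, and a hidden size equal to the embedding size so that decoder outputs can be compared with word vectors directly. All names are hypothetical.

```python
import math

def rnn_step(hidden, x, W_h, W_x):
    return [
        math.tanh(sum(w * h for w, h in zip(row_h, hidden))
                  + sum(w * v for w, v in zip(row_x, x)))
        for row_h, row_x in zip(W_h, W_x)
    ]

def encode(vectors, W_h, W_x):
    """Read the whole input and return the final hidden state."""
    hidden = [0.0] * len(W_h)
    for vector in vectors:
        hidden = rnn_step(hidden, vector, W_h, W_x)
    return hidden

def decode(state, num_steps, W_h, W_x):
    """Unroll a second RNN from the encoded state, feeding back its own outputs."""
    hidden, outputs = state, []
    previous_output = [0.0] * len(state)  # start signal
    for _ in range(num_steps):
        hidden = rnn_step(hidden, previous_output, W_h, W_x)
        outputs.append(hidden)
        previous_output = hidden
    return outputs

def nearest_word(vector, embeddings):
    return min(embeddings, key=lambda word: math.dist(embeddings[word], vector))

def summarize(words, embeddings, encoder, decoder, summary_length):
    vectors = [embeddings[word] for word in words]          # step 1: words to vectors
    state = encode(vectors, *encoder)                       # step 2a: encode
    decoded = decode(state, summary_length, *decoder)       # step 2b: decode
    return [nearest_word(v, embeddings) for v in decoded]   # step 3: vectors to words

# Invented embeddings and untrained (W_h, W_x) weight pairs.
embeddings = {"dogs": [0.9, 0.1], "bark": [0.1, 0.9], "loudly": [0.5, 0.5], "often": [-0.5, 0.2]}
encoder = ([[0.5, 0.1], [-0.2, 0.4]], [[0.3, 0.7], [0.6, -0.1]])
decoder = ([[0.2, -0.4], [0.5, 0.3]], [[0.1, 0.2], [-0.3, 0.6]])

summary = summarize(["dogs", "bark", "loudly", "often"], embeddings, encoder, decoder, summary_length=2)
```

In a real system, the encoder and decoder weights would be learned from labelled examples, and the decoder would usually decide when to stop rather than run for a fixed number of steps.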

This method generalizes to transforming any sequence of text into another sequence of text. To build an abstractive text summarizer, we would give the model labelled examples in which the correct output is a summary. If we wanted to build a translator, however, we would label each training example with the translated text instead of the summary.
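In terms of data, the only thing that changes between the two tasks is the label attached to each example. A minimal sketch, with invented example texts and hypothetical names, might look like this:

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    source: str  # text the model reads
    target: str  # text the model should produce

# Invented examples for illustration only.
summarization_examples = [
    TrainingExample(
        source="The city council met on Tuesday and, after a long debate, voted to build a new library.",
        target="Council approves new library.",
    ),
]
translation_examples = [
    TrainingExample(source="The cat sleeps.", target="Le chat dort."),
]

def to_token_pairs(examples):
    """Split each example into (input words, output words) for a sequence model."""
    return [(example.source.split(), example.target.split()) for example in examples]

summarization_pairs = to_token_pairs(summarization_examples)
translation_pairs = to_token_pairs(translation_examples)
```

Both datasets have the same shape, so the same encoder and decoder structure could be trained on either one.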


Summary

  • A good text summarizer would improve productivity in all fields, and would be able to transform large amounts of text data into something readable by humans.
  • RNNs are similar to normal neural networks, except that they reuse their hidden layers and receive a new part of the input sequence at each time step.
  • LSTMs are special RNNs that are able to store memory for long periods of time by using a memory cell, which can remember or forget information whenever necessary.
  • Word embeddings capture the general usage of a word based on its context and frequency, allowing us to perform math on words.
  • An abstractive text summarizer would use an encoder and a decoder, surrounded by word embedding layers.
  • This summary was definitely not generated by a computer. This is because computers do not understand sarcasm.