DeepLearning series: Sequence Models

This blog will cover the different architectures for Recurrent Neural Networks (RNNs), language models, and sequence generation. I will go over the details of Gated Recurrent Units (GRU) and Long Short-Term Memory units (LSTM), two building blocks of sequence model architectures.

There are various sequence models in the deep learning domain. Before we jump into those, let’s see the applications we can apply these models to.

  • Speech recognition: Input: audio clip -> Output: text
  • Music generation: Input: integer referring to genre (or an empty set) -> Output: music
  • Sentiment classification: Input: text -> Output: ratings
  • DNA sequence analysis: Input: DNA (alphabet) -> Output: label part of the DNA sequence
  • Machine translation: Input: text -> Output: text translation
  • Video activity recognition: Input: video frames -> Output: identification of the activity
  • Named entity recognition: Input: sentence -> Output: the people named within it

Let’s start with some notation that will help us throughout this blog. I’ll use named entity recognition as the example application.

To represent words in a sentence, we come up with a vocabulary (dictionary) that lists all the words and assigns a sequential number to each one. You can find ready-made vocabularies online, some of which contain 100,000 words. If a word is not in the vocabulary, you can assign it to the <UNK> (“unknown”) token. Finally, we represent each word as a one-hot vector: all zeros except for a single one at the position of the word in the vocabulary list.
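
As a minimal sketch of this encoding, assuming a tiny made-up vocabulary (the words, the `word_to_index` mapping and the `one_hot` helper are all hypothetical names chosen for illustration):

```python
import numpy as np

# Toy vocabulary (hypothetical); a real one would hold many thousands of words.
vocab = ["<UNK>", "alice", "and", "bob", "met"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot vector for `word`, falling back to <UNK> if it is unknown."""
    index = word_to_index.get(word.lower(), word_to_index["<UNK>"])
    vector = np.zeros(len(vocab))
    vector[index] = 1.0
    return vector

sentence = "Alice met Carol".split()
x = [one_hot(word) for word in sentence]  # x[0] plays the role of x<1>
```

Here “Carol” is not in the toy vocabulary, so its vector has the one at the <UNK> position.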

RNN (Recurrent Neural Network) model:

To learn the mapping from X to Y, we might use a standard neural network, where we feed the x<1>, x<2>, … x<t> to obtain y<1>, y<2>, … y<t>.

This doesn’t work well. A couple of problems are present:

  • The inputs and outputs for the different examples can be of different lengths. Each input is a sentence, so it’s fair to imagine that most of the training examples are sentences of different length.
  • The network doesn’t share features learned from different positions of the text. It doesn’t generalize well.

Well, then I guess it’s time to introduce the RNN and explain why it works so well for applications dealing with sequence data. Here’s the architecture:

The RNN scans through the data from left to right, and the parameters applied to the input at each time step (Wax) are shared. The horizontal connections are governed by the Waa parameters, which are also the same for every time step. The Wya parameters govern the output predictions.

One very significant characteristic is that when making a prediction ŷ, the network uses information not only from the corresponding input x, but also from all the previous ones. For example, for prediction ŷ<3> it gets information not only from x<3>, but also from x<1> and x<2>.

The main weakness of this architecture is that it only uses information coming from earlier in the sequence and not anything that comes after. We’ll soon see that there’s a network for that: bidirectional RNN (BRNN).

Let’s see how the forward propagation looks for this network, so that we can familiarize ourselves with the architecture.

We can simplify the last two equations as such:
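
In code, the forward pass and its simplified form can be sketched as follows. This is only an illustration: the tanh hidden activation, the softmax output, the toy sizes and the random weights are assumptions on my part, and `Wa` is the matrix obtained by placing Waa and Wax side by side so that it can multiply the stacked vector [a<t-1>, x<t>].

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_a, n_y = 5, 4, 5  # input, hidden and output sizes (arbitrary toy values)

Wax = rng.normal(size=(n_a, n_x))
Waa = rng.normal(size=(n_a, n_a))
Wya = rng.normal(size=(n_y, n_a))
ba = np.zeros(n_a)
by = np.zeros(n_y)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(a_prev, x_t):
    """a<t> = tanh(Waa a<t-1> + Wax x<t> + ba); y_hat<t> = softmax(Wya a<t> + by)."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    return a_t, softmax(Wya @ a_t + by)

# Simplified form: one matrix Wa = [Waa | Wax] applied to the stacked vector.
Wa = np.hstack([Waa, Wax])

def rnn_step_stacked(a_prev, x_t):
    a_t = np.tanh(Wa @ np.concatenate([a_prev, x_t]) + ba)
    return a_t, softmax(Wya @ a_t + by)

def rnn_forward(xs):
    a = np.zeros(n_a)  # a<0> is initialised to zeros
    outputs = []
    for x_t in xs:     # the same weights are reused at every time step
        a, y_hat = rnn_step(a, x_t)
        outputs.append(y_hat)
    return outputs

xs = [np.eye(n_x)[i] for i in (0, 3, 1)]  # three one-hot inputs
y_hats = rnn_forward(xs)
```

Both step functions compute the same values; the stacked version simply has fewer symbols to carry around.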

For each time step of the network, we will calculate the loss, and then sum the losses up to obtain the entire loss for the sequence.
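
A minimal sketch of that sum, assuming a cross-entropy loss at each time step (the text does not name the loss, so this is my assumption) and hypothetical softmax outputs:

```python
import numpy as np

def sequence_loss(y_hats, targets):
    """Sum over time steps of the cross-entropy loss -log(y_hat<t>[target<t>])."""
    return sum(-np.log(y_hat[t]) for y_hat, t in zip(y_hats, targets))

# Hypothetical predictions for a 2-step sequence over a 3-word vocabulary.
y_hats = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
targets = [0, 1]  # index of the correct word at each step
loss = sequence_loss(y_hats, targets)  # equals -log(0.7) - log(0.8)
```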

After that, we can run backpropagation, which in this network is called “backpropagation through time”, as we run back through the sequence. Yup, that’s right, like Marty McFly in “Back to the Future”! Sorry, I got too excited.

I mentioned earlier that there are different types of RNNs. What we have seen so far is the “many-to-many” architecture where Tx = Ty.

For an application such as sentiment classification, we read a whole sentence but produce a single output, so Ty = 1. Our RNN is then of the “many-to-one” type, and the architecture is:

On the other hand, for an application such as music generation, we have a “one-to-many” architecture, as the input can be an integer related to a genre, while the output is a piece of music:

In other applications, when the input sequence length differs from the output sequence length (Tx ≠ Ty), the architecture of the RNN is still “many-to-many”, but the whole input is read before the output is produced. Think about machine translation.

All right, now that we have seen the RNN models, it’s time to put them into practice!

One of the most essential tasks in Natural Language Processing is “Language modeling”. Let’s see how this is built.

Language model and sequence generation:

This model is also used in speech recognition, where the machine listens to what a human says and predicts the correct sentence, based on the probability of one sentence versus the other.

So, I gave it away already: what a language model does is estimate the probability of a particular sequence of words.

These are the steps we take to build the model:

  • Take a training set: a large corpus of English (or whichever language suits you) text.
  • Tokenize the input sentences. (This is what we did before when we assigned a token (a number) to each word from the vocabulary.) Remember, if a word does not exist within the dictionary we can always replace it with the <UNK> token.
  • Map each token to its one-hot vector.
  • Model when the sentence ends by adding an extra token called <EOS>.
  • Build an RNN to model the chance of these different sequences.
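
The preprocessing steps above can be sketched as follows. The whitespace tokenizer, the toy vocabulary and the `tokenize` helper are simplifying assumptions made for this illustration; a real pipeline would also handle punctuation.

```python
def tokenize(sentence, word_to_index):
    """Lower-case, split on whitespace, append <EOS>, and map words to indices."""
    words = sentence.lower().split() + ["<EOS>"]
    unk = word_to_index["<UNK>"]
    return [word_to_index.get(word, unk) for word in words]

# Toy vocabulary (hypothetical).
vocab = ["<UNK>", "<EOS>", "average", "cats", "day", "hours", "of", "per", "sleep"]
word_to_index = {word: i for i, word in enumerate(vocab)}

y = tokenize("Cats average 15 hours of sleep per day", word_to_index)

# During training, the input at step t is the correct word from step t-1.
# None stands in for the zero vector fed at the very first step.
x = [None] + y[:-1]
```

The word “15” is not in the toy vocabulary, so it is mapped to <UNK>, and the sequence ends with the index of <EOS>.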

These steps should be straightforward, except for the last bullet point, which I am going to explain in detail. Let’s start with the model architecture:

As you can see, a<1> makes a softmax prediction to estimate the probability of each candidate for the first word, ŷ<1>. Then, in the second step, a<2> makes a softmax prediction given the correct first word (y<1>). Each of the following steps makes its prediction given the correct words that come before it.

After training, the model can predict, given an initial set of words, the probability of the next word. So, given a new sentence (y<1>, y<2>, y<3>), it can tell the probability of this sentence:
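
By the chain rule of probability, P(y<1>, y<2>, y<3>) = P(y<1>) · P(y<2> | y<1>) · P(y<3> | y<1>, y<2>), and each factor is read off the softmax output of the corresponding step. A small sketch with made-up softmax outputs (the numbers are hypothetical, not the output of a trained model):

```python
import math

def sentence_log_prob(step_distributions, word_indices):
    """log P(y<1..T>) = sum over t of log P(y<t> | y<1>, ..., y<t-1>)."""
    return sum(math.log(dist[idx])
               for dist, idx in zip(step_distributions, word_indices))

# Hypothetical softmax outputs over a 3-word vocabulary at three time steps.
step_distributions = [
    [0.5, 0.3, 0.2],  # P(y<1>)
    [0.1, 0.6, 0.3],  # P(y<2> | y<1>)
    [0.2, 0.2, 0.6],  # P(y<3> | y<1>, y<2>)
]
sentence = [0, 1, 2]  # indices of the words actually observed
prob = math.exp(sentence_log_prob(step_distributions, sentence))  # 0.5 * 0.6 * 0.6
```

Summing log-probabilities instead of multiplying probabilities avoids numerical underflow on long sentences.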

Now that we have trained our RNN we can use it to generate novel sequences.

Generation uses a slight variation of the model above, in which each word is sampled, so that we obtain novel sequences. Essentially, the input at each step, instead of being the correct y from the previous step, is a word randomly sampled from the previous step’s softmax distribution. You can see what I mean by looking at the architecture below:
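
Here is a rough sketch of that sampling loop. The weights are random and untrained, so the output is gibberish; the toy vocabulary, the sizes and the `sample_sequence` helper are assumptions that only illustrate the control flow.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<EOS>", "<UNK>", "a", "cat", "sat", "the"]
n_v, n_a = len(vocab), 8

# Untrained random weights, used only to show how sampling feeds back.
Wax = rng.normal(scale=0.1, size=(n_a, n_v))
Waa = rng.normal(scale=0.1, size=(n_a, n_a))
Wya = rng.normal(scale=0.1, size=(n_v, n_a))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_sequence(max_len=10):
    a = np.zeros(n_a)
    x = np.zeros(n_v)                # the first input is the zero vector
    words = []
    for _ in range(max_len):
        a = np.tanh(Waa @ a + Wax @ x)
        p = softmax(Wya @ a)
        idx = rng.choice(n_v, p=p)   # sample from the distribution, no argmax
        if vocab[idx] == "<EOS>":    # stop when the model emits end-of-sentence
            break
        words.append(vocab[idx])
        x = np.zeros(n_v)
        x[idx] = 1.0                 # the sampled word becomes the next input
    return words

generated = sample_sequence()
```

The `max_len` cap is a practical safeguard in case <EOS> is never sampled.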

A fun thing to do is, for example, to train a network on a Shakespearean text and then use sampling to generate a novel sentence “inspired” by Shakespeare. I know Shakespeare would be proud!

So far we have built RNNs on a word level, meaning the vocabulary is composed of words. We can also build a character-level RNN, where the vocabulary consists of the individual characters of the alphabet.

One of the advantages of this method is that we never encounter an unknown word. On the other hand, a disadvantage is the higher computational cost of training such a network, as it deals with much longer sequences. Furthermore, character-level models are not so good at capturing long-range dependencies, meaning how the earlier part of the sentence affects the later part.

Vanishing/exploding gradients

Throughout the previous examples, you might have noticed that the output ŷ was mainly influenced by the values in the sequence close to it. On the other hand, there are situations when some sentences have long dependencies, meaning some words within the sentence are related to other ones much earlier in the sequence.

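Think about a sentence where you have a subject, followed by many words, and then finally the verb, which depends on the earlier subject. For example, in “The cat, which already ate a lot of food, was full”, the verb “was” has to agree with “cat”, many words back.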

Basic RNNs are not good at capturing these long-term dependencies.

It’s like what we have seen in a very deep neural network, where the gradient has a difficult time propagating back to affect the weights of the earlier layers.

Exploding gradients in an RNN are rare, but when they happen, they can be catastrophic, as parameters get very large. On the bright side, this makes them fairly easy to spot and to fix.

The solution is to apply “gradient clipping”.

In practice, you look at the gradient vectors, and if they grow beyond a set threshold, you rescale them so that they stay within it.
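
One common way to do the rescaling is to clip by the overall norm of the gradients; this is a sketch of that variant, not necessarily the one used in the lectures (element-wise clipping to a fixed range is another option). The `clip_by_global_norm` helper and the numbers are hypothetical.

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale all gradients together if their combined L2 norm exceeds threshold."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm <= threshold:
        return grads
    scale = threshold / total_norm
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # combined norm is 13
clipped = clip_by_global_norm(grads, threshold=5.0)
```

Scaling every gradient by the same factor keeps the direction of the update and only shortens the step.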

Let’s focus on the more difficult problem: vanishing gradients, where the network “forgets” what happened earlier and does not propagate dependencies across the whole sentence.

Gated Recurrent Unit (GRU):

Here we are. GRU units have “memory cells” that allow an RNN to capture much longer range dependencies.

Let’s see the difference of a GRU unit compared to the regular RNN unit below:

So the GRU unit has a new variable called c, which is a “memory cell” that provides a bit of memory to remember words from much earlier in the sentence.

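At every step, we compute a candidate value č that could overwrite the memory cell c. The candidate is obtained by applying tanh to a linear function (with weights Wc) of the previous memory cell and the current input.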

Whether č actually replaces c is decided by a gate Γu, which takes a value between 0 and 1 and basically determines how much of c is updated with č.

When Γu = 0 then c<t> = c<t-1> . Therefore the value of c<t> is maintained across many time steps. This allows overcoming the problem of vanishing gradients.

On the other hand, when Γu = 1 then c<t> = č<t>: the memory cell is overwritten with the candidate value.

The equations that govern this unit are:
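
As a minimal sketch of the update-gate-only GRU described above: the candidate is č<t> = tanh(Wc [c<t-1>, x<t>] + bc), the gate is Γu = sigmoid(Wu [c<t-1>, x<t>] + bu), and the new cell is c<t> = Γu * č<t> + (1 - Γu) * c<t-1>. The sigmoid for the gate, the bias terms and all sizes are my assumptions, and the full GRU adds a second “relevance” gate that is left out here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, Wc, bc, Wu, bu):
    """One step of a simplified GRU with only the update gate."""
    concat = np.concatenate([c_prev, x_t])
    c_tilde = np.tanh(Wc @ concat + bc)   # candidate value
    gamma_u = sigmoid(Wu @ concat + bu)   # update gate, between 0 and 1
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev

rng = np.random.default_rng(0)
n_c, n_x = 3, 4
Wc = rng.normal(scale=0.1, size=(n_c, n_c + n_x))
Wu = rng.normal(scale=0.1, size=(n_c, n_c + n_x))
bc = np.zeros(n_c)
c_prev = rng.normal(size=n_c)
x_t = rng.normal(size=n_x)

# A very negative gate bias pushes the gate towards 0: the cell keeps its value.
c_keep = gru_step(c_prev, x_t, Wc, bc, Wu, bu=np.full(n_c, -50.0))
# A very positive gate bias pushes the gate towards 1: the cell takes the candidate.
c_new = gru_step(c_prev, x_t, Wc, bc, Wu, bu=np.full(n_c, 50.0))
```

The two calls at the end reproduce the two extreme cases of the gate discussed above.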

Long-Short Term Memory Unit (LSTM)

LSTM is another, even more powerful, unit to learn very long-range connections in a sequence. The unit consists of three gates: the “forget”, “update” and “output” gate.

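The forget gate Γf plays the role of (1 - Γu) that we saw in the GRU unit, except that here it is a separate gate with its own parameters, so c<t> = Γu * č<t> + Γf * c<t-1>.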

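Additionally, this time the activation a<t> is no longer equal to the memory cell c<t>: the output gate decides how much of the cell is exposed, with a<t> = Γo * tanh(c<t>).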

These are the equations that govern the unit, followed by the visual architecture:
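
A sketch of one LSTM step with the three gates named above, written in NumPy. The parameter dictionary `p`, the sizes and the random initialisation are hypothetical choices for illustration, and the gates are assumed to be sigmoids of a linear function of [a<t-1>, x<t>].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, p):
    """One LSTM step with update (u), forget (f) and output (o) gates."""
    concat = np.concatenate([a_prev, x_t])
    c_tilde = np.tanh(p["Wc"] @ concat + p["bc"])   # candidate memory
    gamma_u = sigmoid(p["Wu"] @ concat + p["bu"])   # update gate
    gamma_f = sigmoid(p["Wf"] @ concat + p["bf"])   # forget gate
    gamma_o = sigmoid(p["Wo"] @ concat + p["bo"])   # output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev      # new memory cell
    a_t = gamma_o * np.tanh(c_t)                    # new activation
    return a_t, c_t

rng = np.random.default_rng(0)
n_a, n_x = 3, 4
p = {}
for name in "cufo":
    p["W" + name] = rng.normal(scale=0.1, size=(n_a, n_a + n_x))
    p["b" + name] = np.zeros(n_a)

a, c = np.zeros(n_a), np.zeros(n_a)
for x_t in rng.normal(size=(5, n_x)):  # run over a 5-step toy sequence
    a, c = lstm_step(a, c, x_t, p)
```

Note that the memory cell `c` and the activation `a` are carried forward separately from step to step.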

An LSTM unit is more powerful and flexible than a GRU unit, and it’s the more proven choice. GRU units, on the other hand, are more recent and simpler, so it’s easier to build bigger models with them.

Bidirectional RNN (BRNN):

These networks take not only information from earlier in the sequence, but also from later on. They are basically like a patient listener. They “listen” to the whole sentence before making a prediction, which is nice, generally speaking. Not a lot of humans can do that!

Their disadvantage comes precisely from that characteristic, though, as they need the entire sequence of data before predicting. And we are impatient, as we want machines to respond quickly… we can’t help ourselves.

The graph is acyclic, and it is represented this way:
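
The idea can be sketched with two plain RNNs, one reading the sequence left to right and one reading it right to left, whose states are concatenated at each time step. The `run_rnn` helper, the sizes and the random weights are assumptions for illustration; in practice the units would usually be GRUs or LSTMs.

```python
import numpy as np

def run_rnn(xs, Waa, Wax):
    """Run a plain tanh RNN over xs and return the hidden state at every step."""
    a = np.zeros(Waa.shape[0])
    states = []
    for x_t in xs:
        a = np.tanh(Waa @ a + Wax @ x_t)
        states.append(a)
    return states

rng = np.random.default_rng(0)
n_x, n_a, T = 4, 3, 5
xs = list(rng.normal(size=(T, n_x)))

# Separate parameters for each direction.
fwd_params = (rng.normal(scale=0.5, size=(n_a, n_a)),
              rng.normal(scale=0.5, size=(n_a, n_x)))
bwd_params = (rng.normal(scale=0.5, size=(n_a, n_a)),
              rng.normal(scale=0.5, size=(n_a, n_x)))

forward_states = run_rnn(xs, *fwd_params)
# Feed the reversed sequence, then flip the states back into time order.
backward_states = run_rnn(xs[::-1], *bwd_params)[::-1]

# The features at step t combine both directions.
combined = [np.concatenate([f, b])
            for f, b in zip(forward_states, backward_states)]
```

A prediction at step t would then be computed from `combined[t]`, which is why the whole sequence has to be available first.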

Deep RNN:

As you might have predicted, like in a CNN, we can have multiple hidden layers throughout the network. Here we have a stack of multiple layers of RNNs.

These networks can learn complex functions. We generally don’t use many layers (three is already a lot), because the temporal dimension already makes these networks large and expensive to train.
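
A small sketch of the stacking, assuming plain tanh RNN layers: each layer at time t receives its own state from the previous time step and the state of the layer below at the same time step. The `deep_rnn_forward` helper and the layer sizes are hypothetical.

```python
import numpy as np

def deep_rnn_forward(xs, layers):
    """layers: list of (Waa, Wax, ba) tuples, bottom layer first."""
    states = [np.zeros(Waa.shape[0]) for Waa, _, _ in layers]
    top_outputs = []
    for x_t in xs:
        layer_input = x_t
        for l, (Waa, Wax, ba) in enumerate(layers):
            # Same layer at the previous step, plus the layer below at this step.
            states[l] = np.tanh(Waa @ states[l] + Wax @ layer_input + ba)
            layer_input = states[l]
        top_outputs.append(layer_input)
    return top_outputs

rng = np.random.default_rng(0)
n_x, T = 4, 6
sizes = [5, 5, 3]  # three stacked layers
layers = []
n_in = n_x
for n_a in sizes:
    layers.append((rng.normal(scale=0.5, size=(n_a, n_a)),
                   rng.normal(scale=0.5, size=(n_a, n_in)),
                   np.zeros(n_a)))
    n_in = n_a  # the next layer takes this layer's state as its input

outputs = deep_rnn_forward(rng.normal(size=(T, n_x)), layers)
```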

This blog is based on Andrew Ng’s lectures at

Source: Deep Learning on Medium