Recurrent Neural Networks (RNNs)

Original article was published by Iprathore on Artificial Intelligence on Medium

Recurrent neural networks are variants of vanilla neural networks that are tailored to learn sequential patterns.

Ever wondered how 🧐 Google’s autocomplete feature predicts the rest of the words a user is typing?

A collection of large volumes of most frequently occurring consecutive words is stored in a database, and this data is fed to a recurrent neural network. The network analyzes the data by finding the sequence of words occurring frequently and builds a model to predict the next word in the sentence.

Some other Applications –

  • Sentiment Classification
  • Image Captioning
  • Language Translation
  • Time Series Prediction

Why do we need RNNs?

In sequential data, entities occur in a particular order. If you break the order, you don’t have a meaningful sequence anymore. There are multiple such tasks in everyday life which get completely disrupted when their sequence is disturbed.

For example, you could have a sequence of words which makes up a document. If you jumble the words, you will end up with a nonsensical document. An RNN captures the sequential information present in the input data, i.e. the dependency between the words in the text, while making predictions. Similarly, you could have a sequence of images which makes up a video. If you shuffle the frames, you’ll end up having a different video.

A recurrent neuron stores the state of a previous input and combines it with the current input, thereby preserving some relationship between the current input and the previous input.

The main difference between normal neural nets and RNNs is that RNNs have two ‘dimensions’: time t (i.e. along the sequence length) and depth l (the usual layers). In fact, in RNNs it is somewhat incomplete to say ‘the output at layer l’; we rather say ‘the output at layer l and time t’. (Time here is not necessarily a timestamp but the position in the sequence.)

The most crucial idea of RNNs which makes them suitable for sequence problems is that:

The state of the network updates itself as it sees new elements in the sequence. This is the core idea of an RNN: it updates what it has learnt as it sees new inputs in the sequence. The ‘next state’ is influenced by the previous one, and so the network can learn the dependence between subsequent states, which is a characteristic of sequence problems.

Equation for RNN🖋

Let’s check out the maths behind it 🔬

Let’s quickly recall the feedforward equations of a normal neural network:


  • Wl : weight matrix at layer l
  • bl : bias at layer l
  • zl : input into layer l
  • fl : activation function at layer l
  • al : output (activations) of layer l
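With the symbols listed above, and writing the layer index l as a superscript, the standard feedforward equations can be written as follows. Taking a^{0} to be the network input x is a notational assumption made here.

```latex
z^{l} = W^{l} a^{l-1} + b^{l}, \qquad a^{l} = f^{l}\left(z^{l}\right)
```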

We say that there is a recurrent relationship between al(t+1) and its previous state al(t), and hence the name Recurrent Neural Networks.
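One common way to write this recurrence explicitly is sketched below. The split into a feedforward matrix W_F^l (acting on the layer below) and a recurrent matrix W_R^l (acting on the layer's own previous state) is notation assumed here for illustration; the subscript is the time step.

```latex
z^{l}_{t+1} = W_F^{l}\, a^{l-1}_{t+1} + W_R^{l}\, a^{l}_{t} + b^{l}, \qquad a^{l}_{t+1} = f^{l}\left(z^{l}_{t+1}\right)
```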

We provide an input to the hidden layer at each step. A recurrent neuron keeps a summary of the inputs from all the previous steps and merges that information with the current step’s input. Thus it also captures some information about the correlation between the current step and the previous steps. The decision at time step t-1 affects the decision taken at time t.

If the hidden layers at the different time steps have different weights and biases, each of these layers behaves independently and they cannot be combined. To combine these hidden layers, we use the same weights and bias for all of them.

Since the weights and bias of all the hidden layers are now the same, these layers can be rolled together into a single recurrent layer.

So they start looking somewhat like this
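As a minimal sketch of this rolled-up layer, the loop below applies one set of weights at every time step. The names rnn_step, rnn_forward, W_x and W_h, the tanh activation, and the toy numbers are all assumptions made for illustration.

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent step: h_t = tanh(W_x x_t + W_h h_prev + b)."""
    h_t = []
    for i in range(len(b)):
        z = b[i]
        z += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        z += sum(W_h[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(z))
    return h_t

def rnn_forward(xs, h0, W_x, W_h, b):
    """Reuse the same weights at every time step and collect the states."""
    states = []
    h = h0
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h, b)
        states.append(h)
    return states

# Toy sizes (made up): 2 input features, 3 hidden units, 4 time steps.
W_x = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
W_h = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.1], [0.3, -0.2, 0.1]]
b = [0.0, 0.1, -0.1]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
states = rnn_forward(xs, [0.0, 0.0, 0.0], W_x, W_h, b)
```

Each entry of states is the hidden state after one more input has been seen, so the last entry depends on the whole sequence.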

How various types of sequences are fed to RNNs 🎮

Consider the following text string: “A girl walked into a bar, and she said ‘Can I have a drink please?’. The bartender said ‘Certainly {}’”. There are many options for what could fill in the {} symbol in the above string, for instance, “miss”, “ma’am” and so on. However, other words could also fit, such as “sir”, “Mister” etc. In order to get the correct gender of the noun, the neural network needs to “recall” that two previous words designating the likely gender (i.e. “girl” and “she”) were used.

For instance, we first supply the word vector for “A” to the network F. The output of the nodes in F is fed into the “next” network and also acts as a stand-alone output (h0). The next network F (really the same network) at time t=1 takes the word vector for “girl” and the previous output h0 into its hidden nodes, producing the next output h1, and so on.

In other words, instead of giving all the inputs together as in an ANN, we give the inputs one after another in sequence, each separated by one time step ‘t’.

RNNs: Simplified Notations🤓

The RNN feedforward equations are:

This form is not only more concise but also more computationally efficient. Rather than doing two matrix multiplications and adding them, the network can do one large matrix multiplication.
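To see why the two forms agree, here is a small check. It assumes the simplified notation stacks a feedforward matrix W_F and a recurrent matrix W_R side by side into one matrix W = [W_F | W_R], and stacks the two activation vectors into one; the names and the toy numbers are hypothetical.

```python
def matvec(W, v):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

# Hypothetical toy weights: 2 hidden units, 3 input features.
W_F = [[1.0, 2.0, 0.0], [0.0, -1.0, 1.0]]   # feedforward weights
W_R = [[0.5, 0.0], [1.0, 2.0]]              # recurrent weights
b = [1.0, -1.0]
a_below = [1.0, 2.0, 3.0]   # activations from the layer below at time t
a_prev = [2.0, 4.0]         # this layer's activations at time t-1

# Form 1: two separate products, then add.
z_two = [f + r + bi
         for f, r, bi in zip(matvec(W_F, a_below), matvec(W_R, a_prev), b)]

# Form 2: one product with the concatenated matrix and the stacked input.
W = [row_f + row_r for row_f, row_r in zip(W_F, W_R)]
stacked = a_below + a_prev
z_one = [s + bi for s, bi in zip(matvec(W, stacked), b)]

print(z_two, z_one)  # both are [7.0, 10.0]
```

In a real framework the single large product is typically what gets handed to an optimized matrix-multiplication routine.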

Types of RNN Architectures 🏗

There are four types of Recurrent Neural Networks:

  1. One to One : This type of neural network is known as the Vanilla Neural Network. It’s used for general machine-learning problems which have a single input and a single output. Example: a prediction task like image classification.
  2. One to Many : This type of neural network has a single input and multiple outputs. An example of this is image captioning.

This type of architecture is generally used as a generative model. Popular uses of this architecture include generating music (given a genre, for example), generating landscape images given a keyword, generating text given an instruction or topic, etc.

  3. Many to One : This RNN takes a sequence of inputs and generates a single output. Sentiment analysis is a good example of this kind of network, where a given sentence can be classified as expressing positive or negative sentiment.

  4. Many to Many : This RNN takes a sequence of inputs and generates a sequence of outputs. It can further be of two types:

  • Many to Many : Equal input and output lengths - In this type of RNN, the input (X) and output (Y) are both sequences of multiple entities spread over timesteps. In this architecture, the network emits an output at each timestep, and there is a one-to-one correspondence between the input and output at each timestep. You can use this architecture for various tasks, e.g. to build a part-of-speech tagger where each word in the input sequence is tagged with its part of speech at every timestep.
  • Many to Many : Unequal input and output lengths - In the previous many-to-many architecture, we assumed that the lengths of the input and output sequences are equal. However, this is not always the case. There are many problems where the lengths of the input and output sequences are different. For example, consider the task of machine translation: the length of a Hindi sentence can be different from that of the corresponding English sentence. The encoder-decoder architecture is used in tasks where the input and output sequences are of different lengths.

The encoder-decoder architecture comprises two components, an encoder and a decoder, both of which are RNNs themselves. The output of the encoder, called the encoded vector (and sometimes also the ‘context vector’), captures a representation of the input sequence. The encoded vector is then fed to the decoder RNN, which produces the output sequence.

The input and output can now be of different lengths since there is no one-to-one correspondence between them anymore. This architecture gives the RNNs much-needed flexibility for real-world applications such as language translation.
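The difference between these architectures is mostly about which states are read out. The sketch below uses a scalar RNN with made-up fixed weights; step, many_to_many, many_to_one and encode_decode are hypothetical names, and feeding the decoder's previous output back as its next input is one common choice assumed here, not something the text specifies.

```python
import math

def step(x, h, w_x=0.8, w_h=0.5):
    """Scalar recurrent step with hypothetical fixed weights."""
    return math.tanh(w_x * x + w_h * h)

def many_to_many(xs):
    """Emit one output per input time step (equal lengths)."""
    h, outputs = 0.0, []
    for x in xs:
        h = step(x, h)
        outputs.append(h)
    return outputs

def many_to_one(xs):
    """Read the whole sequence and keep only the final state."""
    return many_to_many(xs)[-1]

def encode_decode(xs, n_outputs):
    """Encoder compresses xs into a context; decoder unrolls n_outputs steps."""
    context = many_to_one(xs)          # the 'encoded vector' (a scalar here)
    h, y, outputs = context, 0.0, []
    for _ in range(n_outputs):
        h = step(y, h)                 # previous output is fed back as input
        y = h
        outputs.append(y)
    return outputs

xs = [0.2, -0.1, 0.4]
tags = many_to_many(xs)            # 3 inputs, 3 outputs
label_score = many_to_one(xs)      # 3 inputs, 1 output
translation = encode_decode(xs, 5) # 3 inputs, 5 outputs
```

Note that the decoder's output length is chosen independently of the input length, which is what removes the one-to-one correspondence.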

Backpropagation Through Time (BPTT)⏳

RNNs use a slightly modified version of backpropagation to update the weights. In a standard neural network, the errors are propagated from the output layer back to the input layer. In RNNs, however, errors are propagated not only back through the layers but also back along the time axis.

Standard backpropagation cannot be applied directly to a recurrent neural network because of the recurrent (loop) connections. This is addressed with a modification of the backpropagation technique called Backpropagation Through Time, or BPTT.

In general we can say –

A recurrent neural network is shown one input each timestep and predicts one output. Conceptually, BPTT works by unrolling all input timesteps. Each timestep has one input, one copy of the network, and one output. Errors are then calculated and accumulated for each timestep. The network is rolled back up and the weights are updated.
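The unroll-accumulate-update idea can be sketched for the smallest possible case: a scalar RNN h_t = tanh(w*h_{t-1} + u*x_t) with a squared-error loss at every time step. The model, the loss and all names here are assumptions chosen to keep the example short, not the setup of any particular library.

```python
import math

def forward(w, u, xs):
    """Return hidden states h_0..h_T of h_t = tanh(w*h_{t-1} + u*x_t), h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def loss(w, u, xs, ys):
    """Sum over time steps of 0.5 * (h_t - y_t)^2."""
    hs = forward(w, u, xs)
    return sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], ys))

def bptt(w, u, xs, ys):
    """Accumulate dL/dw and dL/du by walking the unrolled network backwards."""
    hs = forward(w, u, xs)
    dw = du = 0.0
    dh_next = 0.0                      # gradient arriving from later time steps
    for t in range(len(xs), 0, -1):
        dh = (hs[t] - ys[t - 1]) + dh_next
        dz = dh * (1.0 - hs[t] ** 2)   # derivative of tanh
        dw += dz * hs[t - 1]           # same w is used at every step, so sum up
        du += dz * xs[t - 1]
        dh_next = dz * w               # pass the error one step back in time
    return dw, du

w, u = 0.7, -0.4
xs = [1.0, 0.5, -1.0, 2.0]
ys = [0.1, -0.2, 0.3, 0.0]
dw, du = bptt(w, u, xs, ys)
```

A quick sanity check for a hand-written backward pass like this is to compare dw and du with finite-difference estimates of loss.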

But loss calculation depends on the type of task and the architecture.

  • In a many-to-one architecture (such as classifying a sentence as correct/incorrect), the loss is a measure of the difference between the predicted and the actual label. The loss is computed and backpropagated after the entire sequence has been digested by the network.
  • On the other hand, in a many-to-many architecture, the network emits an output at multiple time steps, and the loss is calculated at each time step. The total loss (= the sum of the losses at each time step) is propagated back into the network after the entire sequence has been ingested.

We can now add the losses for all the sequences (i.e. for a batch of input sequences) and backpropagate the total loss into the network.
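The three cases above can be sketched as follows. Squared error is used only as a stand-in loss (an assumption; a classification task would more typically use cross-entropy), and the output and target values are made up.

```python
def squared_error(pred, target):
    """Stand-in per-output loss (an assumption for this sketch)."""
    return (pred - target) ** 2

def many_to_one_loss(outputs, label):
    """Only the output after the last time step is compared with the label."""
    return squared_error(outputs[-1], label)

def many_to_many_loss(outputs, targets):
    """Total loss is the sum of the losses at each time step."""
    return sum(squared_error(o, t) for o, t in zip(outputs, targets))

outputs = [0.5, 1.0, 2.0]
single = many_to_one_loss(outputs, 1.0)                 # 1.0
per_step = many_to_many_loss(outputs, [0.0, 1.0, 1.0])  # 0.25 + 0.0 + 1.0 = 1.25

# Batch of sequences: add up the per-sequence losses before backpropagating.
batch_outputs = [[0.5, 1.0, 2.0], [1.0, 1.0, 1.0]]
batch_targets = [[0.0, 1.0, 1.0], [0.0, 1.0, 1.0]]
batch_loss = sum(many_to_many_loss(o, t)
                 for o, t in zip(batch_outputs, batch_targets))  # 2.25
```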

BPTT can be computationally expensive as the number of timesteps increases. If input sequences comprise thousands of timesteps, then this is the number of derivatives required for a single weight update.

Limitations of RNNs 😥

Training RNNs is considered to be difficult. When trying to preserve long-range dependencies, training often runs into one of two problems: Exploding Gradients (the gradients grow so large that the weight updates become unstable) or Vanishing Gradients (the gradients shrink so much that the early time steps barely learn). Which of the two occurs depends partly on the activation function used in the hidden layer: a saturating activation such as the sigmoid tends to produce vanishing gradients, while an unbounded activation such as the rectified linear unit makes exploding gradients more likely.

RNNs are designed to learn patterns in sequential data, i.e. patterns across ‘time’. RNNs are also capable of learning what are called long-term dependencies. For example, in a machine translation task, we expect the network to learn the interdependencies between the first and the eighth word, learn the grammar of the languages, etc. This is accomplished through the recurrent layers of the net — each state learns the cumulative knowledge of the sequence seen so far by the network.

Although this feature is what makes RNNs so powerful, it introduces a severe problem — as the sequences become longer, it becomes much harder to backpropagate the errors back into the network. The gradients ‘die out’ by the time they reach the initial time steps during backpropagation.
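A rough way to see this: during backpropagation through time, the error is multiplied by a similar factor at every step it travels back. The sketch below assumes, for simplicity, that this factor is the same constant at every step (in a real network it varies with the weights and activations); backprop_factor is a hypothetical helper.

```python
def backprop_factor(per_step_factor, n_steps):
    """Gradient scale after being multiplied by the same factor n_steps times."""
    g = 1.0
    for _ in range(n_steps):
        g *= per_step_factor
    return g

vanishing = backprop_factor(0.5, 50)   # factor below 1: shrinks towards zero
exploding = backprop_factor(1.5, 50)   # factor above 1: blows up
print(vanishing, exploding)
```

With a factor of 0.5 the signal from 50 steps back is smaller than 1e-12 of its original size, while a factor of 1.5 inflates it by more than a million times.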

You could still use some workarounds to solve the problem of exploding gradients. You can impose an upper limit to the gradient while training, commonly known as gradient clipping. By controlling the maximum value of a gradient, you could do away with the problem of exploding gradients.
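A minimal sketch of gradient clipping is shown below. It clips by the overall L2 norm of the gradient vector, which is one common variant (clipping each value separately is another); clip_by_norm and the threshold of 5.0 are assumptions for illustration.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)             # small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]  # same direction, smaller length

clipped = clip_by_norm([30.0, -40.0], max_norm=5.0)   # norm 50 is cut down to 5
unchanged = clip_by_norm([0.3, 0.4], max_norm=5.0)    # norm 0.5 is left alone
```

Clipping keeps the direction of the update and only limits its size, which is why it helps with exploding gradients but does nothing for vanishing ones.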

But the problem of vanishing gradients is a more serious one. It is so common in RNNs that it makes vanilla RNNs impractical for many applications that involve long sequences.

To solve the vanishing gradients problem, many attempts have been made to tweak vanilla RNNs such that the gradients don’t die when sequences get long. The most popular and successful of these attempts has been the long short-term memory network, or LSTM. LSTMs have proven to be so effective that they have almost replaced vanilla RNNs.

Soon, I will be writing an article on LSTMs.

That’s all for now 🤗

If you liked the article, show your support by clapping for it. This article is basically a compilation of many articles from Medium, Analytics Vidhya, upGrad material, etc.

If you are also learning machine learning like me, follow me for more articles. Let’s go on this trip together 🙂

You can also follow me on LinkedIn.