Source: Deep Learning on Medium
Hi All, welcome to my blog “Long Short Term Memory and Gated Recurrent Unit’s Explained — ELI5 Way”. This is my last blog of the year 2019. My name is Niranjan Kumar and I’m a Senior Consultant, Data Science at Allstate India.
Recurrent Neural Networks (RNNs) are a type of neural network in which the output from the previous step is fed as input to the current step.
RNNs are mainly used for:
- Sequence Classification — Sentiment Classification & Video Classification
- Sequence Labelling — Part of speech tagging & Named entity recognition
- Sequence Generation — Machine translation & Transliteration
Citation Note: The content and the structure of this article are based on my understanding of the deep learning lectures from One-Fourth Labs — PadhAI.
In a recurrent neural network, the old information gets morphed by the current input at each time step. For longer sentences, we can imagine that after t time steps the information stored at time step t-k (k << t) will have undergone a gradual process of transformation. During back-propagation, the error signal has to flow back through this long chain of time steps to update the parameters of the network and minimize its loss.
Consider a scenario where we need to compute the loss of the network at time step four, L₄. Assume that the loss is high because the hidden representation at time step one, s₁, was computed incorrectly, and that the error at s₁ is due to incorrect values in the weight matrix W. This information has to be back-propagated to W so that its values can be corrected.
To propagate the information back to W, we use the chain rule. In a nutshell, the chain rule turns the gradient into a product of the partial derivatives of the hidden representations at successive time steps.
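For the four-step scenario above, the chain rule can be sketched as follows. The notation is an assumption layered on the text: s_k is the hidden representation at step k, and the superscript + marks the direct derivative of s_k with respect to W, treating s_{k-1} as a constant.

```latex
\frac{\partial L_4}{\partial W}
  = \sum_{k=1}^{4}
    \frac{\partial L_4}{\partial s_4}
    \left( \prod_{j=k+1}^{4} \frac{\partial s_j}{\partial s_{j-1}} \right)
    \frac{\partial^{+} s_k}{\partial W}
```

The term for k = 1 contains the longest product (three factors here), and it is this product that grows with the length of the sequence.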
If a long sequence has more than 100 hidden representations, back-propagation has to multiply together just as many of these partial derivatives. If the partial derivatives are consistently large (greater than 1 in magnitude), the product grows exponentially with the sequence length and the gradient explodes. This is the problem of exploding gradients.
If the partial derivatives are consistently small (less than 1 in magnitude), the product shrinks towards zero and the gradient vanishes, making the network hard to train. This is the problem of vanishing gradients.
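A rough way to get a feel for both problems is to multiply the same number by itself 100 times. The sketch below is a toy illustration, not a real back-propagation pass: it assumes every per-step derivative has the same magnitude (0.9 or 1.1, both picked arbitrarily), and the function name is made up for this example.

```python
import math

def gradient_factor_product(factor, steps):
    """Multiply `steps` copies of the same per-step derivative magnitude."""
    return math.prod([factor] * steps)

# Slightly below 1 at every step: the product collapses towards zero.
vanishing = gradient_factor_product(0.9, 100)

# Slightly above 1 at every step: the product blows up.
exploding = gradient_factor_product(1.1, 100)

print(f"0.9 multiplied 100 times: {vanishing:.3e}")  # on the order of 1e-5
print(f"1.1 multiplied 100 times: {exploding:.3e}")  # on the order of 1e4
```

Even a small, consistent deviation from 1 at each step is enough to make the overall gradient practically zero or very large after 100 steps.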
White Board Analogy
Consider that you have a whiteboard of fixed size. Over time, the whiteboard becomes so messy that you can’t extract any information from it. In the same way, for longer sequences in an RNN, the computed hidden state representation becomes messy and it becomes difficult to extract relevant information from it.
Since RNNs have a finite state size, instead of extracting information from all the timesteps to compute the hidden state representation, we need to follow a selective read, write and forget strategy while extracting information from different timesteps.
White Board Analogy — RNN Example
Let’s see how the selective read, write and forget strategy works, taking the example of sentiment analysis using an RNN.
Review: The first half of the movie is dry but the second half really picked up the pace. The lead actor delivered an amazing performance.
The movie review started with a negative sentiment, but from there on it changed to a positive response. In the context of selective read, write and forget:
- We want to forget the information added by stop words (a, the, is, etc.).
- We want to selectively read the information added by sentiment-bearing words (amazing, awesome, etc.).
- We want to selectively write the hidden state information from the current word to the new hidden state.
Using the selective read, write and forget strategy, we control the flow of information so that the network doesn’t suffer from the problem of short-term memory and the finite-sized state vector is used effectively.
Long Short Term Memory — LSTM
LSTMs were introduced to overcome the problems of vanilla RNNs, such as short-term memory and vanishing gradients. In LSTMs, we can selectively read, write and forget information by regulating the flow of information using gates.
In the following few sections, we will discuss how we can implement the selective read, write and forget strategy. We will also discuss how we know which information to read and which information to forget.
In the vanilla RNN, the hidden representation (sₜ) is computed as a function of the previous time step’s hidden representation (sₜ₋₁) and the current input (xₜ), along with a bias (b).
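Written out, one common form of this update is shown below. The sigmoid nonlinearity σ is an assumption (the text only says "a function of", and tanh is also widely used); W multiplies the previous state and U multiplies the current input.

```latex
s_t = \sigma\left( W s_{t-1} + U x_t + b \right)
```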
Here, we are taking all the values of sₜ₋₁ and computing the hidden state representation at the current time (sₜ).
In selective write, instead of writing all the information in sₜ₋₁ to compute the hidden representation (sₜ), we pass only some of the information in sₜ₋₁ on to the next state to compute sₜ. One way of doing this is to assign each element a value between 0 and 1 that determines what fraction of the current state information is passed on to the next hidden state.
We implement selective write by multiplying every element of sₜ₋₁ by a value between 0 and 1, held in a vector oₜ₋₁, to compute a new vector hₜ₋₁. We then use this new vector to compute the hidden representation sₜ.
We do not set oₜ₋₁ by hand: it is computed from parameters that are learned from the data, just like U and W, using gradient descent optimization. The mathematical equation for oₜ₋₁ is given below:
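A form consistent with the input gate described later in this article is the following. The gate-specific parameter names W_o, U_o and b_o are assumptions, as is the choice that the gate at step t-1 looks at h_{t-2} and x_{t-1}:

```latex
o_{t-1} = \sigma\left( W_o h_{t-2} + U_o x_{t-1} + b_o \right)
```

Because σ is a sigmoid, every element of o_{t-1} lies between 0 and 1.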
Once oₜ₋₁ has been computed, it is multiplied element-wise with sₜ₋₁ to get the new vector hₜ₋₁. Since oₜ₋₁ controls what information goes on to the next hidden state, it is called the Output Gate.
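The selective write step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the parameter names `W_o`, `U_o` and `b_o`, the helper functions and all the numbers are made up for the example. It follows the text literally by multiplying the gate with sₜ₋₁ directly, whereas many LSTM formulations squash the state with tanh first.

```python
import math

def sigmoid(z):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def output_gate(h_prev2, x_prev, W_o, U_o, b_o):
    """o_{t-1} = sigmoid(W_o h_{t-2} + U_o x_{t-1} + b_o); names are assumed."""
    pre_activation = [
        wh + ux + b
        for wh, ux, b in zip(matvec(W_o, h_prev2), matvec(U_o, x_prev), b_o)
    ]
    return [sigmoid(z) for z in pre_activation]

def selective_write(s_prev, o_prev):
    """h_{t-1} = o_{t-1} * s_{t-1}, element by element."""
    return [o * s for o, s in zip(o_prev, s_prev)]

# Toy values chosen by hand for illustration only.
W_o = [[0.5, -0.3], [0.1, 0.8]]
U_o = [[1.0, 0.0], [0.0, -1.0]]
b_o = [0.0, 0.0]
h_prev2 = [0.2, -0.4]  # h_{t-2}
x_prev = [1.0, 0.5]    # x_{t-1}
s_prev = [2.0, -3.0]   # s_{t-1}

o_prev = output_gate(h_prev2, x_prev, W_o, U_o, b_o)
h_prev = selective_write(s_prev, o_prev)
print("gate o_{t-1}:", o_prev)
print("written state h_{t-1}:", h_prev)
```

Each element of the gate acts like a dimmer switch: a value near 0 blocks that element of the state, and a value near 1 lets it through almost unchanged.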
After computing the new vector hₜ₋₁, we compute an intermediate hidden state vector Šₜ. In this section, we will discuss how to implement selective read to get our final hidden state sₜ.
The mathematical equation for Šₜ is given below:
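Mirroring the vanilla RNN update, but with hₜ₋₁ in place of sₜ₋₁, the intermediate state can be written as follows. The sigmoid σ is an assumption (tanh is also common), and Šₜ is typeset here as \tilde{s}_t:

```latex
\tilde{s}_t = \sigma\left( W h_{t-1} + U x_t + b \right)
```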
- Šₜ captures all the information from the previous state hₜ₋₁ and the current input xₜ.
- However, we may not want to use all of this new information. We want to read from it selectively before constructing the new cell state, i.e., we would like to read only some of the information in Šₜ to compute sₜ.
Just as with the output gate, we multiply Šₜ element-wise by a new vector iₜ whose values lie between 0 and 1. Since the vector iₜ controls what information flows in from the current input, it is called the Input Gate.
The mathematical equation for iₜ is given below:
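One way to write it, with W_i, U_i and b_i as assumed names for the gate's own weights and bias:

```latex
i_t = \sigma\left( W_i h_{t-1} + U_i x_t + b_i \right)
```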
In the input gate, we pass the previous time step’s hidden state information hₜ₋₁ and the current input xₜ, along with a bias, into a sigmoid function. The output of this computation lies between 0 and 1, and it decides how much information flows in from the current input and the previous time step’s hidden state. A value close to 0 means not important, and a value close to 1 means important.
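Selective read can be sketched the same way as selective write. This is a toy example under stated assumptions: the parameter names (`W`, `U`, `b` for the intermediate state and `W_i`, `U_i`, `b_i` for the gate), the sigmoid used for Šₜ and all the numbers are illustrative choices. It stops at the gated vector iₜ * Šₜ and does not compute the final sₜ.

```python
import math

def sigmoid(z):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def affine(W, U, b, h, x):
    """Compute W h + U x + b for list-based matrices and vectors."""
    return [wh + ux + bias for wh, ux, bias in zip(matvec(W, h), matvec(U, x), b)]

def selective_read(h_prev, x_t, W, U, b, W_i, U_i, b_i):
    """Return the intermediate state, the input gate, and their element-wise product."""
    s_tilde = [sigmoid(z) for z in affine(W, U, b, h_prev, x_t)]    # candidate state
    i_t = [sigmoid(z) for z in affine(W_i, U_i, b_i, h_prev, x_t)]  # input gate
    read = [i * s for i, s in zip(i_t, s_tilde)]                    # gated candidate
    return s_tilde, i_t, read

# Toy values chosen by hand for illustration only.
W = [[0.4, 0.1], [-0.2, 0.3]]
U = [[0.7, -0.5], [0.2, 0.9]]
b = [0.0, 0.1]
W_i = [[0.3, -0.6], [0.5, 0.2]]
U_i = [[-1.0, 0.4], [0.6, 0.8]]
b_i = [0.0, 0.0]
h_prev = [0.9, -1.2]  # h_{t-1}
x_t = [1.0, 0.5]      # current input

s_tilde, i_t, read = selective_read(h_prev, x_t, W, U, b, W_i, U_i, b_i)
print("intermediate state:", s_tilde)
print("input gate i_t:", i_t)
print("selectively read:", read)
```

Note that the gate and the intermediate state see the same inputs (hₜ₋₁ and xₜ) but use separate weights, so the network can learn what to propose and how much of it to let in independently.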