Source: Deep Learning on Medium
Suppose you are watching a movie. At any point, you can relate a scene to an earlier one because you have watched the movie up to that point, i.e., you have the context and you remember what happened before that scene. This is how humans think. You don’t think from scratch every second. You understand new things based on your previous understanding. Your thoughts have persistence. This is the idea behind Recurrent Neural Networks (RNNs).
In a traditional neural network, every input is processed independently of the others. The network has no context, which makes it hard to use in a lot of scenarios. RNNs, on the other hand, use their internal state (memory) to process a sequence of inputs. This makes them useful for sequence tasks that require previous context, like Speech Recognition, Music Generation, etc.
Recurrent Neural Network Model
An RNN model reads inputs one at a time and remembers some information/context through hidden layer activations that get passed from one time step to the next. This allows an RNN to use information from the past when it processes later inputs. There is also something called a Bidirectional RNN, which can take context from both the past and the future.
An RNN can be seen as a repetition of a single RNN cell. Let’s look at a basic RNN cell.
Every RNN cell takes the current input x<t> and the previous hidden state a<t-1>, which contains information from the past. It combines them and applies an activation function (usually tanh) to get a value a<t>, which is passed to the next RNN cell and is also used to predict the output y<t>. So at every time step, we use the information from the past and the current input to form the hidden state for the next time step. This is how an RNN cell remembers the context of the previous time steps.
A complete RNN consists of a repetition of the cell above. If we have a sequence of data with 5 time steps, we will have 5 copies of the cell, all sharing the same parameters, and running them one after the other is the forward propagation of the RNN.
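To make the cell concrete, here is a minimal NumPy sketch of a single step. The weight names (Wax, Waa, Wya), the biases (ba, by), the layer sizes and the softmax on the output are assumptions made for illustration; the post itself only fixes the inputs x<t> and a<t-1>, the tanh activation and the outputs a<t> and y<t>.

```python
import numpy as np

def rnn_cell_forward(x_t, a_prev, Wax, Waa, Wya, ba, by):
    """One step of a basic RNN cell (parameter names are illustrative)."""
    # Combine the current input with the previous hidden state, then apply tanh.
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    # Predict the output for this time step from the new hidden state.
    # A softmax is assumed here, which suits classification-style outputs.
    logits = Wya @ a_t + by
    exp = np.exp(logits - logits.max(axis=0, keepdims=True))
    y_t = exp / exp.sum(axis=0, keepdims=True)
    return a_t, y_t

# Hypothetical sizes: 3 input features, 4 hidden units, 2 output classes.
rng = np.random.default_rng(0)
n_x, n_a, n_y = 3, 4, 2
x_t = rng.standard_normal((n_x, 1))
a_prev = np.zeros((n_a, 1))          # no context yet at the first time step
Wax = rng.standard_normal((n_a, n_x)) * 0.1
Waa = rng.standard_normal((n_a, n_a)) * 0.1
Wya = rng.standard_normal((n_y, n_a)) * 0.1
ba = np.zeros((n_a, 1))
by = np.zeros((n_y, 1))

a_t, y_t = rnn_cell_forward(x_t, a_prev, Wax, Waa, Wya, ba, by)
print(a_t.shape, y_t.shape)  # (4, 1) (2, 1)
```

The returned a_t is what would be handed to the next cell as its a_prev.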
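The repetition can be sketched as a plain loop that feeds each hidden state into the next step. This is only a sketch: the weight names and sizes are assumptions, and the output prediction is left out to keep the focus on how the same parameters are reused at each of the 5 time steps.

```python
import numpy as np

def rnn_forward(xs, a0, Wax, Waa, ba):
    """Run one cell over every time step of xs (shape: T x n_x)."""
    a_t = a0
    hidden_states = []
    for x_t in xs:  # one iteration per time step
        # The same Wax, Waa and ba are used at every step.
        a_t = np.tanh(Waa @ a_t + Wax @ x_t + ba)
        hidden_states.append(a_t)
    return np.stack(hidden_states)

# Hypothetical sizes: 5 time steps, 3 input features, 4 hidden units.
rng = np.random.default_rng(0)
T, n_x, n_a = 5, 3, 4
xs = rng.standard_normal((T, n_x))
a0 = np.zeros(n_a)
Wax = rng.standard_normal((n_a, n_x)) * 0.1
Waa = rng.standard_normal((n_a, n_a)) * 0.1
ba = np.zeros(n_a)

states = rnn_forward(xs, a0, Wax, Waa, ba)
print(states.shape)  # (5, 4)
```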
Backpropagation in an RNN is called Backpropagation Through Time (BPTT). At every time step, we calculate a loss (for example, cross-entropy loss), and the total loss is the sum of the losses at the individual time steps. The gradient of this total loss is then propagated backwards from the last time step to the first. Because the same parameters are shared across all time steps, their gradients are summed over the steps before we update the parameters with an optimization technique like Gradient Descent. So, backward propagation in an RNN retraces the forward propagation steps in the opposite direction. If you imagine it, this is like going backwards in time to do something (update the parameters here), which kind of gives us a cool feeling.
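A compact way to see BPTT is to write it out for a tiny RNN. The sketch below assumes a tanh hidden layer, a softmax output with cross-entropy loss, and illustrative parameter names (Wax, Waa, Wya, ba, by); none of these names come from the post. The forward loop sums the per-step losses, and the backward loop walks the time steps in reverse while accumulating gradients for the shared parameters.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_and_grads(xs, targets, params):
    """Total cross-entropy loss over all time steps and its gradients."""
    Wax, Waa, Wya, ba, by = params
    T = len(xs)
    a = [np.zeros(Waa.shape[0])]  # a[0] is the initial hidden state
    probs, loss = [], 0.0
    # Forward pass: the total loss is the sum of the per-step losses.
    for t in range(T):
        a.append(np.tanh(Waa @ a[-1] + Wax @ xs[t] + ba))
        p = softmax(Wya @ a[-1] + by)
        probs.append(p)
        loss += -np.log(p[targets[t]])
    # Backward pass: go through the time steps in reverse order.
    grads = [np.zeros_like(w) for w in params]
    dWax, dWaa, dWya, dba, dby = grads
    da_next = np.zeros_like(a[0])  # gradient arriving from later time steps
    for t in reversed(range(T)):
        dlogits = probs[t].copy()
        dlogits[targets[t]] -= 1.0            # softmax + cross-entropy gradient
        dWya += np.outer(dlogits, a[t + 1])
        dby += dlogits
        da = Wya.T @ dlogits + da_next        # from this step's output and the future
        dz = (1.0 - a[t + 1] ** 2) * da       # back through tanh
        dWaa += np.outer(dz, a[t])            # shared weights: gradients add up
        dWax += np.outer(dz, xs[t])
        dba += dz
        da_next = Waa.T @ dz                  # pass the gradient one step back in time
    return loss, grads

# Hypothetical toy problem: 5 time steps, 3 features, 4 hidden units, 2 classes.
rng = np.random.default_rng(0)
T, n_x, n_a, n_y = 5, 3, 4, 2
xs = rng.standard_normal((T, n_x))
targets = rng.integers(0, n_y, size=T)
shapes = [(n_a, n_x), (n_a, n_a), (n_y, n_a), (n_a,), (n_y,)]
params = [rng.standard_normal(s) * 0.1 for s in shapes]

loss, grads = loss_and_grads(xs, targets, params)
# One plain gradient descent update of the shared parameters.
learning_rate = 0.1
new_params = [w - learning_rate * g for w, g in zip(params, grads)]
print(loss, loss_and_grads(xs, targets, new_params)[0])
```

Note that there is a single update per parameter after the whole backward sweep, not a separate update at each time step.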
Different Types of RNN
So far, we have seen an RNN where the length of the input sequence is equal to the length of the output sequence. However, there are several other types of RNN.
- One to one: This is the plain (vanilla) mode of processing without an RNN, with a fixed-size input and a fixed-size output. Example: Image Classification.
- One to many: We can also have a fixed-size input and get a sequence as the output. Example: Image Captioning, where we have an image as the input and we caption/describe the image.
- Many to one: This type of RNN has a sequential input and a single fixed-size output. Example: Sentiment Analysis, where we classify a given sentence as having a positive or negative sentiment.
- Many to many: This type of RNN has a sequential input as well as a sequential output. There are two variants of many to many RNNs: one where the lengths of the input and output sequences are the same, and one where they are different. An example of a many to many RNN is Machine Translation, where you read a sentence in one language and output the same sentence in another language, which may have a different number of words.
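The difference between two of these types comes down to which hidden states are used to produce outputs. The sketch below, with assumed weight names and sizes and with biases and output activations omitted for brevity, runs one recurrence and then reads it out in a many to one way and in a many to many way with equal lengths.

```python
import numpy as np

# Hypothetical sizes: 6 time steps, 3 input features, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
T, n_x, n_a, n_y = 6, 3, 4, 2
xs = rng.standard_normal((T, n_x))
Wax = rng.standard_normal((n_a, n_x)) * 0.1
Waa = rng.standard_normal((n_a, n_a)) * 0.1
Wya = rng.standard_normal((n_y, n_a)) * 0.1

# Shared recurrence: collect the hidden state after every time step.
a_t = np.zeros(n_a)
states = []
for x_t in xs:
    a_t = np.tanh(Waa @ a_t + Wax @ x_t)
    states.append(a_t)
states = np.stack(states)           # shape (T, n_a)

# Many to one: a single output computed from the last hidden state only.
many_to_one = Wya @ states[-1]      # shape (n_y,)

# Many to many (equal lengths): one output per time step.
many_to_many = states @ Wya.T       # shape (T, n_y)

print(many_to_one.shape, many_to_many.shape)  # (2,) (6, 2)
```

The variant with different input and output lengths needs a separate step that generates the output sequence, which is outside the scope of this sketch.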
Though RNNs can be extremely useful because they connect previous information to the current task, they have a few disadvantages.
The Vanishing Gradient Problem
Let’s consider two sentences.
- The dogs were barking.
- The dogs owned by Mrs. Smith realized that there were men inside the house and were barking.
Let’s say we are trying to build a language model, i.e., trying to predict the next word based on the previous ones. We, as humans, understand that in both sentences we have to use the word ‘were’ because ‘dogs’ is a plural word. In cases like the first sentence, where the gap between the words ‘dogs’ and ‘were’ is small, RNNs learn to use the past information.
However, there are also cases like the second sentence, where the words ‘dogs’ and ‘were’ are really far apart. In theory, RNNs should be able to learn to connect the two words and predict ‘were’ instead of ‘was’. In practice, however, basic RNNs struggle to learn this kind of dependency when the gap is large. Just as in deep traditional neural networks, where it is difficult for the gradient to reach the earlier layers and affect their weights, it is very difficult to backpropagate from a later time step to an early time step, because the gradient shrinks a little at every step it passes through. This is called the Vanishing Gradient Problem in RNNs. It is largely addressed by gated architectures such as GRU and LSTM, which I will explain in my next post.
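The shrinking can be observed directly by tracking how much the hidden state at time step t still depends on the initial hidden state. The sketch below makes several assumptions for illustration: a tanh cell with no inputs, random recurrent weights named Waa, and a rescaling of Waa so that its largest singular value is 0.9. Under that scaling the dependence is guaranteed to decay at least as fast as 0.9 to the power t.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, T = 8, 30

# Assumption: recurrent weights rescaled to have spectral norm 0.9.
Waa = rng.standard_normal((n_a, n_a))
Waa = Waa * (0.9 / np.linalg.norm(Waa, 2))

a_t = rng.standard_normal(n_a) * 0.1
jacobian = np.eye(n_a)   # d a<t> / d a<0>, starts as the identity
norms = []
for t in range(T):
    a_t = np.tanh(Waa @ a_t)
    # Chain rule for one step: d a<t> / d a<t-1> = diag(1 - a<t>^2) @ Waa
    jacobian = np.diag(1.0 - a_t ** 2) @ Waa @ jacobian
    norms.append(np.linalg.norm(jacobian, 2))

# The norm after 30 steps is far smaller than after 1 step.
print(norms[0], norms[9], norms[-1])
```

A gradient flowing back from step 30 to step 0 is multiplied by this same chain of factors, so the signal that would connect ‘were’ to ‘dogs’ over a long gap becomes tiny.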
RNNs also have the exploding gradient problem, where the gradient becomes really large and we can end up with NaNs in the parameters and outputs. This can be catastrophic for training, but we can generally avoid it by using techniques like Gradient Clipping.
I hope that with this post, you have understood how RNNs work and where they can be used. RNNs are really powerful, and I believe that they will always find ways to surprise you with their outputs.
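One common form of gradient clipping rescales all gradients together when their combined norm exceeds a threshold. Here is a minimal sketch of that idea; the function name clip_by_global_norm, the threshold of 5.0 and the example gradient values are all made up for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients down together if their combined norm is too large."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Leave small gradients untouched; shrink large ones to roughly max_norm.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

# Hypothetical exploding gradients for two parameters.
grads = [np.array([[30.0, -40.0]]), np.array([0.0, 0.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

# norm_before is 50.0; norm_after is very close to 5.0.
print(norm_before, norm_after)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its size is limited.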
Do leave a comment below if you have any questions or suggestions :)