Recurrent Neural Network with LSTM

Original article was published by Deep Patel on Artificial Intelligence on Medium



Starting from scratch, this blog will help you understand RNNs in the easiest way possible, along with the importance of LSTM.

Source: Author (Design from Microsoft Powerpoint)

The following blog will answer these questions:

  • Why do we need RNNs?
  • What are RNNs?
  • How do RNNs work?
  • What are the problems with RNNs?
  • What is LSTM in RNNs?

Let me begin this article with a question: which of the following sentences makes sense?

  • neural why recurrent need we network do
  • why do we need a recurrent neural network

It’s obvious that the second one makes sense, because the order of the words is preserved. So, whenever the order of the data is important, we use an RNN. RNNs in general, and LSTMs in particular, have been most successful when working with sequences of words and paragraphs, a field generally called natural language processing (NLP).

RNNs are used in the following fields (image source: Google):

  • Text data
  • Speech data
  • Classification prediction problems
  • Regression prediction problems
  • Generative models

Some famous applications of RNNs are Google Assistant, Google Translate, stock prediction, image captioning, and many more.

Generally, we don’t use RNNs for tabular datasets (CSV) or image datasets. NLP covers text processing in general, but RNNs come into the picture when the order of the words in a sentence matters.

What are RNNs?

In a traditional neural network, we assume that all inputs (and outputs) are independent of each other. But for many tasks that is a very bad idea. If you want to predict the next word in a sentence, you had better know which words came before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations.

An RNN has a “memory” that captures information about what has been calculated so far. It uses the same parameters for each input, because it performs the same task on every input and hidden state to produce the output. This keeps the number of parameters small, unlike other neural networks where every layer has its own parameters.

How do RNNs work?

Let’s suppose that a neural network has three hidden layers: the first with weights and biases w1, b1; the second with w2, b2; and the third with w3, b3. This means that these layers are independent of each other, i.e. they do not memorize the previous outputs.

Now, we need to convert this simple neural network into an RNN. This is done by assigning the same weights and biases to all the hidden layers. Doing so stops the number of parameters from growing with each layer, and the network memorizes the previous outputs by giving each output as an input to the next hidden layer.
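
To make the saving concrete, here is a small sketch that counts the parameters of three independent layers against one shared set of weights and biases. The layer width of 4 and the helper name dense_params are assumptions picked for illustration; they do not come from any particular model.

```python
# Hypothetical sizes chosen only for illustration.
hidden_size = 4   # units per hidden layer
num_layers = 3    # hidden layers (or time steps once the weights are shared)


def dense_params(n_in, n_out):
    """Weights plus biases for one fully connected layer."""
    return n_in * n_out + n_out


# Three independent layers: (w1, b1), (w2, b2), (w3, b3).
independent = num_layers * dense_params(hidden_size, hidden_size)

# One set (w, b) reused at every step, as in an RNN.
shared = dense_params(hidden_size, hidden_size)

print(independent, shared)  # 60 20
```

With shared weights the count stays at 20 no matter how many steps are added, while the independent version grows with every extra layer.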

Source: Acadglid
  • Let’s say we assign a word W1 to the 1st hidden layer at time t1.
  • Then calculate its current state using the current input and the previous state:
        h_t = f(h_(t-1), x_t)
    where h_t is the current state, h_(t-1) is the previous state, and x_t is the input.
  • We apply an activation function (tanh, for example) to it:
        h_t = tanh(W_hh * h_(t-1) + W_xh * x_t)
    where W_hh is the weight at the recurrent neuron and W_xh is the weight at the input neuron.
  • Then calculate the output of the current layer to pass it as an input to the next layer:
        y_t = W_hy * h_t
    where y_t is the output and W_hy is the weight at the output layer.
  • For the next hidden layer, we pass word W2 at time t2 and the same process occurs.
  • One can go through as many time steps as the problem requires and join the information from all the previous states.
  • Once all the time steps are completed, the final current state is used to calculate the output.
  • The output is then compared to the actual output, i.e. the target output, and the loss is calculated.
  • After this, backpropagation takes place, where the weights and biases are updated to reduce the loss.
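
The forward part of these steps can be sketched in a few lines of NumPy. This is only a minimal illustration: the toy dimensions, the random weights, and the function name rnn_forward are assumptions, biases are left out for brevity, and training (loss and backpropagation) is not shown. The names W_xh, W_hh and W_hy follow the weights listed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions, not taken from any real model.
input_size, hidden_size, output_size = 3, 4, 2
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # weight at input neuron
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # weight at recurrent neuron
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))  # weight at output layer


def rnn_forward(inputs):
    """Run a plain RNN over a sequence; return the final output and state."""
    h = np.zeros(hidden_size)                # initial state
    for x_t in inputs:                       # one input vector per time step
        h = np.tanh(W_hh @ h + W_xh @ x_t)   # same weights reused at every step
    y = W_hy @ h                             # output computed from the final state
    return y, h


sequence = rng.normal(size=(5, input_size))  # 5 time steps of made-up inputs
y, h_final = rnn_forward(sequence)
print(y.shape, h_final.shape)  # (2,) (4,)
```

Note that the loop body never changes: only the state h carries information from one time step to the next.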

Problems with RNNs

  • Vanishing Gradient
  • Exploding Gradient
  • Inability to process long sequences

The gradient computation involves repeated multiplication by the weights W. Multiplying by W at every cell has a bad effect. Think of it like this: if you have a scalar (a number) and you multiply the gradient by it over and over again, say 100 times, then the gradient explodes if the magnitude of that number is greater than 1 and vanishes towards 0 if it is less than 1.

The vanishing gradient problem is far more threatening than the exploding gradient problem, in which the gradients become very large because one or more gradient values become very high.
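
A quick sanity check of this scalar picture, using the factors 1.1 and 0.9 as assumed stand-ins for a weight slightly above and slightly below 1:

```python
steps = 100

grow = 1.1 ** steps    # factor slightly above 1
shrink = 0.9 ** steps  # factor slightly below 1

print(f"{grow:.1f}")    # 13780.6
print(f"{shrink:.2e}")  # 2.66e-05
```

Even factors this close to 1 change the result by several orders of magnitude after 100 multiplications.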

The reason why the Vanishing gradient problem is more concerning is that an exploding gradient problem can be easily solved by clipping the gradients at a predefined threshold value.
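
As a rough sketch of what clipping can look like, the snippet below rescales a gradient whenever its L2 norm exceeds a threshold. Clipping by norm is just one common variant, and the function name clip_by_norm, the threshold of 5.0 and the sample gradient are assumptions made for this example.

```python
import math


def clip_by_norm(grads, threshold):
    """Rescale grads so that their L2 norm does not exceed threshold."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= threshold:
        return list(grads)        # small gradients are left untouched
    scale = threshold / norm
    return [g * scale for g in grads]


exploding = [30.0, 40.0]          # made-up gradient with norm 50
clipped = clip_by_norm(exploding, threshold=5.0)
print(clipped)                    # same direction, norm reduced to the threshold
```

The direction of the gradient is preserved; only its length is capped, so one huge gradient cannot wreck the weights in a single update.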

Also, plain RNNs struggle to process long sequences when using the tanh or ReLU activation function. In practice, this means that an RNN may fail to predict the last word of a long sentence when that word depends on context from much earlier in the sentence.

Fortunately, there are ways to handle the vanishing gradient problem as well. Architectures like the LSTM (Long Short-Term Memory) and the GRU (Gated Recurrent Unit) can be used to deal with it.

What is LSTM in RNNs?

Long Short-Term Memory (LSTM) is a more complex kind of RNN that is designed to retain information over long sequences and largely mitigates the vanishing gradient problem. As the name suggests, it gives the network a short-term memory that can last for a long time, which makes the RNN capable of learning order dependence in sequence prediction problems. An LSTM cell uses three gates, and the weights of these gates are learned via backpropagation.

Forget Gate

This is the first gate in the cell. As the name suggests, the forget gate decides which unnecessary information to forget. For example, as soon as a full stop occurs, the forget gate may discard the previous information, since it is often of no use anymore.

The information that is no longer required for the LSTM to understand things, or that is of less importance, is removed via element-wise multiplication by a filter. This is required for optimizing the performance of the LSTM network.
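
The filter idea can be illustrated with a few made-up numbers. In a real LSTM the values fed to the sigmoid are computed from the previous hidden state and the current input using learned weights; here they are simply assumed, and the variable names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


cell_state = np.array([2.0, -1.5, 0.5])     # made-up previous cell state
gate_logits = np.array([10.0, -10.0, 0.0])  # made-up pre-activation values

forget_filter = sigmoid(gate_logits)         # each entry lies between 0 and 1
new_cell_state = forget_filter * cell_state  # element-wise multiplication

print(np.round(forget_filter, 2))
print(np.round(new_cell_state, 2))
```

A filter value near 1 keeps the entry (the first one), a value near 0 erases it (the second one), and 0.5 keeps half of it (the third one).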

Input Gate

The input gate decides what new information should be added to the cell. A sigmoid function (output between 0 and 1) decides which values to update. A tanh function then produces a vector of candidate values (in the range -1 to 1), and the sigmoid output weights how much of each candidate is added to the cell.

Output Gate

Lastly, the output gate decides what the cell outputs. The value from a sigmoid function (between 0 and 1) is multiplied by the tanh of the cell state to give the output of the cell. This output is passed on to the next time step, and the process continues until the end of the sequence.

If this whole process looks difficult to you, don’t worry: Keras has made it simple to use LSTM. Check the code for a simple sequential model with LSTM below.
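
Putting the three gates together, one time step of an LSTM cell might look like the NumPy sketch below. It follows the standard LSTM formulation rather than any specific library’s internals; the function name lstm_step, the stacked weight matrix W, and the toy dimensions are assumptions for illustration, and the weights are random instead of trained.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W has shape (4 * hidden, hidden + input)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[:hidden])                 # forget gate: what to drop
    i = sigmoid(z[hidden:2 * hidden])       # input gate: what to add
    g = np.tanh(z[2 * hidden:3 * hidden])   # candidate values in (-1, 1)
    o = sigmoid(z[3 * hidden:])             # output gate: what to expose
    c_t = f * c_prev + i * g                # updated cell state
    h_t = o * np.tanh(c_t)                  # output passed to the next step
    return h_t, c_t


rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4              # assumed toy dimensions
W = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):  # 5 made-up time steps
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The cell state c is updated only by a multiplication with the forget gate and an addition, which is a large part of why gradients survive over more time steps than in a plain RNN.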

# The code for the LSTM model in Keras
from keras.models import Sequential
from keras.layers import Dense, Embedding, Input, LSTM

# Assumed values for illustration; the original snippet leaves these undefined
top_words = 5000          # vocabulary size
max_review_length = 500   # length of each padded input sequence

# sample LSTM model
embedding_vector_length = 32
model = Sequential()
model.add(Input(shape=(max_review_length,)))               # sequences of word indices
model.add(Embedding(top_words, embedding_vector_length))   # word index -> 32-dim vector
model.add(LSTM(100))                                       # 100 LSTM units
model.add(Dense(1, activation='sigmoid'))                  # binary classification output
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

So, this was the complete theoretical part of RNNs and LSTM, with some sample code at the end.

Thanks for reading, and I will surely come up with a new deep learning blog soon. Till then, see ya!