A practical guide to RNN and LSTM in Keras

Original article was published by Mohit Mayank on Artificial Intelligence on Medium


Photo by Daniele Levis Pelusi on Unsplash


After going through a lot of theoretical articles on recurrent layers, I just wanted to build my first LSTM model and train it on some text! But the huge list of exposed parameters for the layer and the delicacies of the layer structures were too complicated for me. This meant I had to spend a lot of time going through StackOverflow and API definitions to get a clearer picture. This article is an attempt to consolidate all of those notes and accelerate the transition from theory to practice. The goal of this guide is to develop a practical understanding of using recurrent layers like RNN and LSTM, rather than a theoretical one. For a more in-depth treatment, I suggest going through this and this before reading this article. If you are ready, let’s get started!

Recurrent Neural Network

The complete RNN layer is exposed as the SimpleRNN class in Keras. Contrary to the architecture suggested in many articles, the Keras implementation is quite different but simple. Each RNN cell takes one data input and one hidden state, which is passed from one time-step to the next. The RNN cell looks as follows,

The flow of data and hidden state inside the RNN cell implementation in Keras. Image by Author.

The complete formulation of an RNN cell is,
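In symbols, with g denoting the activation function (this is a reconstruction consistent with the description below),

```latex
\begin{aligned}
h_t &= g\left(W_{hh}\, h_{t-1} + W_{hx}\, x_t + b_h\right) \\
y_t &= h_t
\end{aligned}
```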

here, h_t and h_{t-1} are the hidden states at times t and t-1, x_t is the input at time t, and y_t is the output at time t. The important thing to notice is that there are two weight matrices, W_hh and W_hx, and one bias term, b_h. Each of these matrices can be thought of as an internal one-layer neural network with output size as defined in the parameter units; the bias has the same size. y_t is just the raw h_t; Keras does not apply another weight matrix to the output here, contrary to what many articles suggest. This represents one individual RNN cell, and a sequential combination of such cells (count equal to the time-steps in the data) creates the complete RNN layer. Remember, the same weight matrices and bias are shared across all the RNN cells. Finally, we can compute the number of parameters required to train the RNN layer as follows,
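In terms of the layer’s units and the input features, the count works out to:

```latex
\text{num\_params}
= \underbrace{\text{units} \times \text{units}}_{W_{hh}}
+ \underbrace{\text{features} \times \text{units}}_{W_{hx}}
+ \underbrace{\text{units}}_{b_h}
= \text{units} \times (\text{units} + \text{features} + 1)
```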

Notice that the input is a tuple in the format (time-steps, features), and that the parameter count only depends on the features, as we share the same weights across all time-steps. This can be checked by displaying the summary of a sample model with an RNN in Keras.
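The parameter counts mentioned below (17792 and 17921) imply 128 units and 10 input features; the 8 time-steps and the final sigmoid Dense layer are assumptions for illustration. A minimal sketch:

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense, SimpleRNN

# 8 time-steps and 10 features are assumed; the RNN parameter count
# depends only on the 10 features and the 128 units, not on time-steps
model = Sequential([
    Input(shape=(8, 10)),           # (time-steps, features)
    SimpleRNN(128),                 # units = 128
    Dense(1, activation="sigmoid"),
])
model.summary()

# SimpleRNN params: 128*128 (recurrent_kernel) + 10*128 (kernel) + 128 (bias) = 17792
# Dense params:     128*1 + 1 = 129, taking the total to 17921
```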

Check out the Param # of simple_rnn_2; it’s equal to what we calculated above. The additional 129 parameters, which take the total count to 17921, come from the Dense layer added after the RNN.

We can also fetch the exact matrices and print their names and shapes by,
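A sketch under the same assumed shapes (128 units, 10 features, 8 time-steps):

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense, SimpleRNN

model = Sequential([
    Input(shape=(8, 10)),           # (time-steps, features), assumed as before
    SimpleRNN(128),
    Dense(1, activation="sigmoid"),
])

# every trainable variable of the RNN layer, with its name and shape
for weight in model.layers[0].weights:
    print(weight.name, weight.shape)
# prints the kernel (10, 128), recurrent_kernel (128, 128) and bias (128,)
# (the exact name strings vary with the Keras version)
```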

Points to note: Keras calls the input weight matrix kernel, the hidden weight matrix recurrent_kernel, and the bias simply bias. Now let’s go through the parameters exposed by Keras. While the complete list is provided, we will look at some of the relevant ones briefly.

  • The first and foremost is units, which is equal to the output size of both kernel and recurrent_kernel. It is also the size of the bias term and of the hidden state.
  • Next, we have activation, which defines the g() function in our formulation. The default is “tanh”.
  • Then we have the {*}_initializer, {*}_regularizer and {*}_constraint parameters, one each for kernel, recurrent_kernel and bias. These can be ignored if you are not sure about them, as the default values are good enough.
  • use_bias is a boolean parameter which turns the bias term on or off.
  • dropout and recurrent_dropout are used to apply dropout probabilities to kernel and recurrent_kernel respectively.
  • return_sequences is a boolean parameter. When it’s “True”, the output shape of the RNN layer is (time-steps, units), and when it’s “False” the output shape is just (units). This means that if it’s turned on, we return y_t from every time-step in the output, and if it’s off we only return one y_t (that of the last time-step). An additional caveat: don’t forget to add a TimeDistributed layer or a Flatten layer after an RNN with return_sequences turned on, before you add a Dense layer.
  • go_backwards is of boolean type, and when it’s “True” the RNN processes the data in reverse order. The default is “False”.
  • return_state is of boolean type, and when “True” it returns the last state in addition to the output. The default is “False”.
  • stateful is an important parameter. When set to “True”, Keras carries the same hidden state across batches for the same sample index. Understand it like this: we train our model for multiple epochs, where one epoch is one pass over the complete data. Each epoch contains multiple batches, which in turn contain multiple samples, i.e. the individual data points. Usually, the state of the RNN cell is reset after each sample in a batch. But if we have prepared the data in such a format that, across multiple batches, the samples at a particular index are just extensions of the same sentence, we can set stateful to “True”, and it will be equivalent to training on the whole sentence at once (as one sample). We may do this due to memory constraints, when we cannot load the complete data in one go. The default is “False”.
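As a quick illustration of return_sequences (the shapes below assume a batch of 4, 8 time-steps, 10 features and 128 units; everything else keeps its default):

```python
import numpy as np
from tensorflow.keras.layers import SimpleRNN

# a random batch: (batch, time-steps, features)
x = np.random.rand(4, 8, 10).astype("float32")

y_last = SimpleRNN(128)(x)                         # only the last y_t
y_all = SimpleRNN(128, return_sequences=True)(x)   # y_t for every time-step

print(y_last.shape)  # (4, 128)
print(y_all.shape)   # (4, 8, 128)
```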

With the basics of RNN clear, let’s look into one architecture which is frequently created with RNNs.

Deep Vertical RNNs

Stacking multiple recurrent layers on top of each other has been suggested to work better for multiple applications. This leads to a mesh-like structure: while the horizontal depth (visualize an unrolled RNN) is due to the time-steps, the vertical copies (stacking) are due to the new RNN layers. Keeping a sequence as both input and output is called Seq2Seq modelling, and it is mainly used for applications like language translation, entity tagging and speech recognition, where we have a sequence as input and a sequence as output. That said, we can also stack multiple RNNs before finally applying a fully connected Dense layer; this is an example of a sequence as input but a flattened output. A sample code is,
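A minimal sketch of such a stack, reusing the assumed (8 time-steps, 10 features) input and 128 units:

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense, SimpleRNN

model = Sequential([
    Input(shape=(8, 10)),                   # (time-steps, features), assumed
    SimpleRNN(128, return_sequences=True),  # feeds all time-steps to the next RNN
    SimpleRNN(128, return_sequences=True),
    SimpleRNN(128),                         # last RNN keeps only the final step
    Dense(1, activation="sigmoid"),
])
model.summary()
```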

It’s pretty simple, as we have just added two new RNN layers to the previous code. But notice that we set return_sequences to “True” on an RNN layer if we want to stack another RNN on top of it. This is because the next RNN expects time-distributed input, and the output of each time-step of the previous RNN becomes the input to the upper RNN at the same time-step. Here, while the trainable parameters for the 1st RNN remain the same as suggested before, the 2nd and 3rd RNNs have different parameter counts because the input size to these RNNs is 128. This makes the trainable parameters for each of the next two RNNs equal to,
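With the assumed 128 units, the input features to each stacked RNN equal the 128-dimensional output of the RNN below, so the count per stacked layer works out as:

```python
units = 128
features = 128  # input features = output size of the RNN below

# units*(units + features + 1), same formula as before
params = units * units + features * units + units
print(params)  # 32896
```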


Long Short Term Memory (LSTM)

Moving on to LSTMs, there are a bunch of very good articles on them, like this and this. I would suggest having a look at them before moving further. Similar to the issue with RNN, the implementation of LSTM is a little different from what is proposed in most articles. The main difference is that, instead of concatenating the input and the previous hidden state, we have different weight matrices which are applied to both before passing them to the 4 internal neural networks in the LSTM cell. This means we have doubled the number of matrices required (in reality it doubles the dimensions, but more on this later). The 4 matrices which are multiplied with the input are called kernel, and the 4 which are multiplied with the previous hidden state are called recurrent_kernel. To understand this better, let’s look at the formulation,
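Following the Keras layout described above (σ is the recurrent_activation, sigmoid by default; g is the activation, tanh by default; ⊙ is element-wise multiplication), a standard way to write the LSTM cell is:

```latex
\begin{aligned}
i_t &= \sigma(W_{xi}\, x_t + W_{hi}\, h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf}\, x_t + W_{hf}\, h_{t-1} + b_f) \\
\tilde{c}_t &= g(W_{xc}\, x_t + W_{hc}\, h_{t-1} + b_c) \\
o_t &= \sigma(W_{xo}\, x_t + W_{ho}\, h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot g(c_t)
\end{aligned}
```

Here the four W_x* matrices together form the kernel, and the four W_h* matrices form the recurrent_kernel.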

Here, if you observe, we have a total of 8 weight matrices, and assuming each has the same size, we can say that we are, in a way, doing the same operations we did in RNN, but now 4 times over. Hence the number of trainable parameters can now be calculated by,
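Since each gate has its own kernel, recurrent_kernel and bias slice of the same size as in the RNN case:

```latex
\text{num\_params}
= 4 \times \big(\text{units} \times \text{units} + \text{features} \times \text{units} + \text{units}\big)
= 4 \times \text{units} \times (\text{units} + \text{features} + 1)
```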

And switching from RNN to LSTM is as easy as replacing the respective function call. This can be seen in the following code,
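A sketch under the same assumed input shape (8 time-steps, 10 features) and 128 units; only the layer class changes:

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense, LSTM

model = Sequential([
    Input(shape=(8, 10)),           # (time-steps, features), assumed
    LSTM(128),                      # the only change: SimpleRNN -> LSTM
    Dense(1, activation="sigmoid"),
])
model.summary()

# LSTM params: 4 * (128*128 + 10*128 + 128) = 71168
```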

Match the Param # mentioned for lstm_1 with what we computed.

We can again extract all of the weights from the model by,
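Assuming the same 128 units and 10 input features, a sketch:

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense, LSTM

model = Sequential([
    Input(shape=(8, 10)),           # (time-steps, features), assumed
    LSTM(128),
    Dense(1, activation="sigmoid"),
])

# every trainable variable of the LSTM layer, with its name and shape
for weight in model.layers[0].weights:
    print(weight.name, weight.shape)
# prints the kernel (10, 512), recurrent_kernel (128, 512) and bias (512,)
# (the exact name strings vary with the Keras version)
```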

Here, note that all 4 kernel matrices and all 4 recurrent_kernel matrices are stored in one single monolithic matrix each (concatenated on the column axis), hence the dimension is 128*4=512. The same is true for the bias term. Also, nearly all of the parameters discussed for RNN are applicable here. One additional caveat is the recurrent_activation parameter, which has a default value of “sigmoid” and is applied to the input, forget and output gates, as suggested above in the formula. This leaves the actual activation, which is applied to the cell state and the hidden state (with a default value of “tanh”), as also suggested in the formula.


We have tried to cover some of the basic topics required to connect the theory and the practice of recurrent layers in Keras. A complete guide with all the intrinsic details would be too much for a single article, and I think there are a lot of materials out there which explain the theory very well. What I really missed were some notes which connect the formulae I saw in the articles with what is really implemented in Keras, along with some additional practical details. Hope this was of help!


All of the code from the article has been uploaded here.

For more articles like this visit my website and connect with me @ linkedin.