Source: Deep Learning on Medium

The goal of this article is to provide an overview of applying LSTM models and the unique challenges they present. I will explain some of the most important (and confusing) parameters, how to prepare data for an LSTM model, and the difference between stateful and stateless LSTM models. This discussion will revolve around the application of LSTM models with Keras. However, this article won’t go into detail about how LSTM models work in general; to fully understand it you should already be familiar with LSTMs. For a general background, the post by Christopher Olah is a fantastic starting point.

A key reason for using LSTM/RNN models is that they operate over sequences of vectors; sequences in input, output or both. Thus, before applying an LSTM model it is necessary to know what your data looks like and what you are trying to achieve. Here are some examples of potential RNN schemes:

One example of the “one to many” case is reading a multi-digit number from a given image: the input is a fixed-length vector (the image) while the output consists of several numbers. In the “many to one” case the input can be a sequence of timesteps, each containing information about e.g. stock market prices. We can use this information to predict the next timestep and the corresponding stock market price. In this article I will focus on the “**many to one**” and “**many to many**” cases.

After determining the structure of the underlying problem, you need to reshape your data such that it fits the input shape Keras’ LSTM layer expects, which is:

*[samples, time_steps, features]*.

- *samples*: the number of training examples.
- *time_steps*: the number of previous steps you want to take into account when predicting the current step. For instance, we can use *time_steps* = 3 with ‘2019–03–24’, ‘2019–03–25’ and ‘2019–03–26’ to predict the answer of timestep ‘2019–03–27’ (see Figure 2). Furthermore, this parameter defines the number of internal/hidden loops in an LSTM model, i.e. the number of green boxes (Figure 1) equals the *time_steps* parameter, and thus how often the inner/cell state is updated. The only exceptions are the “one to many” and “many to many” cases with more hidden loops than inputs and outputs (for such cases you might use a ‘RepeatVector’ layer; this technique is frequently used for encoder-decoder LSTM models, see Brownlee for more information).
- *features*: the number of features in each input timestep.
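To make this shape concrete, here is a minimal sketch (all numbers made up) of what an LSTM-ready array looks like:

```python
import numpy as np

# Toy illustration of the 3D input shape:
# 100 samples, each a window of 3 timesteps with 3 features per step.
X = np.random.rand(100, 3, 3)
print(X.shape)     # (100, 3, 3) -> [samples, time_steps, features]
print(X[0].shape)  # (3, 3): one sample = 3 timesteps x 3 features
```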

**How to reshape the data?**

Let’s assume we have a dataset with three different features and a (binary) label column as shown in the following (All numbers were randomly generated and are for illustrative purposes only):

For the sake of the argument we use 100 samples, each containing three features and one label. Thus, when separating the dataset into features and labels we have a feature dataset X with shape [100, 3] and a label dataset y with shape [100,].

Think about the features f1, f2 and f3 as information about the weather, like pressure, humidity and temperature. The label column contains information such as whether it is raining or not. Furthermore, the rows are ordered by ascending time. In order to predict whether it is raining or not at the next timestep (2019–03–28), it would be useful to consider previous timesteps. Thus, in this context the input is a sequence of timesteps while the output can be either a single value (“many to one”) or a sequence (“many to many”). Keep in mind that we now must reshape the feature dataset such that it fits the shape of the LSTM architecture (again, it’s [*samples, time_steps, features*]). Obviously, the number of samples is 100 and the number of features is 3, but what about the *time_steps* parameter?

- What value should we choose for *time_steps*? 5? 10? 100? Unfortunately, there is no general answer; as usual in Machine Learning, it depends on the dataset. So just try different combinations and figure out what fits best for your context.
- For the sake of the argument let’s choose *time_steps* = 3. How can we reshape the dataset such that it has the form [100, 3, 3]? Currently, we have 100 × 3 = 300 values, so we can’t simply reshape the dataset to 100 × 3 × 3 = 900 values. But we can achieve this by shifting the data, like Brownlee did in his article. By doing so, we can consider the features/labels of x previous timesteps when predicting the label of the current timestep. Here are some examples, along with a few additional ideas:

**a) Shifting the features:** (Largely described by Brownlee) We use the features of the previous *time_step* (e.g. 3) timesteps as well as the features of the current timestep to predict the current label. It looks like:
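This kind of feature shifting can be sketched in code as follows (a minimal numpy sketch; the helper name and the toy data are my own, not from the article). Note that the first few rows have no full history and are dropped, so the number of samples shrinks slightly:

```python
import numpy as np

def shift_features(X, y, time_steps):
    # Hypothetical helper: each sample stacks the features of the
    # previous `time_steps` rows plus the current row (window length
    # time_steps + 1); the target is the current row's label.
    Xs, ys = [], []
    for i in range(time_steps, len(X)):
        Xs.append(X[i - time_steps:i + 1])  # rows i-3 .. i for time_steps = 3
        ys.append(y[i])
    return np.array(Xs), np.array(ys)

X = np.arange(100 * 3).reshape(100, 3)      # toy features f1, f2, f3
y = np.arange(100) % 2                      # toy binary labels
Xw, yw = shift_features(X, y, time_steps=3)
print(Xw.shape)  # (97, 4, 3): the first 3 rows lack a full history
print(yw.shape)  # (97,)
```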

**b) Shifting the labels**: We use the labels of the previous *time_step* (e.g. 5) timesteps to predict the current label. It looks like:

**c) Shifting features and labels**: We use the features and labels of the previous *time_steps* (e.g. 3) timesteps to predict the current label. Important: when we choose this shifting method, we can’t include the features of the current timestep, because the *Input Shape* must be the same for every timestep. We can’t provide a label for the current timestep, since that is exactly what we want to predict; so while each previous timestep would provide x features and 1 label, the current timestep could only provide x features and no label, resulting in a differing amount of information across timesteps. Nevertheless, you can use this shifting method, which looks like:
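In code, this variant can be sketched like so (again a minimal numpy sketch with a hypothetical helper name and toy data; the label is simply appended to the feature columns of the past rows):

```python
import numpy as np

def shift_features_and_labels(X, y, time_steps):
    # Hypothetical helper: each sample uses the features AND labels of
    # the previous `time_steps` rows; the current row is excluded,
    # because its label is the prediction target.
    Xy = np.column_stack([X, y])            # append the label as an extra column
    Xs, ys = [], []
    for i in range(time_steps, len(X)):
        Xs.append(Xy[i - time_steps:i])     # rows i-3 .. i-1 for time_steps = 3
        ys.append(y[i])
    return np.array(Xs), np.array(ys)

X = np.arange(100 * 3).reshape(100, 3)
y = np.arange(100) % 2
Xw, yw = shift_features_and_labels(X, y, time_steps=3)
print(Xw.shape)  # (97, 3, 4): 3 past timesteps, 3 features + 1 label each
```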

I just want to mention that if we choose *time_steps* = 1 we get a “one to one” case, like the leftmost diagram in Figure 1. In such a case the LSTM model is almost like a feedforward neural network; the only difference is the math in the hidden layer. In the LSTM model we still have the Forget Gate, Update Gate and Output Gate, but because *time_steps* = 1 there is no repeating structure, so the carried-over inner state and the multiple outputs drop away.

**What’s happening in Keras’ LSTM layers?**

Now, let’s take a closer look at what happens in a Keras LSTM layer for a “many to one” or “many to many” problem. Recall the information Christopher Olah provided in his blog: “*The key to LSTMs is the cell state […]. It runs straight down the entire chain.*” Thus, for each chain/sequence we have one cell/inner state. In Keras there is an important difference between **stateful** (*stateful=True*) and **stateless** (*stateful=False*, **default**) LSTM layers. In a stateless LSTM layer, a batch has x (the batch size) inner states, one for each sequence. Let’s assume we have 30 samples and choose a batch size of 10. For each sequence/row in the batch we have one inner state, leading to 10 inner states in the first batch, 10 in the second batch and 10 in the third batch (see Figure 3). Critically, the inner state and all the outputs of a row/sequence are deleted once that batch has been processed.

In a stateful LSTM layer we don’t reset the inner state and the outputs after each batch. Rather, we reset them after each epoch, which means that we use and update **one internal state per sequence across multiple batches**. Let’s assume we have 30 samples. Each sample contains the information of the last 10 timesteps, each having two features; thus, one sequence/row has 20 values in total. How can we use one sequence across multiple batches? By defining the *batch_input_shape(batch_size, time_steps, features)* in the LSTM layer like:

model.add(LSTM(50, batch_input_shape=(1, 1, 2), stateful=True))

By choosing (*1, 1, 2*) as the batch input shape, every batch contains exactly one timestep of a single sequence: batch size *1*, *1* timestep, *2* features. Processing one sequence of 10 timesteps (10 × 2 = 20 values) therefore takes 10 batches: the first batch contains the first timestep (t-9) of the first sequence, the second batch contains the second timestep (t-8) of the first sequence, and so on. While a stateless LSTM would reset the inner state between the batches, we **pass** the inner state and the outputs from one batch to the next. Why? Because we want to keep the information across multiple batches (remember: the goal of an LSTM model is to connect previous information to the present task). See Philippe Remy for more information. When should we use a stateful LSTM?

1. **Computational constraints**: For example, when it comes to video analysis, we frequently deal with frames on the order of 1000×1000×3 values. Additionally, a sequence can easily cover the last 1000 frames, leading to an overall number of 1000×1000×3×1000 = 3,000,000,000 values per sequence! Because of the memory pressure caused by such a sequence, we would like to split it into multiple parts.

2. **Online Machine Learning**: When it comes to real-time analysis of data in connection with incremental learning, we don’t want to reset our inner state. Rather we would like to keep and update it continuously whenever a new input is collected.

The advantage of stateless LSTM models is that they are more performant: a stateful LSTM model has to be trained manually, one epoch at a time, resetting the inner state after each epoch with *reset_states()*.
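Such a manual training loop can be sketched roughly as follows (a toy sketch with made-up data and layer sizes; for brevity each batch here holds one full 10-step sequence, and the state is reset only between epochs):

```python
import numpy as np
from tensorflow.keras.layers import LSTM, Dense, Input
from tensorflow.keras.models import Sequential

# Toy data matching the running example: 30 sequences of 10 timesteps
# with 2 features each, plus binary labels (all values are made up).
X = np.random.rand(30, 10, 2)
y = np.random.randint(0, 2, size=(30, 1))

model = Sequential([
    Input(batch_shape=(1, 10, 2)),  # stateful layers need a fixed batch size
    LSTM(8, stateful=True),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Train one epoch at a time; the inner state is only cleared manually.
for epoch in range(3):
    model.fit(X, y, epochs=1, batch_size=1, shuffle=False, verbose=0)
    model.reset_states()  # reset the inner state after each epoch
```

Note that `shuffle=False` matters here: a stateful layer assumes consecutive batches continue each other in time, so the sample order must be preserved.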

Let’s look at one iteration over the first sequence/row of an LSTM layer with *time_steps* = 5.

After the 5th Step we would either reset the inner state and all the outputs (if *stateful=False*) or we would keep the inner state and all the outputs (*if stateful=True*) and we would continue with the next sequence.

Furthermore, we need to define what the output should look like. The *return_sequences* argument can be used either to return a list containing one output for each input timestep (*return_sequences=True*, “many to many”) or to return just the last output (*return_sequences=False*, **default**, “many to one”). In the first case the output shape is (*batch_size, time_steps, units*), while in the second case the *time_steps* dimension is dropped, leading to the output shape (*batch_size, units*).
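As a quick sanity check, the two settings can be compared via the resulting output shapes (a minimal sketch; the sequence length, feature count and layer size are arbitrary):

```python
from tensorflow.keras.layers import LSTM, Input
from tensorflow.keras.models import Sequential

# "many to many": one output per input timestep
m2m = Sequential([Input(shape=(5, 3)), LSTM(32, return_sequences=True)])
# "many to one": only the output of the last timestep
m2o = Sequential([Input(shape=(5, 3)), LSTM(32, return_sequences=False)])

print(m2m.output_shape)  # (None, 5, 32) -> (batch_size, time_steps, units)
print(m2o.output_shape)  # (None, 32)    -> (batch_size, units)
```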

**What about the batch size?**

There is no difference between choosing the batch size in an LSTM model and any other type of neural network. From a computational point of view, it is more efficient to update the weights in a neural network after a minibatch of examples than after each example. For the explanation and other advantages, see Deep Learning by Goodfellow, Bengio and Courville.

**Conclusion**

LSTM models can be overwhelming in the beginning. The theory is tough, they have many different parameters (even more than I have described in this article) and you can’t just throw your data into a model without carefully considering the shape of the data. Think about a reasonable number of time steps, test different parameters and ask yourself what your output should look like. You should clarify in the beginning whether you need a stateful LSTM model or whether you can modify your data such that a stateless LSTM model can be applied.