Time Series Analysis using ARIMA and LSTM(in Python and Keras)-Part2

Original article was published by Parijat Bandyopadhyay on Deep Learning on Medium

To convert the RNN network in Fig-33a to an LSTM, we need to convert the RNN-cells in the hidden layer with LSTM cells (refer fig-34a). An RNN cell is made up of only a tanh, NLF but an LSTM cell has σ(*3), tanh(*2), Multiplication (*3), and Addition(*1), and that forms four gates, namely, Forget, Learn, Remember, and Use.

ht is the short-term memory and Ct is the long-term memory.

As shown in Fig-34b, when Ct-1 and ht-1 enter an LSTM cell, the forget-gate forgets a few things, and the learn-gate learns a few things, and then the remember and use gates get updated using both the forget and learn gate. The use gate supplies the short-term memory, and remember-gate supplies the long-term memory to the LSTM network at the time, t. The mathematical construct of the LSTM allows to carry a long-term memory and does not suffer from vanishing gradient problem during backpropagation. I am not including the equations to keep the article less math-heavy and intuitive. The Fig-34c shows the box model of an LSTM.

Now, let’s apply LSTM to the Airplane dataset. It would be pertinent to mention here that for the LSTM analysis no special consideration is required for considering the seasonality.

Let’s split the dataset into train and test(similar to the ARIMA analysis)

df_train = df[:-24]
# only taking the NOP (No.of passangers)column

Next, let’s perform the feature scaling on the train data and see the data after scaling

from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0,1))
training_scaled = sc.fit_transform(trainset)
fig, (ax0,ax1) = plt.subplots(ncols=2, figsize=(20,8))
ax0.set_title('Original Distributions')
ax1.set_title('Distributions after MinMax scaler')
sns.kdeplot(df_train['Nop'], ax=ax0)
sns.kdeplot(pd.DataFrame(training_scaled,columns=['Nop'])['Nop'], ax=ax1)
Fig35: Train data distribution before and after scaling

It’s good to see that the data distribution is still intact after scaling.

We are getting the training_scaled array at the end of fit and transform operations of the feature scaling.

Now let’s import Tensorflow and Keras.

import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout

I have kept the LSTM kernel structured to run it in both script and notebook mode and do more experiments with the sliding window. We will discuss the window-size experiment a bit later.

window_max=4 # is the sliding window size  
# Tune this Hyper-parameter # It's a monthly dataset

What do I mean by the sliding window? Let’s discuss:

Our univariate time series data is as shown below:


where tn=nth time dot and xn=the value of the variable(in our case NOP) at the nth time dot

Now to train the LSTM algorithm, the data should be transformed as below:






Now, x_train =[(x1,x2,x3),(x2,x3,x4),(x3,x4,x5),…………,(xn-3,xn-2,xn-1)]


Here, the window size is 4, and we are sliding it with a stride of 1. In a window, the first 3 elements go to the x_train, and the last one goes to the y_train. We will feed(x1,x2,x3) to LSTM and try to predict x4; we will feed(x2,x3,x4) to LSTM and try to predict x5; we feed (xn-3,xn-2,xn-1) to predict xn. In each epoch(or mini-batch), the algorithm calculates error with the predicted and actual(y_tarin) values and, using gradient descent(or mini-batch gradient descent), does backpropagation to update the Wx, Wy, and Wh, to minimize the error in the next epoch( or mini-batch). Here we will use a mini-batch size of 32 and backpropagate after each mini-batch to update W’s.

We are building x_train and y_train from the training_scaled array.

# Preparing the training data for LSTM

x_train = []
y_train = []
# %% [code]
for i in range(window,len(df_train)):
x_train.append(training_scaled[i-window:i, 0])
x_train,y_train = np.array(x_train),np.array(y_train)

Furthermore, note the dimension of x_train:

x_train.shape[0]=[len(training_scaled)-window size+1](rows), and

x_train.shape[1]=[window size](cols)

The size of the y_train: x_train.shape[0] x 1(rows x cols)

There is a reason for discussing the dimension here, the LSTM takes the input in the format of a 3D matrix. We got a 2D train matrix. Let’s convert it to 3D:

x_train = np.reshape(x_train, (x_train.shape[0],x_train.shape[1],1))

Now the shape is:( x_train.shape[0],x_train.shape[1],1)

Here, the present window size is 4. So, we predict each 4th-month NOP value with the help of three previous months’ NOP values. You can increase the window_max variable and register MAPEs of the test and train and decide on the window size as a larger window size means higher memory retention for LSTM. After a certain window size, it may not behave as expected. Just one more thing to mention here:


While running the for-loop (during the window-experiment), we will train differently in each run and forget everything except the desired MAPE variable and the loop-counter. The above code chunk will release the memory in each run.

Now, let’s get into the neural net’s backbone and make it stronger by performing some hyper-parameter tuning. First, let’s build the neural net.

# %% [code]
regressor = Sequential()
regressor.add(LSTM(units = 50,return_sequences = True,input_shape = (x_train.shape[1],1)))
# %% [code]
regressor.add(LSTM(units = 50,return_sequences = True))
# %% [code]
# regressor.add(LSTM(units = 50,return_sequences = True)) # Activate it to see effect
# regressor.add(Dropout(0.2))
# %% [code]
regressor.add(LSTM(units = 50))
# %% [code]
regressor.add(Dense(units = 1))
# only one output for regression value with NO Activation Function
# %% [code]

Though I am assuming a basic knowledge of the FFNN as a prerequisite, I want to discuss a few things here. Let’s browse the above code block from the top. The first LSTM hidden layer has 50 LSTM cells. We are mentioning the input shape by providing only the last two dimensions of the x_train. The most important thing to note- we are adding/stacking 3 more LSTM layers by supplying return_sequences=True. Note, the last LSTM layer have a default return_sequences=False. Adding a 20% dropout for each layer as a precaution to the overfitting. The second last LSTM layer is disabled for your experiment.

Lastly, we attach a dense layer without any activation function and with a single output as its regression problem. The .summary() function summarizes the network info.

35a: NN Summary

Now, the next step is to introduce the optimizer, declare an initial value for the learning rate(a suitable value for the learning rate, lr, to be found by hyperparameter tuning during fitting), and compile the NN model. Adam is used as the optimizer here.

Use the bcode block below for lr tuning using Keras learning-rate -schedular function. The initial value is 1e-07, and for subsequent batches, it gets recalculated and passed to the optimizer via the lr_schedule variable.

# %% [code]
# lr_schedule = keras.callbacks.LearningRateScheduler(
# lambda epoch: 1e-7 * 10**(epoch / 20))
# activate this code during lr tuning# %% [code]
# opt = keras.optimizers.Adam(lr=1e-7)
# use this code during lr tuning

Now the model needs to be compiled and to be supplied with the optimizer, loss type, and desired metrics(MAPE in our case)

regressor.compile(optimizer = opt, loss = 'mse',metrics=['mae','mape'])# %% [code]
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='loss',mode='min',patience=20)
# %% [code]
mc = tf.keras.callbacks.ModelCheckpoint('best_model.h5', monitor='loss', mode='min', verbose=0, save_best_only=True)
# %% [code]
hist = regressor.fit(x_train, y_train, epochs=150, batch_size=32,callbacks=[mc, lr_schedule,early_stopping])

I have also used Keras, early-stopping function to terminate the fitting process if there is no change in loss values in more than 20 iterations. It waits till 20 iterations before termination.

The model-checkpoint function saves the best model with the least loss, like the grid search.

The early_stopping,lr_schedule, and model checkpoint are passed as a list to the callbacks parameter during the fitting.

Lastly, train data, epochs, and mini-batch size are passed to the .fit() method.

We can get all the metrics for train and validation(test here) after fitting as below:


We can plot some metrics: the epoch vs loss plot.

Epochs vs Loss Plot

When tuning lr, we can also plot lr vs loss.

%% [code]
##use this part during lr design only
plt.semilogx(hist.history["lr"], hist.history["loss"])

We can keep a window size of 2, 4, and 8 and compare the best learning rate values in each case.

for window in range(4,window_max+1):
Fig36: loss vs lr values (left:window-size=8, right:window-size=4)

I have experimented for 4 and 8 and found a stable region between 1e-03 and 1e-02 for both the cases. Clearly, the lr value depends on the window-size. I kept a window-size of 4 as it gives a wide stable region from 1e-03 to 1e-01. I froze lr to a value of 0.5e-02(this will give bad result for a window size of 8, and that’s my intention.)

Replace the corresponding code with the below code once the best lr value is decided. You can keep both and disable one of those.

opt = keras.optimizers.Adam(lr=0.5e-2)
hist = regressor.fit(x_train, y_train, epochs=100, batch_size=32,callbacks=[mc,early_stopping],verbose=0)

Now, compile and fit the model for the best lr value and a window size of 4.

After getting the final fitted model, let’s check the train MAPE.

# %% [code]
from keras.models import load_model
# %% [code]
# %% [code]
prediction_train = sc.inverse_transform(prediction_train)
# %% [code]
# %% [code]
# %% [code]
# %% [code]
# %% [code]
#metrics= regressor.evaluate(x_train,y_train)
# %% [code]
# %% [code]

In the above code block, we import the best model, ‘best_model.h5’, to predict with x_train. We have got the prediction as prediction_train, and inverse transformed both prediction_train and y_train and found a MAPE of 8.756%

# Error Function
def errors(actual,prediction):
error=m.result().numpy() # MAPE
error1=n.result().numpy() # MAE
return ({'MAE':error1 ,'MAPE':error})
Fig37: Actual and Prediction plot for the train data

Fig37 shows the actual and prediction plot for the train data.

The next step is to check the validation MAPE.

Let’s prepare the test data compatible for supplying it to the LSTM.

We are taking a window-size of data from the rear of the train set and attaching it to the beginning of the test set to predict the first test data point based on the train data’s last three values.

# %% [code]

Let’s plot the test data.

Fig38: Raw Train and Test data
Fig38a:(Right: Raw Test data, Left: Test data prepared for LSTM)

The rest of the process is the same as the train data.

# %% [code]
# %% [code]
testset = testset.reshape(-1,1)
# %% [code]
testset_scaled = sc.transform(testset)
# %% [code]
# %% [code]
for i in range(window,len(inputs)):
x_test,y_test = np.array(x_test),np.array(y_test)
# %% [code]
x_test = np.reshape(x_test, (x_test.shape[0],x_test.shape[1],1))
# %% [code]
prediction_test = regressor.predict(x_test)
# %% [code]
prediction_test = sc.inverse_transform(prediction_test)
# %% [code]
y_test = sc.inverse_transform(y_test.reshape(-1,1))
# %% [code]

Let’s see the validation plot(Fig39).

Fig39: Validation Plot

The validation MAPE for LSTM is 8.763%. It’s much better than the MAPE of 14.93% we obtained from the ARIMA model but the Auto Arima technique is a much faster approch to create a baseline result.

Now, it’s time to do the window-size variation experiment (keeping the best value for the lr constant at 0.5e-02). Let’s run the loop for a range(1,13) and plot the result.

plt.plot(*zip(*trainError_MAPE),label='Train Error')
plt.plot(*zip(*testError_MAPE),label='Test Error')
plt.legend(['Train Error', 'Test error'], loc='upper left')
plt.xlabel('window size')
Fig40: Window-size vs MAPE plot

We can observe from Fig40 that in the window-size(WS) zone 1–4, the pattern is desired, i.e., validation loss> training loss. Again in zone 6–8, it’s consistent and follows the desired pattern, but from WS=9, it shoots up (though the desired pattern is followed). Both WS=5 and 10 shows similarity though the magnitude is different. At WS=12, the training loss is shallow, and validation loss is very high. Maybe it’s overfitting. These results perfectly justify the plot in Fig36, i.e., we get different stable lr regions for different window sizes and should choose optimized lr values accordingly and then only decide on the final window size.

I hope you liked it, and thanks for your time.


Udacity Free Course: Intro to Tensor Flow for Deep Learning