Original article was published by Bob Rupak Roy on Deep Learning on Medium

Hi, how are you doing? I hope you are doing great.

Today we will start with LSTM, a powerful type of neural network designed to handle sequence and time series data.

**Long Short-Term Memory (LSTM)** is a type of **Recurrent Neural Network (RNN)** used in deep learning whose architecture is designed to capture patterns in sequential data. The benefit of this type of network is that it can learn and remember over long sequences and doesn’t rely on a pre-specified window of lagged observations as input.

In Keras this is referred to as stateful, and it involves setting the `stateful` argument to `True` in the LSTM layer.

**What is LSTM in brief?**

It is a recurrent neural network that is trained using backpropagation through time and is designed to overcome the vanishing gradient problem.

Instead of neurons, **LSTM networks have memory blocks that are connected through layers.** Each block contains three non-linear gates, which make it smarter than a classical neuron, and a memory for sequences. The three gates are:

**a.) Input Gate:** decides which values from the input are used to update the memory state.

**b.) Forget Gate:** decides what information to throw away from the block.

**c.) Output Gate:** decides what to output based on the input and the memory state.

Each LSTM unit is like a mini state machine that uses a “**memory**” cell that can maintain its state over a long time, and the gates of the units have weights that are learned during training.

There are tons of articles on the internet about the workings of LSTM, even the math behind it. So here I will concentrate on a quick, practical implementation of LSTM for our day-to-day problems.
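To make the three gates concrete, here is a minimal NumPy sketch of one forward step of an LSTM block, using the commonly cited LSTM equations. The function name `lstm_step`, the weight layout (`W`, `U`, `b`) and the random weights are assumptions made for illustration only; Keras stores and initializes its weights in its own way, and in practice the weights are learned during training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One forward step of an LSTM block (illustrative layout, not Keras internals).

    W: (4*units, n_features), U: (4*units, units), b: (4*units,)
    """
    units = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])   # input gate: what to write to memory
    f = sigmoid(z[1 * units:2 * units])   # forget gate: what to keep from old memory
    g = np.tanh(z[2 * units:3 * units])   # candidate values for the memory
    o = sigmoid(z[3 * units:4 * units])   # output gate: what to expose as output
    c = f * c_prev + i * g                # new memory (cell) state
    h = o * np.tanh(c)                    # new output (hidden) state
    return h, c

# Hypothetical sizes and random weights, purely for demonstration.
rng = np.random.default_rng(0)
units, n_features = 4, 1
W = rng.normal(size=(4 * units, n_features))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
for x_t in [0.1, -0.3, 0.5]:              # a short made-up input sequence
    h, c = lstm_step(np.array([x_t]), h, c, W, U, b)
print(h)
```

Note how `h` and `c` are carried from one step to the next: that carried state is the memory the gates read from and write to.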

Let’s get started!

First is the data pre-processing step, where we have to structure the data for supervised learning, that is, into X and Y format.

In simple words, supervised learning identifies the strength and direction of the relationship (a positive or negative impact; the derived values are called the quantification of impact) between one dependent variable (Y) and a series of independent variables (X).

For this example we have retail sales time series data recorded over a period of time.

As you know, supervised learning requires X and Y, the independent and dependent variables, for the algorithm to learn from, so we will first convert our data into that format.

We will put the sales data at time t in the first column, and the second column will hold the next month's (t+1) sales data, which is what we want to predict. Remember the X and Y format: we use X, the independent variable, to predict Y, the dependent variable.

The code below converts a time series into a supervised learning format. And yes, `df.fillna(0, inplace=True)` replaces NaN values with 0.

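    # supervised learning function
    from pandas import DataFrame, concat

    def timeseries_to_supervised(data, lag=1):
        df = DataFrame(data)
        # shifted copies hold the previous observations (the X columns)
        columns = [df.shift(i) for i in range(1, lag+1)]
        # the unshifted series is the value to predict (the Y column)
        columns.append(df)
        df = concat(columns, axis=1)
        # the shift leaves NaN in the first rows; replace them with 0
        df.fillna(0, inplace=True)
        return df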

Here is what our sales data will look like after transforming it to a supervised learning format.
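As a small illustration of the transformation, the sketch below applies the same shifting logic to four made-up sales figures. The numbers and the column labels `X (t-1)` and `y (t)` are assumptions for demonstration only and are not taken from ‘sales_year.csv’.

```python
from pandas import DataFrame, concat

def timeseries_to_supervised(data, lag=1):
    df = DataFrame(data)
    columns = [df.shift(i) for i in range(1, lag + 1)]
    columns.append(df)
    df = concat(columns, axis=1)
    # label the columns so the roles are visible (labels are illustrative)
    df.columns = ["X (t-%d)" % i for i in range(1, lag + 1)] + ["y (t)"]
    return df.fillna(0)

toy_sales = [100, 120, 90, 150]   # made-up monthly sales
supervised = timeseries_to_supervised(toy_sales, lag=1)
print(supervised)
```

Each row pairs the previous month's value (X) with the current month's value (y). The first row has no previous month, so its X is the 0 that `fillna` puts in place of NaN.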

**The next step is to make the time series data stationary.** Our ‘sales_year.csv’ data is not stationary.

This means that there is a structure in the data that depends on time. We can see there is an increasing trend in the data.

*Stationary data is easier to model and will very likely result in more skillful forecasts.*

The trend can be removed from the observations before modelling; later, the forecasts can be converted back to the original scale.

We can remove a trend by differencing the data: the observation from the previous time step (t-1) is subtracted from the current observation (t). This gives us a series of differences. pandas offers the diff() function for this; the code below uses a small helper function instead, along with a second function to invert the operation.

    from pandas import Series

    # create a differenced series
    def difference(dataset, interval=1):
        diff = list()
        for i in range(interval, len(dataset)):
            value = dataset[i] - dataset[i - interval]
            diff.append(value)
        return Series(diff)

    # invert a differenced value
    def inverse_difference(history, yhat, interval=1):
        return yhat + history[-interval]
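A quick round trip shows that differencing loses nothing except the first observation. This sketch uses made-up numbers and a list-based variant of `difference` (an assumption made to keep the example dependency-free); the logic is otherwise the same as above.

```python
def difference(dataset, interval=1):
    # list-based variant of the helper above, for illustration
    return [dataset[i] - dataset[i - interval] for i in range(interval, len(dataset))]

def inverse_difference(history, yhat, interval=1):
    return yhat + history[-interval]

toy_sales = [100, 120, 90, 150]   # made-up values
diffs = difference(toy_sales)     # month-over-month changes

# Rebuild each value from the difference and the history up to that point.
rebuilt = [inverse_difference(toy_sales[:i + 1], diffs[i]) for i in range(len(diffs))]
print(diffs)
print(rebuilt)
```

The rebuilt list matches the original series from its second value onward, which is why the forecasts can be moved back to the original scale later.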

Now it's time to normalize/scale the data.

LSTMs are a bit sensitive to the scale of the data. As with most deep learning methods, scaling the data to a range of -1 to 1 before fitting is good practice that helps the algorithm train faster and more effectively. And yes, scaling does not change the meaning of the data, because it can be inverted. We call this normalization, and we do it with the MinMaxScaler pre-processing class from scikit-learn.

The default activation function for LSTMs is the **hyperbolic tangent (tanh), which outputs values between -1 and 1**, so this is the preferred range for the time series data.

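    # transform scale
    from sklearn.preprocessing import MinMaxScaler

    # `series` is the pandas Series loaded from the CSV (see the full listing)
    X = series.values
    X = X.reshape(len(X), 1)
    scaler = MinMaxScaler(feature_range=(-1, 1))
    scaler = scaler.fit(X)
    scaled_X = scaler.transform(X)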

Again, we must invert the scaling on the forecasts to return the values to the original scale.

    # invert transform
    inverted_X = scaler.inverse_transform(scaled_X)
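To show what this scaling does numerically, here is a plain NumPy sketch of min-max scaling to [-1, 1] and its inverse. The helper names `minmax_scale` and `minmax_invert` and the sample values are assumptions for illustration; in the article the work is done by scikit-learn's MinMaxScaler.

```python
import numpy as np

def minmax_scale(X, lo=-1.0, hi=1.0):
    # map each column linearly so its minimum becomes lo and its maximum hi
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    scaled = (X - x_min) / (x_max - x_min) * (hi - lo) + lo
    return scaled, x_min, x_max

def minmax_invert(scaled, x_min, x_max, lo=-1.0, hi=1.0):
    # undo the linear map to recover the original values
    return (scaled - lo) / (hi - lo) * (x_max - x_min) + x_min

X = np.array([[20.0], [-30.0], [60.0], [10.0]])   # made-up differenced values
scaled, x_min, x_max = minmax_scale(X)
restored = minmax_invert(scaled, x_min, x_max)
print(scaled.ravel())
print(restored.ravel())
```

Because the mapping is linear, it can always be inverted as long as the minimum and maximum from the fitting step are kept, which is why the fitted scaler object is reused for the forecasts.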

It's time to deploy the LSTM.

By default, an LSTM layer in Keras maintains state between data within one batch. A batch of data is a fixed-size number of rows from the training dataset that defines how many patterns to process before updating the weights of the network. *By default, the state in the LSTM layer is cleared between batches.* **Therefore we must make the LSTM stateful.** This gives us fine-grained control over when the state of the LSTM layer is cleared, using the reset_states() function.

**The LSTM network expects the input data (X) to be in [samples, time steps, features] format.**

`X = X.reshape(X.shape[0], 1, X.shape[1])`
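The effect of this reshape is easiest to see on a tiny array. The sketch below uses five made-up samples with one feature each; the values are assumptions for illustration only.

```python
import numpy as np

# 5 samples, 1 feature each (made-up scaled values)
X = np.array([[0.1], [0.4], [-0.2], [0.7], [0.3]])
print(X.shape)    # 2D: (samples, features)

# insert a time-step axis of length 1 between samples and features
X3d = X.reshape(X.shape[0], 1, X.shape[1])
print(X3d.shape)  # 3D: (samples, time steps, features)
```

Only the shape changes. Each sample still carries the same single value, now presented as a sequence of one time step.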

We will use the Sequential API to define the network. The shape of the input data must be specified in the LSTM layer using the “batch_input_shape” argument, a tuple that specifies the expected number of observations to read in each batch, the number of time steps and the number of features.

We also specify the number of neurons, also called **memory units or blocks**. Then we have one output layer, Dense(1). When compiling the network, we must specify a loss function and an optimization algorithm; the loss measures the prediction error and the optimizer updates the weights to reduce it.

    model = Sequential()
    model.add(LSTM(neurons, batch_input_shape=(batch_size, X.shape[1], X.shape[2]), stateful=True))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')

Once compiled we must control when the internal state is reset because the network is stateful. We must manually manage the training process one epoch at a time across the desired number of epochs.

By default, *the samples within an epoch are shuffled prior to being exposed to the network and again this is undesirable for the LSTM because we want the network to build up state as it learns across the sequences of observations.*

So we will disable the shuffling of samples by setting “shuffle” to “False”.

We will also reset the internal state at the end of the training epoch, ready for the next training iteration.

    for i in range(nb_epoch):
        # train for one epoch without shuffling, then clear the state
        model.fit(X, y, epochs=1, batch_size=batch_size, verbose=0, shuffle=False)
        model.reset_states()

The batch_size must be set to 1, and the same value must be used in the model's predict() function, because we are interested in making one-step forecasts on the test data.

Remember that during training the internal state is reset after each epoch. While forecasting, we will not reset the internal state between forecasts. In fact, we want the model to build up state as we forecast each time step in the test dataset.

If you are new to LSTM and still unsure how it works, see **Illustrated Guide to LSTM’s and GRU’s: A step by step explanation** for a clear explanation of the workflow of LSTM.

Now let’s put all of the pieces together.

    from pandas import DataFrame
    from pandas import Series
    from pandas import concat
    from pandas import read_csv
    from sklearn.metrics import mean_squared_error
    from sklearn.preprocessing import MinMaxScaler
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.layers import LSTM
    from math import sqrt
    from matplotlib import pyplot
    import numpy

    # supervised learning function

    def timeseries_to_supervised(data, lag=1):
        df = DataFrame(data)
        columns = [df.shift(i) for i in range(1, lag+1)]
        columns.append(df)
        df = concat(columns, axis=1)
        df.fillna(0, inplace=True)
        return df

    # create a difference series
    def difference(dataset, interval=1):
        diff = list()
        for i in range(interval, len(dataset)):
            value = dataset[i] - dataset[i - interval]
            diff.append(value)
        return Series(diff)

    # invert difference value
    def inverse_difference(history, yhat, interval=1):
        return yhat + history[-interval]

    # scale train and test data to [-1, 1]
    def scale(train, test):
        # fit scaler on the training data only
        scaler = MinMaxScaler(feature_range=(-1, 1))
        scaler = scaler.fit(train)
        # transform train
        train = train.reshape(train.shape[0], train.shape[1])
        train_scaled = scaler.transform(train)
        # transform test
        test = test.reshape(test.shape[0], test.shape[1])
        test_scaled = scaler.transform(test)
        return scaler, train_scaled, test_scaled

    # inverse scaling for the forecast value
    def invert_scale(scaler, X, value):
        new_row = [x for x in X] + [value]
        array = numpy.array(new_row)
        array = array.reshape(1, len(array))
        inverted = scaler.inverse_transform(array)
        return inverted[0, -1]

    # fit an LSTM network to training data
    def fit_lstm(train, batch_size, nb_epoch, neurons):
        X, y = train[:, 0:-1], train[:, -1]
        X = X.reshape(X.shape[0], 1, X.shape[1])
        model = Sequential()
        model.add(LSTM(neurons, batch_input_shape=(batch_size, X.shape[1], X.shape[2]), stateful=True))
        model.add(Dense(1))
        model.compile(loss='mean_squared_error', optimizer='adam')
        for i in range(nb_epoch):
            model.fit(X, y, epochs=1, batch_size=batch_size, verbose=1, shuffle=False)
            model.reset_states()
        return model

    # make a one-step forecast
    def forecast_lstm(model, batch_size, X):
        X = X.reshape(1, 1, len(X))
        yhat = model.predict(X, batch_size=batch_size)
        return yhat[0,0]

    # load dataset (squeeze the single data column into a Series)
    series = read_csv('sales_year.csv', header=0, parse_dates=[0], index_col=0).squeeze('columns')

    # transform data to be stationary
    raw_values = series.values
    diff_values = difference(raw_values, 1)

    # transform data to be supervised learning
    supervised = timeseries_to_supervised(diff_values, 1)
    supervised_values = supervised.values

    # split data into train and test sets (last 12 rows are the test set)
    train, test = supervised_values[0:-12], supervised_values[-12:]

    # transform the scale of the data
    scaler, train_scaled, test_scaled = scale(train, test)

    # fit the model
    lstm_model = fit_lstm(train_scaled, 1, 3000, 4)

    # forecast the entire training dataset to build up state for forecasting
    train_reshaped = train_scaled[:, 0].reshape(len(train_scaled), 1, 1)
    lstm_model.predict(train_reshaped, batch_size=1)

    # walk-forward validation on the test data
    predictions = list()
    for i in range(len(test_scaled)):
        # make one-step forecast
        X, y = test_scaled[i, 0:-1], test_scaled[i, -1]
        yhat = forecast_lstm(lstm_model, 1, X)
        # invert scaling
        yhat = invert_scale(scaler, X, yhat)
        # invert differencing
        yhat = inverse_difference(raw_values, yhat, len(test_scaled)+1-i)
        # store forecast
        predictions.append(yhat)
        expected = raw_values[len(train) + i + 1]
        print('Month=%d, Predicted=%f, Expected=%f' % (i+1, yhat, expected))

    # report performance; raw_values[-12:] is the last 12 months/rows
    rmse = sqrt(mean_squared_error(raw_values[-12:], predictions))
    print('Test RMSE: %.3f' % rmse)

    # line plot of observed vs predicted
    pyplot.plot(raw_values[-12:])
    pyplot.plot(predictions)
    pyplot.show()

We can observe that our predicted values are pretty close to the actual values. We can also try different settings to improve the model's accuracy. Check out my other article, ‘LSTMs for regression’, where I apply a simple LSTM with optimized settings.
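One detail of the listing that is easy to misread is the interval `len(test_scaled)+1-i` passed to inverse_difference() in the walk-forward loop. The sketch below checks that index arithmetic on made-up numbers, with the true differences standing in for model forecasts (an assumption made so that the check needs no trained model) and a test set of 3 rows instead of 12.

```python
import numpy as np

raw = np.array([10.0, 12.0, 15.0, 14.0, 18.0, 21.0, 25.0, 24.0])  # made-up sales
n_test = 3                          # the listing uses 12

diff = raw[1:] - raw[:-1]           # same values as difference(raw, 1)
n_train = len(diff) - n_test        # rows left for training after the split

restored_values = []
for i in range(n_test):
    perfect_forecast = diff[n_train + i]    # stand-in for the model's yhat
    interval = n_test + 1 - i               # same formula as in the listing
    restored = perfect_forecast + raw[-interval]
    expected = raw[n_train + i + 1]         # same index as `expected` in the listing
    restored_values.append(restored)
    print(i + 1, restored, expected)
```

With a perfect forecast of the difference, the restored value equals the expected raw value at every step, which confirms that the interval points at the observation just before the one being predicted.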

Next we will try **Multivariate LSTM for time series.**

I hope you enjoyed it.

My alternative internet presences: Facebook, Blogger, LinkedIn, Medium, Instagram, ISSUU and my very own Data2Dimensions.

Also available on Quora @ https://www.quora.com/profile/Bob-Rupak-Roy

Have a good day!