Recurrent Neural Networks

Original article can be found here (source): Deep Learning on Medium

Understand the intuition behind RNN!

Figure by Author


The goal of this article is to explore Recurrent Neural Networks (RNNs) in depth. They are a kind of neural network with a different architecture from the ones seen in previous articles.

Concretely, the article is divided into the following parts:

  • What RNNs are
  • Long Short-Term Memory (LSTM) networks
  • Applying RNNs to time series

What are RNNs?

As we have seen here, CNNs do not have any kind of memory. RNNs can go beyond this limitation of ‘starting to think from scratch’ each time because they do have some kind of memory.

Let’s see how they work with a very visual example:

Let’s say that we live in an apartment with the perfect roommate: she cooks a different meal depending on the weather, sunny or rainy.

Figure by Author

So, if we codify these meals with vectors:

Figure by Author

And our Neural Network does the following:

Figure by Author

If we recall, neural networks learn weights that can be expressed as matrices, and those weights are used to make predictions. Ours will be as follows:

If it is a sunny day:

Figure by Author

If it is a rainy day:

Figure by Author

And if we take a look at our weight matrix, this time seen as a graph:

Figure by Author
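
To make the weather-to-meal mapping concrete, here is a minimal NumPy sketch. The one-hot encodings and the two meals (apple pie and burger) are assumptions chosen for illustration, since the actual vectors are only shown in the figures. The point is simply that multiplying a weight matrix by a one-hot weather vector selects a meal.

```python
import numpy as np

# Hypothetical encodings, assumed for illustration only.
MEALS = ["apple pie", "burger"]
WEATHER = {"sunny": np.array([1, 0]), "rainy": np.array([0, 1])}

# Weight matrix: column j is the one-hot meal vector cooked for weather j.
# Assumed pairing: sunny -> apple pie, rainy -> burger.
W = np.array([[1, 0],
              [0, 1]])

def predict_meal(weather):
    """Multiply the weight matrix by the one-hot weather vector."""
    meal_vector = W @ WEATHER[weather]
    return MEALS[int(np.argmax(meal_vector))]

print(predict_meal("sunny"))
print(predict_meal("rainy"))
```

In a real network the entries of W would be learned from examples rather than written by hand.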

Let’s now see what RNNs add, following this example:

Recurrent Neural Networks

Let’s say that our dear roommate no longer bases the decision of what to cook on the weather, but simply looks at what she cooked yesterday.

The network in charge of predicting what the roommate will cook tomorrow based on what she cooked today is a Recurrent Neural Network (RNN).

This RNN can be expressed as the following matrix:

Figure by Author

So what we have is a:

Figure by Author
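
As a rough sketch of this idea, the recurrence can be written as a matrix that maps today's meal to tomorrow's, with each output fed back in as the next input. The three-meal menu and the cyclic order used here are assumptions for illustration; the article's own matrix is in the figure.

```python
import numpy as np

MEALS = ["apple pie", "burger", "chicken"]  # hypothetical menu

# Assumed cyclic rule: apple pie -> burger -> chicken -> apple pie.
# Column j holds the one-hot vector of the meal that follows meal j.
W = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 1, 0]])

def one_hot(meal):
    """Encode a meal name as a one-hot vector."""
    v = np.zeros(len(MEALS), dtype=int)
    v[MEALS.index(meal)] = 1
    return v

def next_meal(today):
    """Predict tomorrow's meal from today's meal."""
    return MEALS[int(np.argmax(W @ one_hot(today)))]

def forecast(start, days):
    """Feed each output back in as the next input (the recurrence)."""
    meals = [start]
    for _ in range(days):
        meals.append(next_meal(meals[-1]))
    return meals

print(forecast("apple pie", 4))
```

The loop in forecast is what makes the network recurrent: its own previous output is its next input.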

Let’s Make it a Little Bit More Complex

Imagine now that your roommate decides what to cook based on what she cooked yesterday and the weather.

  • If the day is sunny, she spends the day on the terrace with a good beer in her hand and does not cook, so we eat the same thing as yesterday.
  • If it rains, she stays home and cooks.

It would be something like this:

Figure by Author

So we end up having one model that tells us what we are going to eat depending on what we ate yesterday and another model that tells us whether our roommate will cook or not.

Figure by Author

And the add and merge operations are the following:

Figure by Author
Figure by Author

And here you can see the graph:

Figure by Author
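
One possible way to realize the add and merge steps in code is sketched below. The matrices, the threshold used as a non-linearity, and the three-meal menu are all assumptions made for illustration, not the exact values from the figures. The food matrix produces two candidates (same meal as yesterday, next meal in the cycle), the weather matrix marks which candidate is allowed, and the merge collapses the result back to a single meal vector.

```python
import numpy as np

MEALS = ["apple pie", "burger", "chicken"]  # hypothetical menu
WEATHER = {"sunny": np.array([1, 0]), "rainy": np.array([0, 1])}

same = np.eye(3, dtype=int)                 # "eat the same as yesterday"
shift = np.array([[0, 0, 1],
                  [1, 0, 0],
                  [0, 1, 0]])               # "cook the next meal in the cycle"
FOOD_MATRIX = np.vstack([same, shift])      # shape (6, 3): both candidates stacked
# Sunny switches on the top half (same meal), rainy the bottom half (next meal).
WEATHER_MATRIX = np.array([[1, 0]] * 3 + [[0, 1]] * 3)  # shape (6, 2)

def cook(yesterday, weather):
    """Combine yesterday's meal and today's weather into today's meal."""
    food = np.zeros(3, dtype=int)
    food[MEALS.index(yesterday)] = 1
    # Add: an entry reaches 2 only where a candidate and the weather agree.
    added = FOOD_MATRIX @ food + WEATHER_MATRIX @ WEATHER[weather]
    # Non-linearity (assumed threshold): keep only the entries equal to 2.
    activated = (added >= 2).astype(int)
    # Merge: fold the two halves back into one 3-dimensional meal vector.
    merged = activated[:3] + activated[3:]
    return MEALS[int(np.argmax(merged))]

print(cook("apple pie", "sunny"))
print(cook("apple pie", "rainy"))
```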

And that is how it works!

This example is from a great video, which I recommend you watch as many times as you need to internalize and understand the previous explanation. You can find the video here:

And what are RNNs used for?

There are several types:

Figure by Author

They are very good at making predictions, especially when our data is sequential:

Stock market forecasts

The value of a share depends largely on its previous values.

Sequence generation

This works as long as the data form a sequence in which the value at instant t depends on the value at instant t-1.

Text generation

For example, when your cell phone suggests words, it looks at the last word you have written and at the letters you are typing at that moment to suggest the next letters or even whole words.

Voice recognition

In this case, we have the previously recognized word and the audio that reaches us at that moment.

Long Short-Term Memory Networks

Let’s now study how the most popular RNNs work. They are the LSTM networks, and their structure is as follows:

Figure by Author

But first: Why are they the most popular ones?

It turns out that conventional RNNs have memory problems: unless a network is specially designed for it, it is incapable of long-term memory. And why is this a problem?

Well, going back to the problem of our roommate, in that example we only need to know what we ate yesterday, so there would be no problem.

Figure by Author

But imagine that, instead of a three-course menu, we had a 60-course one.

Figure by Author

Conventional RNNs wouldn’t be able to remember things that happened a long time ago. However, the LSTM would!

And why?

Let’s take a look at the architecture of the RNN and the LSTM:


Figure by Author


Figure by Author

It turns out that where RNNs have a single layer, LSTMs have a combination of layers that interact with each other in a very special way.
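
To show what "a single layer" means here, the following is a minimal NumPy sketch of one step of a conventional RNN cell, using the commonly used tanh formulation. The dimensions, the random weights, and the names rnn_step, W_x, W_h and b are illustrative assumptions, not taken from the article's figures.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a plain RNN: a single tanh layer over input and previous state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
input_dim, hidden_dim, steps = 3, 4, 5      # arbitrary sizes for the demo
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                    # initial hidden state
for x_t in rng.normal(size=(steps, input_dim)):
    # The same weights are reused at every time step.
    h = rnn_step(x_t, h, W_x, W_h, b)

print(h.shape)
```

The hidden state h is the only memory the cell has, and it is rewritten through the tanh layer at every step, which is one intuition for why distant information tends to fade.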

Let’s try to understand this, but first, let me explain the nomenclature:

Figure by Author

In the diagrams above:

  • A vector travels along each line, from the output of one node to the inputs of others.
  • The pink circles indicate element-wise operations, such as vector sums, while the yellow boxes are neural layers that are learned by training.
  • Lines that join indicate concatenation, and lines that fork indicate that the same content is copied to two different destinations.

The key idea of LSTMs

The key is the state of the cell, which is indicated in the diagram as the line that travels across the top:

Figure by Author

The state of the cell is like a conveyor belt that runs along the whole architecture of the network with very few interactions (and they are linear), so information can flow along it largely unchanged.

The ingenious part is that the layers of the LSTM can choose whether or not to contribute information to this conveyor belt, and that decision is made by the “gates”:

Figure by Author

The gates are nothing more than a way of carefully regulating the information that arrives on the conveyor belt. They are composed of a neural network layer with a sigmoid activation followed by an element-wise multiplication.

Thus, the sigmoid layer outputs a number between 0 and 1, which indicates how important it is to let that information pass to the conveyor belt. A 0 means “I don’t care”, and a 1 means “it’s very important”.
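
A tiny numeric sketch of a single gate may help. The values of candidate and gate_logits below are made up for illustration; in a real LSTM the gate's pre-activations come from a learned layer.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

candidate = np.array([0.5, -1.2, 2.0])        # information that wants to pass
gate_logits = np.array([-10.0, 0.0, 10.0])    # hypothetical pre-activations

gate = sigmoid(gate_logits)                   # close to 0, exactly 0.5, close to 1
passed = gate * candidate                     # element-wise multiplication

print(np.round(gate, 3))
print(np.round(passed, 3))
```

The first component is almost entirely blocked, the second is halved, and the third passes through nearly untouched.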

As you can see in the diagram, an LSTM has three such gates to protect and control the conveyor belt.

The specific details of this operation are very well explained here:
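
Putting the pieces together, one step of an LSTM cell can be sketched in NumPy as follows. This uses the widely used formulation with forget, input and output gates. The parameter names, sizes and random weights are assumptions for illustration, and a library implementation such as the Keras LSTM layer may differ in details.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: three sigmoid gates regulate the cell state c."""
    z = np.concatenate([h_prev, x_t])                   # joined lines = concatenation
    f = sigmoid(params["W_f"] @ z + params["b_f"])      # forget gate
    i = sigmoid(params["W_i"] @ z + params["b_i"])      # input gate
    o = sigmoid(params["W_o"] @ z + params["b_o"])      # output gate
    g = np.tanh(params["W_g"] @ z + params["b_g"])      # candidate values
    c_t = f * c_prev + i * g                            # conveyor belt update
    h_t = o * np.tanh(c_t)                              # exposed hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4                            # arbitrary demo sizes
params = {}
for name in ("f", "i", "o", "g"):
    params["W_" + name] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
    params["b_" + name] = np.zeros(hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, params)

print(h.shape, c.shape)
```

Note that the cell state is only touched by an element-wise multiplication and an addition, which is what lets information travel along it over many steps.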

And this blog is also very interesting:

With this in mind, let’s see what Recurrent Networks can do!

LSTM Implementation

Image Classification with LSTM

We’ll follow an example that can be found here:

from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.datasets import mnist
from keras.utils import to_categorical

# Hyperparameters
batch_size = 128
nb_epoch = 10
# Parameters for the MNIST dataset
img_rows, img_cols = 28, 28
nb_classes = 10
# Parameters for the LSTM network: each image row is one time step
nb_lstm_outputs = 30
nb_time_steps = img_rows
dim_input_vector = img_cols
# Load the MNIST dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
print('X_train original shape:', X_train.shape)
input_shape = (nb_time_steps, dim_input_vector)
# Scale pixel values to [0, 1]
X_train = X_train.astype('float32') / 255.
X_test = X_test.astype('float32') / 255.
# One-hot encode the labels
Y_train = to_categorical(y_train, nb_classes)
Y_test = to_categorical(y_test, nb_classes)
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
# LSTM Building
model = Sequential()
model.add(LSTM(nb_lstm_outputs, input_shape=input_shape))
model.add(Dense(nb_classes, activation='softmax'))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
# Training the model
history = model.fit(X_train, Y_train,
                    epochs=nb_epoch,
                    batch_size=batch_size,
                    validation_data=(X_test, Y_test),
                    verbose=1)
# Evaluation
evaluation = model.evaluate(X_test, Y_test, batch_size=batch_size, verbose=1)
print('Summary: Loss over the test dataset: %.2f, Accuracy: %.2f' % (evaluation[0], evaluation[1]))
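
The model above treats each 28x28 image as a sequence: 28 time steps (the rows), each a vector of 28 features (the pixels in that row). The short sketch below illustrates that reading of the shape with a stand-in array instead of a real MNIST digit; the array contents are made up.

```python
import numpy as np

# Stand-in for one MNIST digit (values are arbitrary, only the shape matters).
image = np.arange(28 * 28, dtype="float32").reshape(28, 28)
batch = image[np.newaxis, ...]               # add the samples axis

samples, time_steps, features = batch.shape
print(samples, time_steps, features)

# The LSTM consumes one row per time step.
for t in range(3):
    row = batch[0, t]
    print("time step", t, "input vector shape", row.shape)
```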

Time Series Prediction with LSTM

# LSTM for international airline passengers problem with regression framing
import numpy
import matplotlib.pyplot as plt
from pandas import read_csv
import math
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
# convert an array of values into a dataset matrix
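def create_dataset(dataset, look_back=1):
    dataX, dataY = [], []
    for i in range(len(dataset)-look_back-1):
        # input window: look_back consecutive values starting at i
        a = dataset[i:(i+look_back), 0]
        dataX.append(a)
        # target: the value right after the window
        dataY.append(dataset[i + look_back, 0])
    return numpy.array(dataX), numpy.array(dataY)
# fix random seed for reproducibility (any fixed integer works)
numpy.random.seed(7)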
# load the dataset
dataframe = read_csv('international-airline-passengers.csv', usecols=[1], engine='python', skipfooter=3)
dataset = dataframe.values
dataset = dataset.astype('float32')
# normalize the dataset
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)
# split into train and test sets
train_size = int(len(dataset) * 0.67)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]
# reshape into X=t and Y=t+1
look_back = 1
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)
# reshape input to be [samples, time steps, features]
trainX = numpy.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
testX = numpy.reshape(testX, (testX.shape[0], 1, testX.shape[1]))
# create and fit the LSTM network
model = Sequential()
model.add(LSTM(4, input_shape=(1, look_back)))
# single output unit: the predicted next value
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)
# make predictions
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
# invert predictions
trainPredict = scaler.inverse_transform(trainPredict)
trainY = scaler.inverse_transform([trainY])
testPredict = scaler.inverse_transform(testPredict)
testY = scaler.inverse_transform([testY])
# calculate root mean squared error
trainScore = math.sqrt(mean_squared_error(trainY[0], trainPredict[:,0]))
print('Train Score: %.2f RMSE' % (trainScore))
testScore = math.sqrt(mean_squared_error(testY[0], testPredict[:,0]))
print('Test Score: %.2f RMSE' % (testScore))
# shift train predictions for plotting
trainPredictPlot = numpy.empty_like(dataset)
trainPredictPlot[:, :] = numpy.nan
trainPredictPlot[look_back:len(trainPredict)+look_back, :] = trainPredict
# shift test predictions for plotting
testPredictPlot = numpy.empty_like(dataset)
testPredictPlot[:, :] = numpy.nan
testPredictPlot[len(trainPredict)+(look_back*2)+1:len(dataset)-1, :] = testPredict
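# plot baseline and predictions
plt.plot(scaler.inverse_transform(dataset))
plt.plot(trainPredictPlot)
plt.plot(testPredictPlot)
plt.show()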
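
To see what the windowing step does to the data, here is a small illustrative re-implementation run on a toy series. The function name make_windows and the toy values are assumptions. Unlike the create_dataset helper above, this sketch works on a 1-D array and keeps the last available window.

```python
import numpy as np

def make_windows(series, look_back=1):
    """X holds look_back past values, y holds the value that follows them."""
    X, y = [], []
    for i in range(len(series) - look_back):
        X.append(series[i:i + look_back])
        y.append(series[i + look_back])
    return np.array(X), np.array(y)

series = np.array([10, 20, 30, 40, 50])     # toy data
X, y = make_windows(series, look_back=2)

for window, target in zip(X, y):
    print(window, "->", target)

# Keras LSTM layers expect [samples, time steps, features].
X_lstm = X.reshape(X.shape[0], 1, X.shape[1])
print(X_lstm.shape)
```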

Final Words

As always, I hope you enjoyed the post, and that you gained an intuition about RNNs and how to implement them!

If you liked this post then you can take a look at my other posts on Data Science and Machine Learning here.

If you want to learn more about Machine Learning, Data Science and Artificial Intelligence follow me on Medium, and stay tuned for my next posts!