AI Writing Poems: Building LSTM model using PyTorch

Original article was published on Deep Learning on Medium

Hello everyone! In this article we will build a model that predicts the next word in a paragraph using PyTorch. First we will learn what RNNs and LSTMs are and how they work. Then we will create our model: we load our data and pre-process it, train the model with PyTorch, and save it. Finally we will make predictions with that model by giving it a starting text, from which it will generate a complete paragraph.

What is RNN?

In machine learning, a simple problem like classifying an image as a dog or a cat can be solved by training a classifier on a set of data. But what if our problem is more complex, for example predicting the next word in a paragraph? If we examine this problem closely, we will find that we humans don't solve it using only our existing knowledge of language and grammar. In this type of problem we use the previous words of the paragraph, and its context, to predict the next word.

Traditional neural networks can't do this because they are trained on a fixed set of data and then used to make predictions. RNNs are used to solve this type of problem. RNN stands for Recurrent Neural Network. We can think of an RNN as a neural network with a loop in it: it passes information from one state to the next. Information persists across steps, so the network can use the previous context to make accurate predictions.
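To make the "loop" concrete, here is a minimal NumPy sketch (illustrative only, not part of the model we build below) of the recurrence a vanilla RNN computes: the same weights are applied at every time step, and the hidden state carries information forward.

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    # one step of a vanilla RNN: the new state mixes the current
    # input with the previous state, using the SAME weights each step
    return np.tanh(x_t @ Wx + h_prev @ Wh + b)

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
Wx = rng.normal(size=(input_size, hidden_size))
Wh = rng.normal(size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                    # initial hidden state
sequence = rng.normal(size=(5, input_size))  # a toy sequence of 5 steps
for x_t in sequence:
    h = rnn_step(x_t, h, Wx, Wh, b)          # state persists across steps

print(h.shape)  # the final state summarizes the whole sequence
```

This is exactly the "loop" in the picture: unrolled over time, it is one layer applied repeatedly, with the hidden state as the memory that connects the steps.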

What is LSTM?

So if RNNs solve problems over sequences where previous context is used, why do we need LSTMs? To answer this question we have to look at these two examples.

Example 1: “Birds live in the nest.” Here it is easy to predict the word “nest” because the context of “birds” appears immediately before it, and an RNN will work fine in this case.

Example 2:

“I grew up in India….. so i can speak Hindi.” Here the task of predicting the word “Hindi” is difficult for an RNN because the gap between the context and the prediction is large. By looking only at the line “i can speak …” we can't predict the language; we need the extra context of “India”. So we need some long-term dependency in our paragraph to understand the context.

For this purpose we use the LSTM (Long Short-Term Memory) network. As the name suggests, it has long-term and short-term memory, and both are used in conjunction to make predictions. Architecturally, an LSTM contains four gates, namely the learn gate, forget gate, remember gate, and use gate. To keep this article simple and hands-on I am not going into the theory of the LSTM architecture, but maybe we will talk about it in upcoming articles (maybe in the next one 😉).
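As a rough sketch of what one LSTM step computes (using the common forget/input/output gate naming rather than the learn/remember/use names above; this is illustrative NumPy, not the PyTorch model we build below):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W projects [input, previous hidden] to the four gate pre-activations
    z = np.concatenate([x_t, h_prev]) @ W + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # gates squashed to (0, 1)
    g = np.tanh(g)                                # candidate new values
    c = f * c_prev + i * g   # long-term (cell) state: keep some old, add some new
    h = o * np.tanh(c)       # short-term (hidden) state: filtered view of the cell
    return h, c

rng = np.random.default_rng(1)
input_size, hidden_size = 4, 3
W = rng.normal(size=(input_size + hidden_size, 4 * hidden_size))
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c, W, b)

print(h.shape, c.shape)
```

The cell state `c` is the long-term memory that can carry the “India” context across many steps, which is exactly what the plain RNN recurrence struggles with.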

Let’s Build Our Model.

So now that we are done with the theory, let's start the interesting part: building our model.

Loading and Pre-Processing the Data

I will use a poetry dataset from Kaggle. It has around 15,000 poems in total, which will be enough for our model to learn patterns. Now let's load it in our notebook.

1. First import the libraries.

import numpy as np
import torch
from torch import nn
import torch.nn.functional as F

2. Now load data from the text file.

# open text file and read in data as `text`
with open('/data/poems_data.txt', 'r') as f:
    text = f.read()

3. We can verify our data by printing the first 100 characters.

print(text[:100])
4. As we know, our neural network does not understand text, so we have to convert our text data to integers. For this purpose we can create token dictionaries that map characters to integers and vice versa.

# encode the text and map each character to an integer and vice versa
# we create two dictionaries:
# 1. int2char, which maps integers to characters
# 2. char2int, which maps characters to unique integers
chars = tuple(set(text))
int2char = dict(enumerate(chars))
char2int = {ch: ii for ii, ch in int2char.items()}
# encode the text
encoded = np.array([char2int[ch] for ch in text])

5. One-hot encoding is used to represent each character. For example, if we have three characters a, b, c, we can represent them as [1,0,0], [0,1,0], [0,0,1]: we put a 1 at the position of that character and 0 everywhere else. For our use case we have many characters and symbols, so our one-hot vectors will be long, but that's fine.

def one_hot_encode(arr, n_labels):

    # Initialize the encoded array
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)

    # Fill the appropriate elements with ones
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.

    # Finally reshape it to get back to the original array
    one_hot = one_hot.reshape((*arr.shape, n_labels))

    return one_hot

Now test it in this way.

# check that the function works as expected
test_seq = np.array([[0, 5, 1]])
one_hot = one_hot_encode(test_seq, 8)
print(one_hot)

6. Now we have to create batches for our model, and this is a very crucial part. We choose a batch size, which is the number of rows, and a sequence length, which is how many columns are used in one batch.

def get_batches(arr, batch_size, seq_length):
    '''Create a generator that returns batches of size
       batch_size x seq_length from arr.

       arr: Array you want to make batches from
       batch_size: Batch size, the number of sequences per batch
       seq_length: Number of encoded chars in a sequence
    '''
    batch_size_total = batch_size * seq_length
    # total number of batches we can make
    n_batches = len(arr)//batch_size_total

    # Keep only enough characters to make full batches
    arr = arr[:n_batches * batch_size_total]
    # Reshape into batch_size rows
    arr = arr.reshape((batch_size, -1))

    # iterate through the array, one sequence at a time
    for n in range(0, arr.shape[1], seq_length):
        # The features
        x = arr[:, n:n+seq_length]
        # The targets, shifted by one
        y = np.zeros_like(x)
        try:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, n+seq_length]
        except IndexError:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, 0]
        yield x, y

7. We can check if a GPU is available. (If a GPU is not available, keep the number of epochs low.)

# check if GPU is available
train_on_gpu = torch.cuda.is_available()
if train_on_gpu:
    print('Training on GPU!')
else:
    print('No GPU available, training on CPU; consider making n_epochs very small.')

8. Here we create a class called CharRNN; it's the class for our model. In the __init__ method we define the layers: two stacked LSTM layers, dropout (which helps avoid overfitting), and a simple linear layer for the output.

class CharRNN(nn.Module):

    def __init__(self, tokens, n_hidden=256, n_layers=2,
                 drop_prob=0.5, lr=0.001):
        super().__init__()
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr

        # creating character dictionaries
        self.chars = tokens
        self.int2char = dict(enumerate(self.chars))
        self.char2int = {ch: ii for ii, ch in self.int2char.items()}

        # lstm layer
        self.lstm = nn.LSTM(len(self.chars), n_hidden, n_layers,
                            dropout=drop_prob, batch_first=True)

        # dropout layer
        self.dropout = nn.Dropout(drop_prob)

        # output layer
        self.fc = nn.Linear(n_hidden, len(self.chars))

    def forward(self, x, hidden):
        ''' Forward pass through the network.
            These inputs are x, and the hidden/cell state `hidden`. '''

        ## Get the outputs and the new hidden state from the lstm
        r_output, hidden = self.lstm(x, hidden)

        ## pass through a dropout layer
        out = self.dropout(r_output)

        # Stack up LSTM outputs using view
        # you may need to use contiguous to reshape the output
        out = out.contiguous().view(-1, self.n_hidden)

        ## put x through the fully-connected layer
        out = self.fc(out)

        return out, hidden

    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x n_hidden,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data

        if train_on_gpu:
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda(),
                      weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                      weight.new(self.n_layers, batch_size, self.n_hidden).zero_())

        return hidden

9. Now that we have our model, it's time to train it. For training we use an optimizer and a loss function. We calculate the loss after each step, backpropagate it, and the optimizer's step function updates the weights appropriately. The loss will slowly decrease, which means our model is getting better.

We also run validation during training to get the validation loss, so we can decide whether our model is under-fitting or over-fitting.

def train(net, data, epochs=10, batch_size=10, seq_length=50, lr=0.001, clip=5, val_frac=0.1, print_every=10):
    ''' Training a network

        net: CharRNN network
        data: text data to train the network
        epochs: Number of epochs to train
        batch_size: Number of mini-sequences per mini-batch, aka batch size
        seq_length: Number of character steps per mini-batch
        lr: learning rate
        clip: gradient clipping
        val_frac: Fraction of data to hold out for validation
        print_every: Number of steps for printing training and validation loss
    '''
    net.train()

    opt = torch.optim.Adam(net.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    # create training and validation data
    val_idx = int(len(data)*(1-val_frac))
    data, val_data = data[:val_idx], data[val_idx:]

    if train_on_gpu:
        net.cuda()

    counter = 0
    n_chars = len(net.chars)
    for e in range(epochs):
        # initialize hidden state
        h = net.init_hidden(batch_size)

        for x, y in get_batches(data, batch_size, seq_length):
            counter += 1

            # One-hot encode our data and make them Torch tensors
            x = one_hot_encode(x, n_chars)
            inputs, targets = torch.from_numpy(x), torch.from_numpy(y)

            if train_on_gpu:
                inputs, targets = inputs.cuda(), targets.cuda()

            # Creating new variables for the hidden state, otherwise
            # we'd backprop through the entire training history
            h = tuple([each.data for each in h])

            # zero accumulated gradients
            net.zero_grad()

            # get the output from the model
            output, h = net(inputs, h)

            # calculate the loss and perform backprop
            loss = criterion(output, targets.view(batch_size*seq_length).long())
            loss.backward()
            # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
            nn.utils.clip_grad_norm_(net.parameters(), clip)
            opt.step()

            # loss stats
            if counter % print_every == 0:
                # Get validation loss
                val_h = net.init_hidden(batch_size)
                val_losses = []
                net.eval()
                for x, y in get_batches(val_data, batch_size, seq_length):
                    # One-hot encode our data and make them Torch tensors
                    x = one_hot_encode(x, n_chars)
                    x, y = torch.from_numpy(x), torch.from_numpy(y)

                    # Creating new variables for the hidden state, otherwise
                    # we'd backprop through the entire training history
                    val_h = tuple([each.data for each in val_h])

                    inputs, targets = x, y
                    if train_on_gpu:
                        inputs, targets = inputs.cuda(), targets.cuda()

                    output, val_h = net(inputs, val_h)
                    val_loss = criterion(output, targets.view(batch_size*seq_length).long())

                    val_losses.append(val_loss.item())

                net.train() # reset to train mode after iterating through validation data

                print("Epoch: {}/{}...".format(e+1, epochs),
                      "Step: {}...".format(counter),
                      "Loss: {:.4f}...".format(loss.item()),
                      "Val Loss: {:.4f}".format(np.mean(val_losses)))

Now we can train it in the following way.

# define and print the net
n_hidden = 512
n_layers = 2
net = CharRNN(chars, n_hidden, n_layers)
print(net)

batch_size = 128
seq_length = 100
n_epochs = 10 # start small if you are just testing initial behavior
# train the model
train(net, encoded, epochs=n_epochs, batch_size=batch_size, seq_length=seq_length, lr=0.001, print_every=10)

10. We can save the model in the following way.

# change the name, for saving multiple files
model_name = ''

checkpoint = {'n_hidden': net.n_hidden,
              'n_layers': net.n_layers,
              'state_dict': net.state_dict(),
              'tokens': net.chars}

with open(model_name, 'wb') as f:
    torch.save(checkpoint, f)

11. Now that the model is trained, we'll want to sample from it and make predictions about the next characters. To sample, we pass in a character and have the network predict the next character. Then we take that character, pass it back in, and get another predicted character; keep doing this and you'll generate a bunch of text. Here top_k limits sampling to the k most likely characters, and the next character is chosen from among them.

def predict(net, char, h=None, top_k=None):
    ''' Given a character, predict the next character.
        Returns the predicted character and the hidden state.
    '''
    # tensor inputs
    x = np.array([[net.char2int[char]]])
    x = one_hot_encode(x, len(net.chars))
    inputs = torch.from_numpy(x)

    if train_on_gpu:
        inputs = inputs.cuda()

    # detach hidden state from history
    h = tuple([each.data for each in h])
    # get the output of the model
    out, h = net(inputs, h)

    # get the character probabilities
    p = F.softmax(out, dim=1).data
    if train_on_gpu:
        p = p.cpu() # move to cpu

    # get top characters
    if top_k is None:
        top_ch = np.arange(len(net.chars))
    else:
        p, top_ch = p.topk(top_k)
        top_ch = top_ch.numpy().squeeze()

    # select the likely next character with some element of randomness
    p = p.numpy().squeeze()
    char = np.random.choice(top_ch, p=p/p.sum())

    # return the encoded value of the predicted char and the hidden state
    return net.int2char[char], h
def sample(net, size, prime='The', top_k=None):

    if train_on_gpu:
        net.cuda()
    else:
        net.cpu()

    net.eval() # eval mode

    # First off, run through the prime characters
    chars = [ch for ch in prime]
    h = net.init_hidden(1)
    for ch in prime:
        char, h = predict(net, ch, h, top_k=top_k)

    chars.append(char)

    # Now pass in the previous character and get a new one
    for ii in range(size):
        char, h = predict(net, chars[-1], h, top_k=top_k)
        chars.append(char)

    return ''.join(chars)

12. Now we can use this sample method to make predictions.

print(sample(net, 500, prime='christmas', top_k=2))

and the output will look something like this.

christmas a son of thisthe sun wants the street of the stars, and the way the way
they went and too man and the star of the words
of a body of a street and the strange shoulder of the sky
and the sun, an end on the sun and the sun and so to the stars are stars
and the words of the water and the streets of the world
to see them to start a posture of the streets
on the street of the streets, and the sun and soul of the station
and so too too the world of a sound and stranger and to the world
to the sun a

As we can see, our model was able to generate some good lines. The content doesn't make a lot of sense, but the model produces lines with mostly correct grammar. If we train it for longer it can perform even better.


In this article, we have learned about RNNs and LSTMs, and we have built our poem model with PyTorch. I hope you found this article helpful. If you have any queries or suggestions, feel free to post them in the comment section below, and I will be really glad to assist you.