Generating Rick and Morty Episodes

Source: Deep Learning on Medium


[Image [1] source: https://www.kisspng.com/png-rick-sanchez-rick-and-morty-season-3-adult-swim-ri-829096/preview.html]

A practical guide to training RNNs for language modelling in PyTorch: we use natural language processing and deep learning to generate Rick and Morty scripts.


We will first import the dependencies required to process our data and build our model.

import os
import pickle
import numpy as np
from collections import Counter
from gensim.models import Word2Vec
import torch
import torch.nn as nn
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from tensorboardX import SummaryWriter

Recurrent neural networks are computationally heavy, so ideally you would train them on a GPU.

train_on_gpu = torch.cuda.is_available()
if not train_on_gpu:
    print('No GPU found. Training will happen on CPU and may take a long time.')

Vocabulary

Since we are training a word-level RNN, we first need to create a vocabulary of words that our model will use. This vocabulary is built from the words present in our training data.

We replace all punctuation with corresponding tokens for better processing of our text data.

class Vocabulary(object):
    """
    Wrapper class for vocabulary
    """

    def __init__(self):
        self._word2idx = {}
        self._idx2word = {}
        self._counter = Counter()
        self._size = 0
        # Note: replacements run in insertion order, so '-' is handled before '--'
        # and a double hyphen ends up as two <hyphen> tokens.
        self._punctuation2token = {';': "<semicolon>",
                                   ':': "<colon>",
                                   "'": "<inverted_comma>",
                                   '"': "<quotation_mark>",
                                   ',': "<comma>",
                                   '\n': "<new_line>",
                                   '!': "<exclamation_mark>",
                                   '-': "<hyphen>",
                                   '--': "<hyphens>",
                                   '.': "<period>",
                                   '?': "<question_mark>",
                                   '(': "<left_paren>",
                                   ')': "<right_paren>",
                                   '♪': "<music_note>",
                                   '[': "<left_square>",
                                   ']': "<right_square>",
                                   "’": "<inverted_comma>",
                                   }
        # Index 0 is reserved for <pad> and index 1 for <unknown>
        self.add_text('<pad>')
        self.add_text('<unknown>')

    def add_word(self, word):
        """
        Adds a token to the vocabulary
        :param word: (str) word to add to vocabulary
        :return: None
        """
        word = word.lower()
        if word not in self._word2idx:
            self._idx2word[self._size] = word
            self._word2idx[word] = self._size
            self._size += 1
        self._counter[word] += 1

    def add_text(self, text):
        """
        Splits text into tokens and adds to the vocabulary
        :param text: (str) text to add to vocabulary
        :return: None
        """
        text = self.clean_text(text)
        tokens = self.tokenize(text)
        for token in tokens:
            self.add_word(token)

    def clean_text(self, text):
        """
        Cleans text for processing
        :param text: (str) text to be cleaned
        :return: (str) cleaned text
        """
        text = text.lower().strip()
        for key, token in self._punctuation2token.items():
            text = text.replace(key, ' {} '.format(token))
        text = text.strip()
        # Collapse runs of spaces left behind by the replacements
        while '  ' in text:
            text = text.replace('  ', ' ')
        return text

    def tokenize(self, text):
        """
        Splits text into individual tokens
        :param text: (str) text to be tokenized
        :return: (list) list of tokens in text
        """
        return text.split(' ')

    def set_vocab(self, vocab):
        """
        Resets the vocabulary and rebuilds it from a list of words
        :param vocab: (list) words to add to the vocabulary
        :return: None
        """
        self._word2idx = {}
        self._idx2word = {}
        self._counter = Counter()
        self._size = 0
        self.add_text('<pad>')
        self.add_text('<unknown>')
        for word in vocab:
            self.add_word(word)

    def most_common(self, n):
        """
        Creates a new vocabulary object containing the n most frequent tokens from current vocabulary
        :param n: (int) number of most frequent tokens to keep
        :return: (Vocabulary) vocabulary containing n most frequent tokens
        """
        tmp = Vocabulary()
        for w in self._counter.most_common(n):
            tmp.add_word(w[0])
            tmp._counter[w[0]] = w[1]
        return tmp

    def load(self, path='vocab.pkl'):
        """
        Loads vocabulary from given path
        :param path: (str) path to pkl object
        :return: None
        """
        with open(path, 'rb') as f:
            self.__dict__.clear()
            self.__dict__.update(pickle.load(f))
        print("\nVocabulary successfully loaded from [{}]\n".format(path))

    def save(self, path='vocab.pkl'):
        """
        Saves vocabulary to given path
        :param path: (str) path where vocabulary should be stored
        :return: None
        """
        with open(path, 'wb') as f:
            pickle.dump(self.__dict__, f)
        print("\nVocabulary successfully stored as [{}]\n".format(path))

    def add_punctuation(self, text):
        """
        Replaces punctuation tokens with corresponding characters
        :param text: (str) text to process
        :return: text with punctuation tokens replaced with characters
        """
        for key, token in self._punctuation2token.items():
            text = text.replace(token, ' {} '.format(key))
        text = text.strip()
        while '  ' in text:
            text = text.replace('  ', ' ')
        # Re-attach punctuation to the neighbouring words
        text = text.replace(' :', ':')
        text = text.replace(" ' ", "'")
        text = text.replace("[ ", "[")
        text = text.replace(" ]", "]")
        text = text.replace(" .", ".")
        text = text.replace(" ,", ",")
        text = text.replace(" !", "!")
        text = text.replace(" ?", "?")
        text = text.replace(" ’ ", "’")
        return text

    def __len__(self):
        """
        Number of unique words in vocabulary
        """
        return self._size

    def __str__(self):
        s = "Vocabulary contains {} tokens\nMost frequent tokens:\n".format(self._size)
        for w in self._counter.most_common(10):
            s += "{} : {}\n".format(w[0], w[1])
        return s

    def __getitem__(self, item):
        """
        Returns the word corresponding to an id or an id corresponding to a word in the vocabulary.
        Returns the id of <unknown> if the word is not present in the vocabulary
        """
        if isinstance(item, int):
            return self._idx2word[item]
        elif isinstance(item, str):
            if item in self._word2idx:
                return self._word2idx[item]
            else:
                return self._word2idx['<unknown>']
        return None
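
To make the punctuation handling concrete, here is a minimal standalone sketch of the same idea. It is not the Vocabulary class itself: the token table is a trimmed-down, hypothetical subset and the helper functions are simplified stand-ins.

```python
# Hypothetical, trimmed-down token table; the Vocabulary class uses a longer one.
PUNCTUATION_TO_TOKEN = {
    ",": "<comma>",
    "!": "<exclamation_mark>",
    ".": "<period>",
    "?": "<question_mark>",
}


def clean_text(text):
    """Lower-case the text and surround each punctuation token with spaces."""
    text = text.lower().strip()
    for mark, token in PUNCTUATION_TO_TOKEN.items():
        text = text.replace(mark, " {} ".format(token))
    # Collapse any run of whitespace into a single space.
    return " ".join(text.split())


def tokenize(text):
    """Split cleaned text into word-level tokens."""
    return text.split(" ")


tokens = tokenize(clean_text("Wubba lubba dub dub, Morty!"))
print(tokens)
```

Each punctuation mark becomes its own token, so the model can learn where commas and exclamation marks belong just like it learns any other word.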

Word2Vec Embedding

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

We create word embeddings for our data using word2vec and can optionally replace the weights of the embedding layer in our model with them.

Using word2vec to learn word vectors for the corpus is optional. If word2vec embeddings are used, the embedding layer in MortyFire is frozen and receives no gradient updates.

Alternatively, you can randomly initialize the embedding layer in MortyFire and learn the word vectors during training.
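
What "close proximity" means is usually measured with cosine similarity. The sketch below illustrates this with made-up 3-dimensional vectors; the numbers are invented for illustration and are not taken from a trained word2vec model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors (real word2vec vectors here would have 300 dimensions).
toy_vectors = {
    "rick": [0.9, 0.1, 0.3],
    "morty": [0.8, 0.2, 0.4],
    "portal": [-0.2, 0.9, -0.5],
}

sim_rick_morty = cosine_similarity(toy_vectors["rick"], toy_vectors["morty"])
sim_rick_portal = cosine_similarity(toy_vectors["rick"], toy_vectors["portal"])
print("rick vs morty:  {:.3f}".format(sim_rick_morty))
print("rick vs portal: {:.3f}".format(sim_rick_portal))
```

In this toy setup "rick" and "morty" point in nearly the same direction, which is what we would hope to see for words that appear in similar contexts.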

with open('data/rick_and_morty.txt', 'r') as f:
    text = f.readlines()

vocab = Vocabulary()
sentences = []
for sentence in text:
    sentence = vocab.clean_text(sentence)
    # clean_text strips the trailing newline, so append its token explicitly
    sentence = vocab.tokenize(sentence) + [vocab._punctuation2token['\n']]
    sentences.append(sentence)

# Written against gensim 3.x. In gensim 4 the `size` argument is called
# `vector_size` and `model.wv.vocab` is replaced by `model.wv.key_to_index`.
model = Word2Vec(sentences, size=300, window=11, min_count=1, workers=4)
print(model)
model.save('data/word2vec.bin')
print("Word2Vec model saved as [data/word2vec.bin]")

words = list(model.wv.vocab)
vocab.set_vocab(words)

embed_size = model.wv.vector_size
embeddings = np.zeros((len(vocab), embed_size), dtype=np.float32)
embeddings[vocab['<pad>']] = 0.0
embeddings[vocab['<unknown>']] = np.random.uniform(-0.1, 0.1, embed_size)
# Indices 0 and 1 are <pad> and <unknown>; copy the learned vectors for the rest
for idx in range(2, len(vocab)):
    embeddings[idx] = model.wv[vocab[idx]]

vocab.save('data/vocab.pkl')
np.save('data/embeddings.npy', embeddings)
print("Embeddings saved as [data/embeddings.npy]")

MortyFire

MortyFire is our recurrent neural network, which uses LSTM units to generate Rick and Morty scripts.

The model has 3 layers:

  • Embedding: The embedding layer is used to learn embeddings for the words present in our vocabulary. Each word is replaced with its corresponding word vector from the embedding layer and passed through the network.
  • LSTM: The LSTM layer processes the word vector of each word in the input text sequence, one time step at a time. The output of the last time step is a vector which encodes the entire input sequence.
  • Linear: The encoded output from the last time step of the LSTM layer is then passed to a fully-connected layer which outputs a score for every word in our vocabulary. A softmax over these scores gives the probability distribution used to find the most suitable candidate for the next word in the sequence.

class MortyFire(nn.Module):
    """ Wrapper class for text generating RNN """

    def __init__(self, vocab_size, embed_size, lstm_size, seq_length, num_layers, dropout=0.5, bidirectional=False,
                 train_on_gpu=True, embeddings=None):
        nn.Module.__init__(self)
        self.vocab_size = vocab_size
        self.num_layers = num_layers
        self.lstm_size = lstm_size
        self.seq_length = seq_length
        self.embed_size = embed_size
        self.train_on_gpu = train_on_gpu
        self.bidirectional = bidirectional
        # A bidirectional LSTM concatenates both directions, doubling its output size
        self.num_directions = 2 if bidirectional else 1
        self.embedding = nn.Embedding(vocab_size, embed_size)
        if embeddings is not None:
            # Use the pre-trained word2vec vectors and freeze them
            self.embedding.weight = nn.Parameter(torch.from_numpy(embeddings))
            self.embedding.weight.requires_grad = False
        self.lstm = nn.LSTM(embed_size, lstm_size, num_layers, dropout=dropout, batch_first=True,
                            bidirectional=bidirectional)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(lstm_size * self.num_directions, vocab_size)

    def forward(self, batch, hidden):
        batch_size = batch.size(0)
        embeds = self.embedding(batch)
        lstm_out, hidden = self.lstm(embeds, hidden)
        # Flatten (batch, time, features) to (batch * time, features) for the linear layer
        lstm_out = lstm_out.contiguous().view(-1, self.lstm_size * self.num_directions)
        drop = self.dropout(lstm_out)
        output = self.fc(drop)
        output = output.view(batch_size, -1, self.vocab_size)
        # Keep only the scores predicted at the last time step
        out = output[:, -1]
        return out, hidden

    def init_hidden(self, batch_size):
        weight = next(self.parameters()).data
        layers = self.num_layers if not self.bidirectional else self.num_layers * 2
        if self.train_on_gpu:
            hidden = (weight.new(layers, batch_size, self.lstm_size).zero_().cuda(),
                      weight.new(layers, batch_size, self.lstm_size).zero_().cuda())
        else:
            hidden = (weight.new(layers, batch_size, self.lstm_size).zero_(),
                      weight.new(layers, batch_size, self.lstm_size).zero_())
        return hidden
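
The reshaping in forward can be hard to follow, so here is a rough sketch of the shapes involved, using NumPy arrays as stand-ins for the tensors. It assumes a unidirectional LSTM and a linear layer that takes lstm_size * num_directions inputs; the sizes and the zero-filled arrays are illustrative only.

```python
import numpy as np

# Illustrative sizes, not the ones used for training.
batch_size, seq_length, lstm_size, vocab_size = 4, 20, 256, 1000
num_directions = 1  # would be 2 for a bidirectional LSTM

# Stand-in for the LSTM output with batch_first=True:
# one feature vector per sequence and per time step.
lstm_out = np.zeros((batch_size, seq_length, lstm_size * num_directions), dtype=np.float32)

# Flatten batch and time so every time step is one row for the linear layer.
flat = lstm_out.reshape(-1, lstm_size * num_directions)

# Stand-in for the fully-connected layer (bias omitted for brevity).
fc_weights = np.zeros((lstm_size * num_directions, vocab_size), dtype=np.float32)
scores = flat @ fc_weights

# Restore the (batch, time, vocab) layout and keep only the last time step.
scores = scores.reshape(batch_size, -1, vocab_size)
last_step = scores[:, -1]

print(flat.shape, scores.shape, last_step.shape)
```

The final array has one row of vocabulary scores per input sequence, which is exactly what the loss function and the sampling code need.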

Data Set

The RickAndMortyData class produces text sequences from the training data set, where each word in each sequence is replaced by its corresponding index value in the vocabulary.

The data set can be downloaded here.

class RickAndMortyData(Dataset):
    """ Wrapper class to process and produce training samples """

    def __init__(self, text, seq_length, vocab=None):
        self.text = text
        self.seq_length = seq_length
        if vocab is None:
            self.vocab = Vocabulary()
            self.vocab.add_text(self.text)
        else:
            self.vocab = vocab
        self.text = self.vocab.clean_text(text)
        self.tokens = self.vocab.tokenize(self.text)

    def __len__(self):
        return len(self.tokens) - self.seq_length

    def __getitem__(self, idx):
        # Input: seq_length consecutive tokens; target: the token that follows them
        x = [self.vocab[word] for word in self.tokens[idx:idx + self.seq_length]]
        y = [self.vocab[self.tokens[idx + self.seq_length]]]
        x = torch.LongTensor(x)
        y = torch.LongTensor(y)
        return x, y
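
The sliding-window logic of the data set can be sketched in plain Python. The make_samples helper and the token ids below are hypothetical and only meant to show which input sequence is paired with which target.

```python
def make_samples(token_ids, seq_length):
    """Pair every window of seq_length ids with the id that follows it."""
    samples = []
    for idx in range(len(token_ids) - seq_length):
        x = token_ids[idx:idx + seq_length]
        y = token_ids[idx + seq_length]
        samples.append((x, y))
    return samples


# Hypothetical token ids standing in for a short piece of text.
token_ids = [10, 11, 12, 13, 14, 15]
samples = make_samples(token_ids, seq_length=3)
for x, y in samples:
    print(x, "->", y)
```

Six tokens with a sequence length of 3 give three training samples, which matches the len(self.tokens) - self.seq_length used in __len__.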

Script Generation

Temperature: Temperature controls how much diversity is introduced into the predictions of the model. A higher temperature makes the model more adventurous in its predictions, which also means it may produce more mistakes. A lower temperature keeps it closer to the most likely word.

The _pick_word function uses the temperature hyperparameter to sample the next word, instead of always choosing the word with the highest score.

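def _pick_word(probabilities, temperature):
    """
    Pick the next word in the generated text
    :param probabilities: (np.ndarray) probabilities of the next word
    :param temperature: (float) sampling temperature, higher values give more diverse picks
    :return: (int) vocabulary index of the predicted word
    """
    # Rescale the log-probabilities by the temperature and renormalize
    probabilities = np.log(probabilities) / temperature
    exp_probs = np.exp(probabilities)
    probabilities = exp_probs / np.sum(exp_probs)
    pick = np.random.choice(len(probabilities), p=probabilities)
    # Index 1 is <unknown>; resample until a real word is picked
    while int(pick) == 1:
        pick = np.random.choice(len(probabilities), p=probabilities)
    return pick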
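
To see what the temperature does to a distribution, here is a small sketch that applies the same rescaling to a made-up 3-word distribution. The apply_temperature helper and the probabilities are assumptions for illustration; the sampling step is left out.

```python
import math


def apply_temperature(probabilities, temperature):
    """Rescale log-probabilities by the temperature and renormalize."""
    scaled = [math.log(p) / temperature for p in probabilities]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up next-word probabilities for three candidate words.
probs = [0.7, 0.2, 0.1]

cold = apply_temperature(probs, 0.2)     # sharper: the top word dominates
neutral = apply_temperature(probs, 1.0)  # unchanged distribution
hot = apply_temperature(probs, 2.0)      # flatter: rarer words get a chance

for name, dist in [("0.2", cold), ("1.0", neutral), ("2.0", hot)]:
    print(name, [round(p, 3) for p in dist])
```

At a low temperature almost all of the probability mass moves to the most likely word, while a high temperature spreads it out over the alternatives.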

The generate function uses the _pick_word function to generate scripts from an initial input string given by the user.

def generate(model, start_seq, vocab, length=100, temperature=1.0):
    model.eval()
    tokens = vocab.clean_text(start_seq)
    tokens = vocab.tokenize(tokens)
    # create a left-padded sequence (batch_size=1) that ends with the start tokens
    current_seq = np.full((1, model.seq_length), vocab['<pad>'])
    for idx, token in enumerate(tokens):
        current_seq[-1][idx - len(tokens)] = vocab[token]
    predicted = tokens
    for _ in range(length):
        if train_on_gpu:
            current_seq = torch.LongTensor(current_seq).cuda()
        else:
            current_seq = torch.LongTensor(current_seq)
        hidden = model.init_hidden(current_seq.size(0))
        output, _ = model(current_seq, hidden)
        # convert the scores of the last time step into probabilities
        p = torch.nn.functional.softmax(output, dim=1).data
        if train_on_gpu:
            p = p.cpu()
        probabilities = p.numpy().squeeze()
        word_i = _pick_word(probabilities, temperature)
        # retrieve that word from the dictionary
        word = vocab[int(word_i)]
        predicted.append(word)
        # the generated word becomes the end of the next "current sequence" and the cycle can continue
        current_seq = current_seq.cpu().data.numpy()
        current_seq = np.roll(current_seq, -1, 1)
        current_seq[-1][-1] = word_i
    gen_sentences = ' '.join(predicted)
    gen_sentences = vocab.add_punctuation(gen_sentences)
    return gen_sentences
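
The way the input window is padded and then shifted after every prediction can be sketched on its own. The ids below are hypothetical, the window is shorter than the real sequence length, and the "sampled" ids are hard-coded instead of coming from the model.

```python
import numpy as np

PAD_ID = 0      # '<pad>' is the first token added to the vocabulary
seq_length = 5  # illustrative; shorter than the value used for training

# Hypothetical ids of the start sequence, written into the end of the window.
prompt_ids = [7, 8]
window = np.full((1, seq_length), PAD_ID)
window[0, -len(prompt_ids):] = prompt_ids
print(window)

# Hard-coded stand-ins for the ids the model would sample.
for next_id in [9, 10]:
    # Shift everything one position to the left ...
    window = np.roll(window, -1, axis=1)
    # ... and overwrite the wrapped-around last slot with the new word.
    window[0, -1] = next_id
    print(window)
```

The oldest token falls off the left edge on every step, so the model always sees the most recent seq_length tokens.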

Hyperparameters

Feel free to play around with the hyperparameters to train your own model.

  • Epochs: Number of times the model should go through the training data. (3–5 epochs should be good enough for testing)
  • Batch Size: Number of sequences in one batch of data (depends on the size of the model, the sequence length and the RAM available)
  • LSTM Size: Number of neurons in the LSTM layer
  • Sequence Length: Length of one training sequence from the text dataset
  • LSTM Layers: Number of LSTM layers in the model
  • Bidirectional: To enable bidirectional training on the input sequences
  • Embedding Size: Size of the embeddings for words in our vocabulary
  • Dropout: Probability of dropping neurons to prevent overfitting
  • Learning Rate: Initial learning rate for our optimizer

data_path = 'data/rick_and_morty.txt'
checkpoint_dir = 'checkpoints/'
epochs = 14
batch_size = 256
lstm_size = 256
seq_length = 20
num_layers = 2
bidirectional = False
embeddings_size = 300
dropout = 0.5
learning_rate = 0.001

with open(data_path, 'r') as f:
    text = f.read()
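
To get a feeling for how these hyperparameters affect the size of the model, here is a rough parameter count. It is only a sketch: the vocabulary size of 10,000 is a hypothetical placeholder, the lstm_parameter_count helper is our own, and the linear layer is assumed to take lstm_size inputs (the unidirectional case). The LSTM formula follows PyTorch's layout of two weight matrices and two bias vectors per layer and direction.

```python
def lstm_parameter_count(input_size, hidden_size, num_layers, bidirectional=False):
    """Count the weights and biases of a stacked LSTM laid out like nn.LSTM."""
    num_directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers after the first read the previous layer's output.
        layer_input = input_size if layer == 0 else hidden_size * num_directions
        # 4 gates, each with input weights, recurrent weights and two biases.
        per_direction = 4 * hidden_size * (layer_input + hidden_size + 2)
        total += per_direction * num_directions
    return total


vocab_size = 10000  # hypothetical; use len(vocab) for the real number
embeddings_size, lstm_size, num_layers = 300, 256, 2

embedding_params = vocab_size * embeddings_size
lstm_params = lstm_parameter_count(embeddings_size, lstm_size, num_layers)
linear_params = lstm_size * vocab_size + vocab_size  # weights + bias

print("Embedding: {:,}".format(embedding_params))
print("LSTM:      {:,}".format(lstm_params))
print("Linear:    {:,}".format(linear_params))
print("Total:     {:,}".format(embedding_params + lstm_params + linear_params))
```

With a vocabulary of this size, the embedding and linear layers hold most of the parameters, so the vocabulary size matters at least as much as the LSTM size.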

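If you’re not using word2vec, use the following code to create a vocabulary from the text corpus and leave the embedding layer randomly initialized:

vocab = Vocabulary()
vocab.add_text(text)
embeddings = None  # no pre-trained vectors; the embedding layer is learned during training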

Use the following line to see what your vocabulary looks like:

print(vocab)

Build the MortyFire model with the hyperparameters set above:

model = MortyFire(vocab_size=len(vocab), lstm_size=lstm_size, embed_size=embeddings_size, seq_length=seq_length,
                  num_layers=num_layers, dropout=dropout, bidirectional=bidirectional, train_on_gpu=train_on_gpu,
                  embeddings=embeddings)
if train_on_gpu:
    model.cuda()
print(model)

Let’s begin training MortyFire:

if not os.path.isdir(checkpoint_dir):
    os.makedirs(checkpoint_dir)

dataset = RickAndMortyData(text=text, seq_length=seq_length, vocab=vocab)
data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
writer = SummaryWriter()

# Only optimize parameters that are not frozen (e.g. pre-trained embeddings)
parameters = [param for param in model.parameters() if param.requires_grad == True]
optimizer = torch.optim.Adam(parameters, lr=learning_rate)
criterion = nn.CrossEntropyLoss()

losses = []
batch_losses = []
global_step = 0

print("\nInitializing training...")
for epoch in range(1, epochs + 1):
    print("Epoch: {:>4}/{:<4}".format(epoch, epochs))
    model.train()
    hidden = model.init_hidden(batch_size)
    for batch, (inputs, labels) in enumerate(data_loader):
        labels = labels.reshape(-1)
        # Skip the last, incomplete batch
        if labels.size()[0] != batch_size:
            break
        # Detach the hidden state so gradients do not flow into earlier batches
        h = tuple([each.data for each in hidden])
        model.zero_grad()
        if train_on_gpu:
            inputs, labels = inputs.cuda(), labels.cuda()
        output, h = model(inputs, h)
        loss = criterion(output, labels)
        loss.backward()
        # Clip gradients to guard against exploding gradients
        nn.utils.clip_grad_norm_(model.parameters(), 5)
        optimizer.step()
        hidden = h
        losses.append(loss.item())
        batch_losses.append(loss.item())
        if batch % 100 == 0:
            print("step [{}/{}]\t\tloss: {:.4f}".format(batch, len(dataset) // batch_size, np.average(batch_losses)))
            writer.add_scalar('loss', loss.item(), global_step)
            batch_losses = []
        global_step += 1

    print("\n----- Generating text -----")
    for temperature in [0.2, 0.5, 1.0]:
        print('----- Temperature: {} -----'.format(temperature))
        print(generate(model, start_seq='rick:', vocab=vocab, temperature=temperature, length=100))
        print()

    torch.save(model.state_dict(),
               os.path.join(checkpoint_dir, "mortyfire-{}-{:.4f}.model".format(epoch, np.average(losses))))
    losses = []

writer.close()
print("\nSaving model [{}]".format('mortyfire'))
torch.save(model.state_dict(), 'mortyfire')
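
The training loop clips the gradients with nn.utils.clip_grad_norm_ before every optimizer step. The idea behind clipping by the global norm can be sketched in NumPy as follows. This is an illustration of the technique, not PyTorch's implementation, and the clip_by_global_norm helper and the gradient values are made up.

```python
import numpy as np


def clip_by_global_norm(gradients, max_norm):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    # Never scale up; the small constant avoids a division by zero.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in gradients], total_norm


# Made-up gradients for two parameter tensors.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print("norm before: {:.2f}".format(norm_before))
print("norm after:  {:.2f}".format(norm_after))
```

Because every gradient is scaled by the same factor, the direction of the update is preserved and only its length is limited, which helps keep LSTM training stable.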

Script Generation:

model_path = 'mortyfire'
model.load_state_dict(torch.load(model_path))
start_sequence = "rick: (burps)"
temperature = 0.8
script_length = 1000
script = generate(model, start_seq=start_sequence, vocab=vocab, temperature=temperature, length=script_length)
print('----- Temperature: {} -----'.format(temperature))
print(script)

The model we trained produced some interesting (if slightly crass) scripts. Here’s an example:

----- Temperature: 0.8 -----
rick: ( burps ) 
morty: fuck you, grandpa rick! you're an asshole. you know, y - y - y - you're just a good thing, morty, let's the victim dance, 
jerry: well, i m sorry, 
tammy: you're gonna have more more learning, no? what did you think? i mean, i don't mean what? i don't just have, you know? 
[morty walks out a laser out of the train and gets to the door. he walks out in the wall and fires a button a summer.] 

rick: that is a little of the chillest, rick.
rick: you think that s the way of this? you know what? i mean, you think i'm doing you're even doctor. rick's castle - dor schplern. 
[int. planet station, corridor - day] principal is wearing morty.)

morty: morty, that s too, about a perfect - burp, tiny - b - - you never to please!
[beth starts the portal gun at a cage of the bathroom] 
rick: whoa! 
rick: hello, yeah, rick, you're like a little minimum's daughter, but you re getting, the fact have been come. 
beth: oh, geez, jerry! don't you even say that all of your name? 
morty: yeah, i think i'm a sister. 
morty: no, i - i don t die! 
summer: summer! the crowd jericole a little drive! 
rick: ( sarcastic ) is he is about a bit! i gotta a bet in my birthday. 
jerry: looks up the bastard. ahh, a good, morty. morty takes a button. 
rick: jesus, don t worry. 
( he passes a button that )
morty: they did you. where do you doing that? 
rick: wh - wh - what - that's what makes wanted to come on, rick? 
summer: dad, the word of your sister, i'll get my grandkids, but i'm gonna go to shoot! 
rick: you know, i know. i mean, i can t believe, but what the hell? i can i just abandoned you! 
summer: you don t be here. jerry is staring by 
 [ext. jerry's room - flashback] 
a golden warrior trapped together in the wall. the presidential breaks to the living dome. luke is standing at her head. 
beth: dammit - - what are you doing? 
beth: no two. i m going to be sick. i ll handle the watch of your fruit. 
sandy: morty! 
morty: you, you know what you think you can get a drink? 
morty: uh, yeah, i'm not gonna say it's got, morty. how long do we check at me? 
morty: we're a list - - and - - 
summer: i didn't do, i m like your eyes. 
rick: oh, it is a little bit. work is, you're right over, you know? 
rick: no, summer. that's a bad built, morty... 
beth: that's this, i'm mr. meeseeks! 
lincoler: that is there, i'm not it call, but you're gonna get you to help me into a pair. 
summer: what? 
morty: no, you have. 
morty: ( hugging ) you shouldn t believe - that is really much and stuff, 
jerry: dad. you're, you know what i think i might morty? 
jerry: yeah, i can - - i m looking! 
summer: i don t know it was a gun on. i don't know that, morty, i ve been a lot and old, 
jerry: i'm gonna have that, but i'm up it! 
morty: it's okay, 

[rick and morty 4 and face]

summer: jerry, beth, i told him,

summer: [groaning]

jerry: morty! go on, take, look, rick, ( laughs ) and then wants. " s your job? [chuckles] " king? " i did do you re good!

morty: [sneezes] oh! god,

rick: i like, we're going to go!

morty: there's pretty, i'm sorry. i m listening to sleep the mailman of you.

morty: shut at me. fine. i always last.

morty: well, why this are a human tree i ve been to do summer!

rick: what - a - - of course, you like me, morty.

rick: what are you gonna do grandpa?

beth: yeah, i know it s once he like.

The scripts don’t completely make sense. This can be attributed to the fact that the text corpus we’re training on is tiny.

Rick and Morty has only 3 seasons with about 30 episodes, giving us a total of about 15,000 lines of data. This is nowhere near the amount of data required for a language-based neural network to produce coherent results.

Still, given the small size of our data set and our very primitive RNN, the results are impressive!

Let’s hope several more seasons of Rick and Morty come out soon so we have something to watch and more data to train on.

Till then, wubbalubba dub dub, ya crazies!

[Image [2] source: https://i.imgur.com/1nEkmpP.png]

Thank you for reading. Constructive criticism welcome. To know more about DSC Manipal, check out our fb page at DSC Manipal.