Teaching LSTMs to play God

Source: Deep Learning on Medium

Photo by Aaron Burden on Unsplash

At the risk of writing yet another “Generate text with an RNN” tutorial, and with Shakespeare and Lewis Carroll presumably cringing in their graves at the kind of text these vanilla models normally generate, I decided to have a go at it anyway, but by a different route.

The Holy Bible is the best-selling and most influential book in human history. Vast treasure troves of human thought, exposition, and creation have been centred around this book.

Which begs the question: can a recurrent neural network be trained well enough to generate sermons?

In this toy example, we will walk through some relatively simple code to attempt this. Our expectations aren’t high, as we’ll be using a very simple architecture. The goal of this post is to learn how to do text generation with RNNs in TensorFlow. I do assume some background in training neural networks in TensorFlow, but only to the point that I won’t explain how the loss function or the softmax is implemented, for example.

Versions: Python 3.6 and TensorFlow 1.15.0.

Let’s get started!

DATA

Since nothing in deep learning proceeds without data, let’s get some. Project Gutenberg has some marvellous books for free public use [how ironic, since the Gutenberg Bible was the first printed text with movable type]. We’ll get the King James Version here.

PREPROCESSING

The structure of the Bible is simpler than that of most books. Even so, we may encounter some weird characters that we don’t want fed into our model. Let’s have a look at the data:

We don’t want our model to learn characters like ‘\’, ‘{’, ‘\n’. In fact, let’s look at the unique characters in the text:

print(sorted(set(data)))

The numbers are important, since they indicate the verses, and most of the punctuation is fine too, but we can live without the following:

That’s better. There are 48 unique characters in this text.
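The kind of clean-up described above can be sketched as follows. This is only an illustration: the `UNWANTED` set and the helper name `clean_text` are assumptions, not the exact list used for this post, and lowercasing is assumed because the generated samples further down are all lowercase.

```python
# Hypothetical cleaning helper. UNWANTED is an illustrative guess;
# inspect sorted(set(data)) on your own copy of the text first.
UNWANTED = set('\\{}[]*#@/_\n\r\t\ufeff')

def clean_text(raw):
    """Lowercase the text and replace unwanted characters with a space."""
    lowered = raw.lower()
    cleaned = ''.join(' ' if ch in UNWANTED else ch for ch in lowered)
    # Collapse the runs of whitespace left behind by the replacements.
    return ' '.join(cleaned.split())

sample = "1:1 In the beginning\nGod created {the} heaven*"
data = clean_text(sample)
unique_chars = sorted(set(data))
print(data)
print(len(unique_chars), "unique characters")
```

On the full text, `len(unique_chars)` is the number to compare against the count quoted above.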

We need to do two small things before we can proceed with the model. These characters cannot be fed in as they are. We need an integer encoding of them to be able to actually feed them as input arrays. Similarly, down the line, when we predict characters, we need a way to decode the integers we obtain back into characters. So we can create two dictionaries: one that holds a one-to-one mapping from each character to an integer, and one that holds the reverse mapping. Let’s do that:

`char_to_int = {char: i for i, char in enumerate(unique_chars)}
int_to_char = {i: char for i, char in enumerate(unique_chars)}`

Nothing much to explain so far. Check that you have the following mappings:

Char to int
int to char
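A quick way to check the two mappings is an encode/decode round trip. The sketch below uses a short stand-in string instead of the full text, so the actual integers will differ from the ones you get on the Bible.

```python
# Stand-in text; on the real data, unique_chars comes from the cleaned Bible.
text = "in the beginning"
unique_chars = sorted(set(text))

char_to_int = {char: i for i, char in enumerate(unique_chars)}
int_to_char = {i: char for i, char in enumerate(unique_chars)}

# Encode every character as an integer, then decode back again.
encoded = [char_to_int[ch] for ch in text]
decoded = ''.join(int_to_char[i] for i in encoded)

print(encoded)
print(decoded)
```

If `decoded` equals the original text, the two dictionaries are consistent inverses of each other.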

BATCH GENERATION

A lot of deep learning training involves making decisions about the size of the data, its shape, its structure, and so on. The text is too big to be fed in at once, and most problems in the real world involve sizes orders of magnitude bigger than the text we’re processing at the moment. Training in batches isn’t a nice-to-have. It’s necessary.

In particular, RNNs train using backpropagation through time [BPTT], which is basically the traditional backpropagation unrolled over each time step. A nice primer is this.

Since we can’t apply BPTT on the entire text, we apply it on the batches we generate. How do we generate these batches? It’s mostly a design question, but I’ve implemented the following procedure: [You can skip this part if you’re interested in just the code].

(1) Divide the entire text into 16 blocks.

(2) Each block contains sequences of characters. We choose 256 as our sequence size.

(3) Each batch i we create contains the ith sequence from each block. This means each batch contains 16 sequences, each of size 256: Batch 1 has the first sequence of 256 characters from Block 1, Block 2, …, Block 16, and the same holds for Batch 2, Batch 3, and so on. How many batches do we have? For n characters in total, standard middle school math gives:

n = batch_size * sequence_size * no_of_batches

Of course, n won’t always divide evenly; it depends on the values involved. For example, in our case, n = 4233042, batch_size = 16, sequence_size = 256 and no_of_batches = 1034. But if you look carefully, the last batch cannot have sequences of size 256, only of a smaller size [take out a pen and paper and try to figure out what this value will be], because we run out of characters by the time we get to the last batch.

We can just drop this last batch to avoid shape mismatch issues later. We now have no_of_batches = 1033 instead, with all the arrays nicely shaped at (16, 256).

Okay, in summary, 1033 batches, each batch containing 16 sequences, each sequence 256 characters long.
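The batch arithmetic can be checked in a few lines. This sketch assumes the text is first cut into `batch_size` equal blocks of `n // batch_size` characters, with any leftover characters at the very end ignored. That is one reasonable reading of the procedure above, not necessarily the exact implementation.

```python
n = 4233042          # total number of characters
batch_size = 16      # number of blocks, i.e. sequences per batch
sequence_size = 256  # characters per sequence

# Assumption: equal-sized blocks, trailing characters dropped.
block_size = n // batch_size

full_batches = block_size // sequence_size   # batches with full-length sequences
leftover = block_size % sequence_size        # sequence length in the short last batch
total_batches = full_batches + (1 if leftover else 0)

print("full batches:", full_batches)    # 1033
print("total batches:", total_batches)  # 1034
print("last batch sequence length:", leftover)
```

Run it to check your pen-and-paper answer for the size of the last, dropped batch.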

By the way, this entire process I’ve described has a name: Truncated Backpropagation Through Time. Lots of details here.

Here’s the code to do all the stuff I’ve rambled on about:

Batch generation
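As a rough, framework-free sketch of the three steps above (blocks, sequences, then the ith sequence from each block), something like the following would do. The function name `make_batches` and the use of plain Python lists are assumptions for illustration; real code would more likely use NumPy arrays.

```python
def make_batches(encoded, batch_size, sequence_size):
    """Split an encoded text into batches of shape (batch_size, sequence_size).

    Batch i holds the ith sequence of every block. The short last batch,
    if any, is dropped so that all batches have the same shape.
    """
    block_size = len(encoded) // batch_size
    blocks = [encoded[b * block_size:(b + 1) * block_size]
              for b in range(batch_size)]

    no_of_batches = block_size // sequence_size  # integer division drops the remainder
    batches = []
    for i in range(no_of_batches):
        start = i * sequence_size
        batches.append([block[start:start + sequence_size] for block in blocks])
    return batches

# Tiny example: 50 "characters", 2 blocks, sequences of length 10.
batches = make_batches(list(range(50)), batch_size=2, sequence_size=10)
print(len(batches))   # 2
print(batches[0][1])  # first sequence of the second block
```

Because consecutive batches continue where the previous one stopped inside each block, the state of the network can be carried from one batch to the next.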

The next question is, how do we create the input and target values? This is simple. Consider the following example:

X -> the dog.

The target for this would be:

Y -> he dog.

Every character i in the target vector is the (i + 1)th character in the input vector. Notice how Y is one character shorter than X. This is because when you reach the last character of X, there’s nothing left to predict. Thus, we can simply remove the last character in X. This small observation is important, because the final shapes of both X and Y will be (16, 255):

Creating the Dataset
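The shift by one character can be sketched like this. The helper name `make_inputs_and_targets` is made up for this illustration, and the example works on strings for readability; on the real data the sequences would be lists of integers.

```python
def make_inputs_and_targets(batch):
    """For every sequence, inputs drop the last item and targets drop the first."""
    inputs = [seq[:-1] for seq in batch]
    targets = [seq[1:] for seq in batch]
    return inputs, targets

# One "batch" holding a single sequence, as in the example above.
X, Y = make_inputs_and_targets(["the dog."])
print(X)  # ['the dog']
print(Y)  # ['he dog.']
```

Applied to a batch of 16 sequences of 256 characters, both results have 16 sequences of 255 characters.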

Done! We’re now ready to build our model.

ARCHITECTURE AND TRAINING

We will choose a simple architecture: two hidden LSTM layers stacked in one MultiRNNCell, with each LSTM cell containing 256 hidden units, and a softmax output layer of k units, where k is the number of unique characters in our data [Makes sense, right?].

That’s it!

Model Architecture

This should be pretty self-explanatory, except for the part where I’ve added one-hot encoding for the inputs and labels. Notice that this transforms the shapes from (16, 255) to (16, 255, k), with k = 48 in our case. I chose a small number of epochs to check whether the training losses behave as they should, that is, decrease gradually. You can always play around with these hyperparameters later on.
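To make the shape change concrete, here is a plain-Python sketch of one-hot encoding. In the model itself this is a tensor operation (TensorFlow provides `tf.one_hot` for it); the helpers `one_hot` and `one_hot_batch` below are hypothetical and exist only to show what happens to the shapes.

```python
def one_hot(index, depth):
    """Return a vector of length depth with a 1.0 at position index."""
    vec = [0.0] * depth
    vec[index] = 1.0
    return vec

def one_hot_batch(batch, depth):
    """Turn a (batch, time) grid of integers into a (batch, time, depth) grid."""
    return [[one_hot(i, depth) for i in seq] for seq in batch]

k = 4                           # toy vocabulary size
batch = [[0, 2, 3], [1, 1, 0]]  # shape (2, 3)
encoded = one_hot_batch(batch, k)

print(len(encoded), len(encoded[0]), len(encoded[0][0]))  # 2 3 4
print(encoded[0][1])                                      # [0.0, 0.0, 1.0, 0.0]
```

Every integer becomes a vector of length k, which is why a trailing dimension of size k appears.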

Let’s train our model:

Train for 5 epochs

Notice that you need to feed the final state from each training step t as the initial state for step (t + 1). This is crucial.
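The state hand-off has the following general shape. This is a framework-agnostic sketch: `run_epoch`, `step_fn` and `toy_step` are hypothetical names, and `toy_step` is a stand-in for the real training step. In a TensorFlow 1.x session, the hand-off would typically mean fetching the final state from one `run` call and feeding it back in through `feed_dict` on the next.

```python
def run_epoch(batches, step_fn, zero_state):
    """Run one epoch, carrying the recurrent state from batch to batch."""
    state = zero_state  # start each epoch from a blank state
    losses = []
    for inputs, targets in batches:
        # The state returned for this batch is the initial state of the next one.
        loss, state = step_fn(inputs, targets, state)
        losses.append(loss)
    return sum(losses) / len(losses), state

def toy_step(inputs, targets, state):
    """Stand-in for a training step: the 'state' just counts items seen so far."""
    new_state = state + len(inputs)
    return 1.0 / (1 + new_state), new_state

toy_batches = [([1, 2], [2, 3]), ([3, 4], [4, 5])]
mean_loss, final_state = run_epoch(toy_batches, toy_step, zero_state=0)
print(mean_loss, final_state)
```

Resetting the state at the start of every epoch, as done here, is a design choice; the important part is that it is not reset between consecutive batches.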

We get the following loss curve:

Train Loss per epoch

Nice elbow-like behaviour. The losses are: [1.54, 1.16, 1.10, 1.07, 1.05]

GENERATE WORDS OF GOD

Time for the real fun. Let’s get this simple model to generate some text:

We’ll provide a start sequence, and ask the model to predict 256 characters after the start. Since our softmax returns the probability of selecting each of the characters, we have some flexibility in determining which character to choose. Always choosing the character with the maximum probability makes the model repeat itself, behaving like an infinite loop that prints the same value over and over again. Instead, we sort the probabilities, take the five biggest values, renormalise, and then randomly select from among these five. This introduces a stochasticity that produces better results:

Words of God
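The “top five, renormalise, sample” step can be sketched with the standard library alone. The function name `sample_top_k` and the example probabilities are made up for illustration; in the real model, `probabilities` would be the softmax output for the next character.

```python
import random

def sample_top_k(probabilities, k=5, rng=random):
    """Sample an index from the k most probable entries, after renormalising."""
    # Indices of the k largest probabilities.
    top = sorted(range(len(probabilities)),
                 key=lambda i: probabilities[i], reverse=True)[:k]
    weights = [probabilities[i] for i in top]
    total = sum(weights)
    renormalised = [w / total for w in weights]  # now sums to 1
    return rng.choices(top, weights=renormalised, k=1)[0]

probs = [0.05, 0.40, 0.01, 0.30, 0.02, 0.10, 0.08, 0.04]
rng = random.Random(0)  # seeded only to make the demo repeatable
print([sample_top_k(probs, k=5, rng=rng) for _ in range(10)])
```

The sampled index is then mapped back to a character with `int_to_char` and appended to the running sequence before the next prediction.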

Let’s look at what sermons our LSTMs spit out:

and god and the land that the lord,at there are will i will shall tentsouss which the lord, and shall bath and all the lord, which are the wilderness.2:22for they shall cause the people: and with that with helpet was soul of his fields: and it nations, which they

Hilarious! Here’s some more:

and god hath seen minister unto them.11:36 they that forsake them as a child of the congregation: the lord hath done it, and wine it: for the lord your god hath done.1:2 but i answered thee for me; the diviting of the lord god of heart thou sayest by thy light in

and god with the light.11:11 and i said unto him, this marmer the land wourd in me, why have we have not believed you, and in the midst of the enemy;2:12 but thus saith, master, i will set them all, and send to thee, and setthem inthe way of the congregation, that

Almost all of it is semantic nonsense, but note that the model had no information about what words, letters or numbers are, what punctuation is, what structure is, or what language is, for that matter. Even after only 5 epochs of training, we can see some interesting results. It has learnt that a lot of text begins with digits, for example. It does add the right structure of digits and colons before a new sermon. It has also learnt to put in punctuation here and there, and it mostly gets words right, as far as the vocabulary is concerned.

Let’s try a different start sequence, just for fun. Here’s what our model came up with for ‘jerusalem’:

jerusalem: but the levite, whom itshall be searated and shall put it into curse.5:16 i have set up a parable of thee, but is the land, and on them: and there is no bringing the service ofhim.61:11 they are as the holy ghost, of whom the word oftheir fathers.21:26 t

jerusalem, and the throne, and the the three house of the horse, and the priest’s servant tola which all things are come to the saying.12:4 for the waters shall be on the sepulling of theshout of the house, of the lord, their charges of them.12:1 and they which are

jerusalem, and they also the father.1:10 and there was strength against the lord, and the sameone,what is the word of the lord, and the hangings of idols: and the saying will i serve their own labour.1:11 i have death in this take upon the flesh from this land in s

Interesting: it has pretty much learnt that jerusalem is a unique entity that isn’t joined with other characters.

Try this out yourself for different start sequences!

IMPROVEMENTS AND CLOSING COMMENTS

Improvements can be brought about in many ways. Tune the hyperparameters, specifically try increasing the epochs, slightly decreasing the learning rate, increasing the number of hidden units, and all combinations of these. One interesting hyperparameter that can be introduced is temperature, which determines how conservative/diverse the model choices are in picking the next characters. It’s explained nicely here.
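For a feel of what temperature does, here is a small sketch that rescales a probability vector. It is not part of the model described in this post; `apply_temperature` is a hypothetical helper, and it works on probabilities by taking logs, dividing by the temperature and re-applying the softmax.

```python
import math

def apply_temperature(probabilities, temperature=1.0):
    """Rescale a probability vector: T < 1 sharpens it, T > 1 flattens it."""
    # Clamp to avoid log(0), then divide the log-probabilities by T.
    logits = [math.log(max(p, 1e-12)) / temperature for p in probabilities]
    # Softmax, shifted by the maximum for numerical stability.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.7, 0.2, 0.1]
print(apply_temperature(probs, 0.5))  # more conservative: favours the top choice
print(apply_temperature(probs, 1.0))  # unchanged
print(apply_temperature(probs, 2.0))  # more diverse: closer to uniform
```

Sampling from the rescaled vector instead of the raw softmax output is what makes the generated text more conservative or more adventurous.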

I conclude with a question: if, given enough time, enough data, and a smart model, we can generate sermons that are indistinguishable from human-written ones [as we have done for paintings], can we program God? But if we could, haven’t we run into a paradox? I’d love a discussion on this.

Happy Deep Learning!

Code: https://github.com/rwiddhic96/LSTMS_God

References:

  1. https://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html
  2. https://www.tensorflow.org/tutorials/text/text_generation