How to Develop a Character-Based Neural Language Model

Source: Deep Learning on Medium

Recently, the use of neural networks in the development of language models has become very popular, to the point that it may now be the preferred approach. The use of neural networks in language modeling is often called Neural Language Modeling, or NLM for short. Neural network approaches achieve better results than classical methods, both as standalone language models and when incorporated into larger systems for challenging tasks like speech recognition and machine translation. A key reason for the leap in performance may be the method’s ability to generalize. Nonlinear neural network models address several shortcomings of traditional language models: they allow conditioning on increasingly large context sizes with only a linear increase in the number of parameters, they alleviate the need for manually designing back-off orders, and they support generalization across different contexts.

A language model predicts the next word in the sequence based on specific words that have come before it in the sequence. It is also possible to develop language models at the character level using neural networks. The benefit of character-based language models is their small vocabulary and flexibility in handling any words, punctuation, and other document structure. This comes at the cost of requiring larger models that are slower to train. Nevertheless, in the field of neural language models, character-based models offer a lot of promise for a general, flexible and powerful approach to language modeling.
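To make this trade-off concrete, the short snippet below (an illustration only, not part of the tutorial code) tokenizes the same line of text at the word level and at the character level; the character view copes with any word or punctuation using a small fixed alphabet, but yields much longer sequences for the model to process.

# Illustrative snippet (not part of the tutorial code): the same line as
# word-level tokens versus character-level tokens. Characters handle any
# word or punctuation with a small fixed alphabet, but produce far longer
# sequences for the model to learn from.
sentence = "From fairest creatures we desire increase,"
word_tokens = sentence.split()
char_tokens = list(sentence)
print(len(word_tokens), 'word tokens:', word_tokens)
print(len(char_tokens), 'character tokens')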

A — Preparation of text for character-based language modeling

William Shakespeare’s first sonnet, “From fairest creatures we desire increase”, is well known in the West, and we will use it to develop our character-based language model. It is short, so fitting the model will be fast, but not so short that we won’t see anything interesting. The complete 14-line sonnet we will use as source text is listed below.

From fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou contracted to thine own bright eyes,
Feed'st thy light's flame with self-substantial fuel,
Making a famine where abundance lies,
Thy self thy foe, to thy sweet self too cruel:
Thou that art now the world's fresh ornament,
And only herald to the gaudy spring,
Within thine own bud buriest thy content,
And tender churl mak'st waste in niggarding:
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee.

Copy the text and save it in a new file in your current working directory with the file name shakespeare.txt.

Data Preparation

1> LOAD & CLEAN TEXT

We must load the text into memory so that we can work with it. Below is a function named load_doc() that will load a text file given a filename and return the loaded text. Specifically, we will strip all of the newline characters so that we have one long sequence of characters separated only by white space.

from numpy import array
from pickle import dump
from keras.utils import to_categorical
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load text
raw_text = load_doc('shakespeare.txt')
print(raw_text)

# clean
tokens = raw_text.split()
raw_text = ' '.join(tokens)

2> CREATE & SAVE SEQUENCES

Now that we have a long sequence of characters, we can create the input-output sequences used to train the model. Each input sequence will be 12 characters with one output character, making each sequence 13 characters long. We can create the sequences by enumerating the characters in the text, starting at the 13th character, at index 12. For example, the first sequence is 'From fairest ', where the first 12 characters are the input and the final character (a space) is the output. Running the code below, we end up with 597 sequences of characters for training our language model.

We can then save our prepared sequences to the file char_sequences.txt in our current working directory using a small save_doc() helper function.

# organize into sequences of characters
length = 12
sequences = list()
for i in range(length, len(raw_text)):
    # select sequence of tokens
    seq = raw_text[i-length:i+1]
    # store
    sequences.append(seq)
print('Total Sequences: %d' % len(sequences))
# Total Sequences: 597

# save tokens to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

# save sequences to file
out_filename = 'char_sequences.txt'
save_doc(sequences, out_filename)

Run the example to create the char_sequences.txt file. Take a look inside; you should see something like the following:

From fairest 
rom fairest c
om fairest cr
m fairest cre
fairest crea
fairest creat
airest creatu
irest creatur
rest creature
est creatures
st creatures

Train Language Model

We will develop a neural language model for the prepared sequence data. The model will read encoded characters and predict the next character in the sequence. A Long Short-Term Memory recurrent neural network hidden layer will be used to learn the context from the input sequence in order to make the predictions.

1> LOAD DATA & ENCODING SEQUENCES

The first step is to load the prepared character sequence data from char_sequences.txt. We can use the same load_doc() function developed in the previous section. Once loaded, we split the text by newline to give a list of sequences ready to be encoded.

The sequences of characters must be encoded as integers. This means that each unique character will be assigned a specific integer value and each sequence of characters will be encoded as a sequence of integers. We can create the mapping from a sorted set of the unique characters in the raw input data. The mapping is a dictionary of character values to integer values.

# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')

chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))

Next, we can process each sequence of characters one at a time and use the dictionary mapping to look up the integer value for each character. The result is a list of integer lists. We need to know the size of the vocabulary later; we can retrieve it as the size of the dictionary mapping. Running this piece, we can see that there are 37 unique characters in the input sequence data.

sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)

# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)
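If you want to sanity-check the mapping, an optional snippet like the one below prints a few character-to-integer pairs; the exact integer values depend on the characters present in your text.

# optional: inspect a few entries of the character-to-integer mapping
# (the exact integers depend on the characters in your text)
for char, index in sorted(mapping.items(), key=lambda item: item[1])[:5]:
    print(repr(char), '->', index)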

2> SPLIT INPUTS & OUTPUTS

Now that the sequences have been integer encoded, we can separate the columns into input and output sequences of characters.

Next, we need to one hot encode each character. That is, each character becomes a vector as long as the vocabulary (37 elements) with a 1 marked for the specific character. This provides a more precise input representation for the network. It also provides a clear objective for the network to predict, where a probability distribution over characters can be output by the model and compared to the ideal case of all 0 values with a 1 for the actual next character. We can use the to_categorical() function in the Keras API to one hot encode the input and output sequences.

sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]

# one hot encode the input and output characters
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = array(sequences)
y = to_categorical(y, num_classes=vocab_size)
print(X.shape)  # (samples, time steps, features)

3> FIT MODEL

The model is defined with an input layer that takes sequences with 12 time steps and 37 features for the one hot encoded input sequences. Rather than specifying these numbers, we use the second and third dimensions of the X input data, so that if we change the length of the sequences or the size of the vocabulary, we do not need to change the model definition. The model has a single LSTM hidden layer with 75 memory cells, chosen with a little trial and error. The model has a fully connected output layer that outputs one vector with a probability distribution across all characters in the vocabulary. A softmax activation function is used on the output layer to ensure the output has the properties of a probability distribution.

# define the model
def define_model(X):
    model = Sequential()
    model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
    model.add(Dense(vocab_size, activation='softmax'))
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    # plot_model(model, to_file='model.png', show_shapes=True)
    return model

The model is learning a multi-class classification problem, therefore we use the categorical log loss intended for this type of problem. The efficient Adam implementation of gradient descent is used to optimize the model, and accuracy is reported at the end of each training epoch. The model is fit for 100 training epochs, again found with a little trial and error. Running this prints a summary of the defined network as a sanity check.

# define model
model = define_model(X)
# fit model
model.fit(X, y, epochs=100, verbose=2)

If you uncomment the plot_model() line in define_model() (this requires the pydot and graphviz packages), a plot of the defined network is also saved to the file model.png.

4> SAVE MODEL

After the model is fit, we save it to file for later use. The Keras model API provides the save() function that we can use to save the model to a single file, including weights and topology information. We also save the mapping from characters to integers that we will need to encode any input when using the model and decode any output from the model.

# save the model to file
model.save('model.h5')
# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))

B — Use the trained character-based language model to generate text

Generate Text

We will use the learned language model to generate new sequences of text that have the same statistical properties.

1> LOAD DATA

The first step is to load the model saved to the file model.h5. We can use the load_model() function from the Keras API. We also need to load the pickled dictionary that maps characters to integers from the file mapping.pkl. We will use the Pickle API to load the object.

from pickle import load
from keras.models import load_model
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
# load the model
model = load_model('model.h5')
# load the mapping
mapping = load(open('mapping.pkl', 'rb'))

2> GENERATE CHARACTERS

We must provide sequences of 12 characters as input to the model in order to start the generation process. We will pick these manually. A given input sequence will need to be prepared in the same way as preparing the training data for the model. First, the sequence of characters must be integer encoded using the loaded mapping.

Next, the integers need to be one hot encoded using the to_categorical() Keras function. The LSTM requires three-dimensional input (samples, time steps, features); because we pass a single padded sequence of shape (1, 12) to to_categorical(), the encoded result already has this shape, so no explicit reshape is needed.

We can then use the model to predict the next character in the sequence. We use predict_classes() instead of predict() to directly select the integer for the character with the highest probability, instead of getting the full probability distribution across the entire set of characters. We can then decode this integer by looking up the mapping to see the character to which it maps.

This character can then be added to the input sequence. We then need to make sure that the input sequence is 12 characters by truncating the first character from the input sequence text. We can use the pad_sequences() function from the Keras API to perform this truncation operation. Putting all of this together, we can define a new function named generate_seq() that uses the loaded model to generate new sequences of text.

# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    # generate a fixed number of characters
    for _ in range(n_chars):
        # encode the characters as integers
        encoded = [mapping[char] for char in in_text]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # one hot encode
        encoded = to_categorical(encoded, num_classes=len(mapping))
        # predict character
        yhat = model.predict_classes(encoded, verbose=0)
        # reverse map integer to character
        out_char = ''
        for char, index in mapping.items():
            if index == yhat:
                out_char = char
                break
        # append to input
        in_text += out_char
    return in_text
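Note that predict_classes() is only available in older Keras releases; in newer versions of Keras/TensorFlow it may no longer be provided. If that is the case for your installation, you can replace the predict_classes() call inside generate_seq() with an argmax over predict(), for example:

# replacement for the predict_classes() line in generate_seq() on newer Keras versions:
# predict() returns the probability distribution; argmax picks the most likely character index
from numpy import argmax
yhat = argmax(model.predict(encoded, verbose=0), axis=-1)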

Running the example generates three sequences of text. The first is a test to see how the model does at starting from the beginning of the sonnet. The second is a test to see how well it does when starting part-way through the text. The final example is a test to see how well it does with a sequence of characters never seen before.

# test start of rhyme
print(generate_seq(model, mapping, 12, 'From fairest', 70))
# test mid-line
print(generate_seq(model, mapping, 12, 'Making a famine', 70))
# test not in original
print(generate_seq(model, mapping, 12, 'hello worl', 70))

We can see that the model did very well with the first two examples, as we would expect. We can also see that the model still generated something for the new text, but it is nonsense.


Summary

In this post, you learned how to develop a character-based neural language model. Specifically, you learned:

  • How to prepare text for character-based language modeling.
  • How to develop a character-based language model using LSTMs.
  • How to use a trained character-based language model to generate text.

That’s it for today. Source code can be found on GitHub. I am happy to hear any questions or feedback. Connect with me on LinkedIn.