Build your own WhatsApp text generator (and learn all about language models)

A practical end-to-end Deep Learning NLP example

A normal conversation on WhatsApp, but all is not what it seems. The people are real; the chat is fake. It was generated by a language model trained on a real conversation history. In this post I will take you through the steps to build your own version using the power of recurrent neural networks and transfer learning.

Requirements

I have used the fastai library inside Google Colab, Google's free research tool for data science. This means very little time (and no money) spent getting set up. All you need to build your own model is the code laid out in this post and the following:

  • A device to access the internet
  • A Google account
  • WhatsApp chat histories

I discuss some theory and delve into some of the source code, but for more detail there are various links to academic papers and documentation. If you want to learn more I also strongly recommend you look into the excellent fastai course.

Drive and Colab initial set up

First, we’re going to create a space on Google Drive for your notebook. Go to Drive, click “New”, and create a folder with a suitable name (I used “whatsapp”).

Then go into your new folder, click “New” again, open up a Colab notebook, and give it a suitable name.

Finally, we want to enable a GPU for the notebook. This will speed up the training and text generation process significantly (GPUs are more efficient than CPUs for matrix multiplications, the main computation under the hood in neural networks).

Click “Runtime” from the top menu, then “Change runtime type”, and select “GPU” for the hardware accelerator.
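
If you want to confirm the GPU is actually available to the notebook, a quick optional check (not part of the original walkthrough) is:

## optional sanity check that the Colab runtime has a GPU attached
import torch
print(torch.cuda.is_available())  # should print True with the GPU runtime selected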

WhatsApp data

Now let’s get some data. The more the better, so you’ll want to pick a chat with a reasonably long history. Also, explain what you’re doing to anyone else involved in the conversation and get their permission first.

To download the chat, open it in WhatsApp and tap options (the three vertical dots in the top right), then select “More”, “Export chat” and “Without media”. If you have Drive installed on your mobile device, you should have an option to save the export straight to your newly created folder; otherwise, save the file and add it to Drive manually.

Data preparation

Back to the notebook. Let’s start by updating the fastai library.

!curl -s https://course.fast.ai/setup/colab | bash

Then we run some standard magic commands and import three libraries: fastai.text (for the model), pandas (for data preparation) and re (for regular expressions).

## magic commands
%reload_ext autoreload
%autoreload 2
%matplotlib inline
## import required packages
from fastai.text import *
import pandas as pd
import re

We want to link this notebook to Google Drive in order to use the data we just exported from WhatsApp and to save any models we create. Run the following code, go to the provided link, select your Google account, and copy the authorization code back into your notebook.

## Colab google drive stuff
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"
base_dir = root_dir + 'whatsapp/'

We have some cleaning to do, and the data is currently a raw .txt file, which is not ideal. So here’s a function to take the text file and convert it into a pandas dataframe with one row for each chat entry, along with a timestamp and the name of the sender.

## function to parse the whatsapp extract file
def parse_file(text_file):
    '''Convert WhatsApp chat log text file to a Pandas dataframe.'''

    # some regex to account for messages taking up multiple lines
    pat = re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^\d\d\/\d\d\/\d\d\d\d|\Z)',
                     re.S | re.M)
    with open(text_file) as f:
        data = [m.group(1).strip().replace('\n', ' ')
                for m in pat.finditer(f.read())]

    sender = []; message = []; datetime = []
    for row in data:

        # timestamp is before the first dash
        datetime.append(row.split(' - ')[0])

        # sender is between am/pm, dash and colon
        try:
            s = re.search(' - (.*?):', row).group(1)
            sender.append(s)
        except:
            sender.append('')

        # message content is after the first colon
        try:
            message.append(row.split(': ', 1)[1])
        except:
            message.append('')

    df = pd.DataFrame(zip(datetime, sender, message),
                      columns=['timestamp', 'sender', 'text'])

    # exclude any rows where format does not match
    # proper timestamp format
    df = df[df['timestamp'].str.len() == 17]
    df['timestamp'] = pd.to_datetime(df.timestamp,
                                     format='%d/%m/%Y, %H:%M')

    # remove events not associated with a sender
    df = df[df.sender != ''].reset_index(drop=True)

    return df

Let’s see how it works. Create the path to your data, apply the function to the chat export, and take a look at the resulting dataframe.

## path to directory with your file
path = Path(base_dir)
## parse whatsapp extract, replace chat.txt with your
## extract filename
df = parse_file(path/'chat.txt')
## take a look at the result
df[205:210]

Perfect! This is a small snippet of conversation between me and my lovely wife. One of the advantages of this format is that I can easily create a list of participant names in lower case, replacing any spaces with underscores. This will help later.

## list of conversation participants
participants = list(df['sender'].str.lower()
                    .str.replace(' ', '_').unique())
participants

In this case there are only two names, but it will work with any number of participants.

Finally, we need to think about how we want this text to be fed into our model. Normally, we would have multiple stand-alone pieces of text (e.g. Wikipedia articles or IMDB reviews), but what we have here is a single ongoing stream of text: one continuous conversation. So that’s what we create: one long string, including the sender names.

## concatenate names and text into one string
text = [(df['sender'].str.replace(' ', '_') + ' ' + df['text']).str.cat(sep = ' ')]
## show part of string
text[0][8070:8150]

Looks good. We’re ready to get this into a learner.

Learner creation

To use the fastai API we now need to create a DataBunch. This is an object that can then be used inside a Learner to train a model. In this case, it has three key inputs: the data (split into training and validation sets), labels for the data, and the batch size.

Data

To split training and validation let’s just pick a point somewhere in the middle of our long conversation string. I went for the first 90% for training, last 10% for validation. Then we can create a couple of TextList objects, and quickly check they look the same as previously.

## find the index for 90% through the long string
split = int(len(text[0])*0.9)
## create TextLists for train/valid
train = TextList([text[0][:split]])
valid = TextList([text[0][split:]])
## quick look at the text
train[0][8070:8150]

It’s worth going a little deeper on this TextList.

It is, more or less, a list of text (with just one element in this case), but let’s take a quick look at the source code to see what else is there.

Ok, a TextList is a class with a bunch of methods, and it inherits from ItemList; in other words, it’s a kind of ItemList. By all means go and look up ItemList, but I’m most interested in the “_processor” variable. The processor is a list containing a TokenizeProcessor and a NumericalizeProcessor. These should sound familiar in an NLP context:

Tokenize — process text and break it up into its individual words

Numericalize — replace those tokens with numbers that correspond to the position of the word in a vocab
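
To make those two steps concrete, here’s a tiny, made-up illustration using the fastai defaults (your exact tokens and ids will depend on the rules applied and on your data):

## hypothetical mini example of tokenizing and numericalizing
tok = Tokenizer()
tokens = tok.process_all(['Ok, see you at the station!'])
## build a small vocab from these tokens and map them to ids
vocab = Vocab.create(tokens, max_vocab=100, min_freq=1)
print(tokens[0])
print(vocab.numericalize(tokens[0]))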

Why am I highlighting this? Well, it certainly helps to understand the rules being used to process your text, and digging into this part of the source code and documentation will help you do that. But, specifically, I want to add my own new rule: the sender names in the text should be marked as similar in some way. Ideally, I’d like a token before each sender name that tells the model “this is a sender name”.

How can we do this? That’s where _processor comes in handy. The documentation tells us we can use it to pass in a custom tokenizer.

We can therefore create our own rule and pass it in with a custom processor. I still want to keep the previous defaults, so all I need to do is add my new function to the existing list of default rules and pass that list into a custom processor.

## new rule
def add_spk(x:Collection[str]) -> Collection[str]:
    res = []
    for t in x:
        if t in participants: res.append('xxspk'); res.append(t)
        else: res.append(t)
    return res

## add new rule to defaults and pass in a custom processor
custom_post_rules = defaults.text_post_rules + [add_spk]
tokenizer = Tokenizer(post_rules=custom_post_rules)
processor = [TokenizeProcessor(tokenizer=tokenizer),
             NumericalizeProcessor(max_vocab=30000)]
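
The snippets above don’t show the processor actually being attached to the data. My assumption is that the simplest way is to recreate the TextLists with the processor argument, so the custom rule is applied when the data gets processed:

## recreate the TextLists so they use the custom processor
train = TextList([text[0][:split]], processor=processor)
valid = TextList([text[0][split:]], processor=processor)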

The function adds the token ‘xxspk’ before every name.

Before processing: “…eggs, milk Paul_Solomon Ok…”

After processing: “…eggs , milk xxspk paul_solomon xxmaj ok…”

Note that I’ve applied some of the other default rules, namely identifying capitalised words (adds ‘xxmaj’ before capitalized words), and separating out punctuation.

Labels

We’re going to create something called a language model. What is this? Simple, it’s a model that predicts the next word in a sequence of words. To do this accurately the model needs to understand language rules and the context. In some ways, it needs to learn the language.

So what’s the label? Easy, it’s the next word. More specifically, in the model architecture we’re using, for a sequence of words we can create a target sequence by taking that same sequence of tokens and shifting it one word to the right. At any point in the input sequence, we can look at that same point in the target sequence and find the correct word to predict (i.e. the label).

Input Sequence: “… eggs , milk xxspk paul_solomon xxmaj …”

Label/Next Word: “ok”

Target Sequence: “… , milk xxspk paul_solomon xxmaj ok …”
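
If it helps, here’s a toy illustration of that shift (plain Python, not fastai code):

## targets are just the inputs shifted one token to the right
tokens = ['eggs', ',', 'milk', 'xxspk', 'paul_solomon', 'xxmaj', 'ok']
inputs, targets = tokens[:-1], tokens[1:]
for x, y in zip(inputs, targets):
    print(f'{x:>14} -> {y}')  # at each position the label is the next word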

We do this by using the label_for_lm method (one of the functions in TextList class above).

## take train and valid and label for language model
src = ItemLists(path=path, train=train, valid=valid).label_for_lm()

Batch size

Neural networks are trained by passing in batches of data in parallel, so the final input for the databunch is our batch size. We use 48, meaning 48 text sequences are passed through the network at a time. Each of these sequences is 70 tokens long by default.

## create databunch with batch size of 48
bs = 48
data = src.databunch(bs=bs)
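
If you want to see what the model will actually receive, you can pull a single batch out of the databunch; the shape should be roughly batch size by sequence length (48 x 70 here), though the exact numbers depend on your data:

## optional: peek at one batch of token ids and its shifted targets
x, y = data.one_batch()
print(x.shape, y.shape)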

We now have our data! Let’s create the learner.

## create learner
learn = language_model_learner(data, AWD_LSTM, drop_mult=0.3)

Fastai gives us a function to quickly create a language model learner. All we need is our data (we have it already) and an existing model architecture. The function has an argument ‘pretrained’ set to ‘True’ by default, which means we’re going to take a pre-trained language model and fine-tune it to our data.

This is called transfer learning, and I love it. Language models need a lot of data to work well but we don’t have anywhere near enough in this case. To solve this problem, we can take an existing model, trained on massive amounts of data, and fine-tune it to our text.

In this case we use an AWD_LSTM model which has been pre-trained on the WikiText-103 dataset. AWD LSTM is a language model that uses a type of architecture called a recurrent neural network. It needs to be trained on text, and in this case it has been trained on a whole load of Wikipedia data. We can look up how much.

This model has been trained on over 100m tokens from 28k Wikipedia articles with state-of-the-art performance. Sounds like a great starting point for us!

Let’s get a quick sense of the model architecture.

learn.model

I’ll break this down.

  1. Encoder — The vocab for our text will have any word that has been used more than twice. In this case that’s 2,864 words (yours will be different). Each of these words is represented using a vector of length 2,864 with a 1 in the appropriate position and all zeros elsewhere. Encoding takes this vector and multiplies it by a weight matrix to squash it down into a length 400 word embedding.
  2. LSTM cells — The length 400 word embedding is then fed into a LSTM cell. I won’t go into the detail of the cell, all you need to know is that a length 400 vector goes in to the first cell, and a length 1,152 vector comes out. Two other things worth noting: this cell has a memory (it’s remembering previous words) and the output of the cell is fed back into itself and combined with the next word (that’s the recurrent part), as well as being pushed into the next layer. There are three of these cells in a row.
  3. Decoder — The output of the third LSTM cell is a length 400 vector, this is expanded out again into a vector with the same length as your vocab (2,864 in my case). This gives us the prediction for the next word, and can be compared with the actual next word to calculate our loss and accuracy.
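
You can poke at those pieces directly to confirm the sizes described above (your vocab size will differ from mine):

## optional: inspect the encoder embedding and the decoder
print(learn.model[0].encoder)   # e.g. Embedding(2864, 400)
print(learn.model[1])           # the LinearDecoder mapping back to the vocab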

Remember that this is a pre-trained model, so wherever possible the weights are exactly as were trained using the WikiText data. This will be the case for the LSTM cells, and for any words that are in both vocabs. Any new words are initialized by the mean of all embeddings.
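
As a rough sketch of that initialization idea (purely illustrative; fastai does this internally when it converts the pretrained weights, and the toy vocabs below are made up):

import torch
## toy stand-ins: the real vocabs come from WikiText-103 and your chat
wiki_itos = ['the', 'house', 'commute', 'ok']
chat_itos = ['ok', 'commute', 'paul_solomon']        # 'paul_solomon' is a new word
pretrained_emb = torch.randn(len(wiki_itos), 400)    # stand-in for the pretrained embedding matrix
mean_emb = pretrained_emb.mean(dim=0)                # mean of all pretrained embeddings
wiki_index = {w: i for i, w in enumerate(wiki_itos)}
new_emb = torch.zeros(len(chat_itos), 400)
for i, word in enumerate(chat_itos):
    # reuse the pretrained vector where the word exists, otherwise use the mean
    j = wiki_index.get(word)
    new_emb[i] = pretrained_emb[j] if j is not None else mean_emb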

Now let’s fine-tune it with our own data so that the text it generates sounds like our WhatsApp chat and not a Wikipedia article.

Training

First, we’re going to do some frozen training. This means we only update certain parts of the model. Specifically, we’re only going to train the last layer group. We can see above that the last layer group is “(1): LinearDecoder”, the decoder. All the word embeddings and LSTM cells will remain the same during training, it’s just the final decoding stage that will be updated.

One of the most important hyperparameters is the learning rate. Fastai gives us a useful little tool to quickly find a good value.

## run lr finder
learn.lr_find()
## plot lr finder
learn.recorder.plot(skip_end=15)

The rule of thumb is to find the steepest part of the curve (i.e. the point of fastest learning). 1.3e-2 looks to be about right here.

Let’s go ahead and train for one epoch (once through all the training data).

## train for one epoch frozen
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))

At the end of the epoch we can see the loss on the training and validation sets, and the accuracy on the validation set. We’re correctly predicting 41% of the next words in the validation set. Not bad.

Frozen training is a great way to start with transfer learning, but now we can open up the entire model by unfreezing. This means the encoder and LSTM cells will now be included in our training updates. It also means that the model will be more sensitive, so we reduce our learning rate to 1e-3.

## unfreeze and train for four further cycles
learn.unfreeze()
learn.fit_one_cycle(4, 1e-3, moms=(0.8,0.7))

Accuracy is up to 44.4%. Note that the training loss is now lower than the validation loss, which is what we want to see, and the validation loss has bottomed out.

Note that you will almost certainly find that your loss and accuracy are different to the above (some conversations are more predictable than others) so I suggest you play around with the parameters (learning rates, training protocol, etc.) to try and get the best performance from your model.
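
For example (my own suggestion rather than part of the original protocol), you might add a couple more unfrozen epochs at a lower rate, and it’s worth saving the fine-tuned model so you don’t have to retrain it every session:

## optional: a further, gentler unfrozen cycle, then save/reload the model
learn.fit_one_cycle(2, 5e-4, moms=(0.8,0.7))
learn.save('whatsapp_lm')   # saved under path/models/
learn.load('whatsapp_lm')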

Text generation

We now have a language model, fine-tuned to your WhatsApp conversation. To generate text all we need to do is set it running and it’ll start predicting the next word over and over for as long as you ask it to.

Fastai gives us a useful predict method to do exactly this, all we need to do is give it some text to get it started, and tell it how long to run for. The output will still be in tokenised format, so I wrote a function to clean up the text and print it out nicely in the notebook.

## function to generate text
def generate_chat(start_text, n_words):
    text = learn.predict(start_text, n_words, temperature=0.75)
    text = text.replace(" xxspk ", "\n").replace(" \'", "\'").replace(" n\'t", "n\'t")
    text = re.sub(r'\s([?.!"](?:\s|$))', r'\1', text)

    for participant in participants:
        text = text.replace(participant, participant + ":")

    print(text)

Let’s go ahead and start it off.

## generate some text of length 200 words
generate_chat(participants[0] + " are you ok ?", 200)

Nice! It certainly reads like one of my conversations (focused largely on travelling home after work each day), the context is sustained (that’s the LSTM memory at work), and the text even looks to be tailored to each participant.

Final thoughts

I’ll finish with a word of caution. As with many other AI applications, fake text generation can be used at scale for unethical purposes (e.g. spreading messages designed to harm on the internet). I’ve used it here to provide a fun and hands-on way to learn about language models, but I encourage you to think about how the methods described above can be used for other purposes (e.g. as an input to a text classification system) or to other kinds of sequence-like data (e.g. music composition).

This is an exciting and fast-moving field with the potential to build powerful tools that create value and benefit society. I hope that this post shows you that anyone can get involved.