From First Principles: Make an AI customer service bot in PyTorch

In this three-part series, I’ll walk you through how I made and productionized a conversational bot using a recurrent neural network. First, in this article, we’ll build the network and train it on some toy sentences, digging into neural net theory on the way. Here’s a sample of the output:

In the two subsequent articles, we’ll:

  • Improve the data, training, and prediction so the PyTorch bot can genuinely hold conversations; and
  • Deploy the bot to production. An RNN is necessarily stateful, which makes for interesting deployment concerns.

If you follow my writing, you’ll know already that I don’t like complication, shortcuts, or third-party libraries: all this will be from scratch, in a form that I hope is easy to follow.

Step 1: Get some data

There’s no point spending hours training a dope model before knowing that the basics are right. Considering we’re still at PoC stage right now, let’s choose some simple training data:

sentences = ['How may I help you?',
'Can I be of assistance?',
'May I help you with something?',
'May I assist you?']

Step 2: Tokenize that data

Fundamentally, neural networks operate on numbers, so we have to turn each word into a numeric token. This isn’t cheating: I’m still building a recurrent network here that’s aware of the past, it’s not some bag-of-words model. I guess if you think about it, the human brain doesn’t operate on words either, but translates them to electricity and neurotransmitters that it can process.

We can build a dictionary that converts words to integers, and at the same time another dict that converts back (ints to words):

words = dict()
reverse = dict()
i = 0
for s in sentences:
    s = s.replace('?',' <unk>')
    for w in s.split():
        if w.lower() not in words:
            words[w.lower()] = i
            reverse[i] = w.lower()
            i = i + 1

There are more sophisticated ways to do this conversion. Commonly, words that occur in the corpus with a frequency < N are replaced with the catch-all token <unk>, and we might similarly replace proper nouns. We should probably mark the start and end of sentences. We might also stem or lemmatize words, so that writing and written are both replaced with write. There are lots of NLP techniques that compensate for how bad AI traditionally was.
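To make a couple of those ideas concrete, here is a rough sketch of frequency thresholding plus sentence markers. The names build_vocab, encode and min_count, and the <s> and </s> marker tokens, are my own illustrative choices and not part of this article's code. Note that <unk> is used here in its usual "unknown word" sense, unlike the toy tokenizer above, which uses it in place of the question mark.

```python
from collections import Counter

sentences = ['How may I help you?',
             'Can I be of assistance?',
             'May I help you with something?',
             'May I assist you?']

def build_vocab(sentences, min_count=2):
    """Keep words seen at least min_count times; everything else maps to <unk>."""
    counts = Counter(w for s in sentences
                     for w in s.lower().replace('?', '').split())
    vocab = {'<unk>': 0, '<s>': 1, '</s>': 2}   # hypothetical special tokens
    for w, c in counts.items():
        if c >= min_count:
            vocab[w] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Turn a sentence into ids, wrapped in start and end markers."""
    tokens = sentence.lower().replace('?', '').split()
    ids = [vocab.get(w, vocab['<unk>']) for w in tokens]
    return [vocab['<s>']] + ids + [vocab['</s>']]

vocab = build_vocab(sentences)
print(vocab)
print(encode('How may I help you?', vocab))   # [1, 0, 3, 4, 5, 6, 2]
```

With only four sentences, most words fall below the threshold and collapse to <unk>, which is a fair illustration of why these tricks only pay off on a larger corpus.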

At this point there’s no reason to use them, but in the second article of this series I will bring in spaCy, a state-of-the-art NLP library that I’ve found easy to use and had success with before.

Based on the very simple four training sentences above, the dictionary that’s built up looks like this:

{'<unk>': 5,
'assist': 12,
'assistance': 9,
'be': 7,
'can': 6,
'help': 3,
'how': 0,
'i': 2,
'may': 1,
'of': 8,
'something': 11,
'with': 10,
'you': 4}
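As a quick sanity check, the two dictionaries can be used to round-trip a sentence. The encode and decode helpers below are names I made up for illustration; the dictionary is copied from the output above.

```python
words = {'how': 0, 'may': 1, 'i': 2, 'help': 3, 'you': 4, '<unk>': 5,
         'can': 6, 'be': 7, 'of': 8, 'assistance': 9, 'with': 10,
         'something': 11, 'assist': 12}
reverse = {i: w for w, i in words.items()}

def encode(sentence):
    """Sentence -> list of integer tokens, using the same '?' handling as above."""
    return [words[w] for w in sentence.replace('?', ' <unk>').lower().split()]

def decode(ids):
    """List of integer tokens -> space-separated words."""
    return ' '.join(reverse[i] for i in ids)

ids = encode('May I assist you?')
print(ids)           # [1, 2, 12, 4, 5]
print(decode(ids))   # may i assist you <unk>
```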

Step 3: Design a neural network

The idea behind this conversational bot is that, given an input word, it should be able to predict the next word, and that it should have some memory of the conversation in predicting that next word. In the example above you can see I triggered the net with a starting token <unk>. From that, the net’s next word was ‘may’. Next, from [<unk>, may] it predicted ‘i’. And so on.

The design step is, ahem, highly iterative (aka finger in the air). Certainly for language we need some kind of recurrent network, because it needs to keep track of previous words, as well as the overall context of the conversation, in order to form a sentence. (One of the unstated design requirements here is that we want the network to figure out its own best way to keep track of the context; we don’t want to have to do that ourselves).

The idea behind a recurrent network is simple:

The net’s input consists of a word, plus the previous predicted word (which may or may not have been accurate, and also changes continuously as it aggregates past inputs — so it can be thought of as a kind of hidden state rather than an actual word). From these two things it outputs its next prediction.
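Stripped of everything else, that feedback idea can be sketched with a single scalar "neuron". This is only a toy under my own assumptions (the weights w_x and w_h are arbitrary numbers, and rnn_step is a made-up name), not the network used in this article.

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.8, b=0.0):
    """One recurrent step: the new state mixes the current input with the previous state."""
    return math.tanh(w_x * x + w_h * h + b)

h = 0.0                            # initial hidden state
states = []
for x in [1.0, 0.0, 0.0, 0.0]:     # one 'event', then silence
    h = rnn_step(x, h)
    states.append(h)

print(states)
```

The state stays non-zero after the input has gone quiet: the first input is still echoing around the loop, fading a little at each step. That echo is the network's memory.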

The problem here is that there’s a feedback loop, and positive feedback is inherently unstable (think of the amplified screeching when a microphone’s too close to a speaker). Errors get magnified over time and training cycles, leading to the well-known exploding gradient problem. This might happen if neurons have, for example, a relu non-linearity. Alternatively, if neurons’ outputs are squashed into a bounded range, as with a sigmoid (0 to 1) or tanh (-1 to 1) output function, then repeatedly multiplying quantities whose magnitude is less than one drives them quickly towards zero, especially in computers’ limited floating-point representation. That’s the vanishing gradient problem.
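The arithmetic behind both problems is easy to see in isolation. In this sketch, gain stands in for whatever factor a signal or gradient picks up on each trip around the loop; the function name and the values 0.9 and 1.1 are purely illustrative.

```python
def repeated_gain(gain, steps=100):
    """Multiply a unit signal by the same factor over and over."""
    signal = 1.0
    for _ in range(steps):
        signal *= gain
    return signal

print(repeated_gain(0.9))   # shrinks towards zero: vanishing
print(repeated_gain(1.1))   # blows up: exploding
```

A factor only 10% away from one, in either direction, changes the result by several orders of magnitude over a hundred steps.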

The good news is that these problems were largely solved by LSTM neurons and more recently in a different way by GRU. The implementation details are beyond the scope of this article; besides, they are basic building blocks of PyTorch.

The network architecture (how many cells per layer, how many layers, whether we use dropout) is not a critical implementation detail, just something we can tweak as we find necessary.

Step 4: Inputs and Outputs

That’s the main body of the network, but still we have to settle on what the input and output look like. First, the input.

It’s been known for ages that networks work better given sparse categorical inputs rather than dense ones. For example, with the vocabulary of 13 words listed above, it’s easier to train the network with 13 inputs that are all zeroes except for one, than 1 input that varies between 0–1 in increments of 1/13. That’s one-hot encoding.

Nobody really uses one-hot encoding any more. With 13 inputs, why waste 12 of them on zeroes and one on a one, when via a simple lookup we could choose a vector formed out of any 13 scalars? Better still, treat those 13 scalars as trainable parameters, too. That’s what embeddings do, and an embedding layer is treated as any other, getting parameter updates via backpropagation.

Left: one-hot, Right: embedding cool
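One way to see why the lookup is enough: multiplying a one-hot vector by a weight matrix just picks out one row of that matrix. The sketch below uses plain Python lists and random numbers as stand-ins for trained weights; table, one_hot and matvec are illustrative names, and the vector length of 4 is arbitrary.

```python
import random

random.seed(0)
vocab_size, dim = 13, 4
# Stand-in for a trainable embedding table: one row of `dim` scalars per word
table = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]

def one_hot(i, n):
    """Vector of n zeroes with a single one at position i."""
    return [1.0 if j == i else 0.0 for j in range(n)]

def matvec(vec, matrix):
    """Multiply a length-n vector by an n x d matrix."""
    return [sum(v * row[k] for v, row in zip(vec, matrix))
            for k in range(len(matrix[0]))]

token = 6                                   # 'can' in the dictionary above
via_one_hot = matvec(one_hot(token, vocab_size), table)
via_lookup = table[token]
print(via_one_hot == via_lookup)            # True
```

Same result, but the lookup skips twelve multiplications by zero per output value.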

So that’s an embedding, going into a recurrent layer, and at the output we’ll need a layer with the number of outputs equal to the length of our vocabulary (so, 13). Framing this as a kind of ‘classification’ problem, if neuron 6 had the highest activation of the 13 output neurons, we’d say that word 6 was the net’s output, and so on. That’s argmax.

But there are a few problems with that approach, including most notably that it’s not differentiable, and also that outputs could start to go arbitrarily high. Thus, along came softmax which is a simple, differentiable formula to squish each output between 0–1 (and the sum of all outputs is 1, so it can be interpreted as a probability):
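For reference, the standard softmax over K outputs (K = 13 here) is usually written as follows, where x_i is the raw activation of output neuron i. The symbol names are my own choice of notation.

```latex
\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}, \qquad i = 1, \dots, K
```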

Much neural net math looks complex in notation form, but isn’t actually.

For classification problems like this where there’s a single right answer and we don’t care about less likely classes, taking the log of the softmax helps the network train faster. If the net is confident in an output, that is, softmax tends to 1, log softmax will tend to 0, leading to a smaller gradient and smaller weight updates. Likewise, where the pseudo-probabilities are smaller, PyTorch’s log_softmax will ensure bigger gradients.
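To take the mystery out of both functions, here they are in a few lines of plain Python. This is a sketch for intuition only; in the model itself, PyTorch’s built-in log_softmax does this job. Subtracting the maximum before exponentiating is a common numerical-stability trick and doesn’t change the result.

```python
import math

def softmax(xs):
    """Squash raw scores into positive values that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(xs):
    """log(softmax(xs)), computed without forming the probabilities first."""
    m = max(xs)
    log_total = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_total for x in xs]

scores = [2.0, 1.0, 0.1]          # made-up activations for three output neurons
probs = softmax(scores)
log_probs = log_softmax(scores)
print(probs, sum(probs))
print(log_probs)
```

The largest score gets the largest probability, and its log is the one closest to zero.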

Similarly for the loss function: we want more-wrong probabilities to yield higher errors so the gradient changes faster, and less-wrong probabilities to yield smaller errors. Those are nice effects of using Cross Entropy Loss, which also happens to be the standard error function for a problem like this. With a softmax output, this loss gets close to zero but never quite reaches it, so the network should never get completely stuck (and also never achieve ‘perfection’, which probably would mean overfitting anyway).
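For a single training example with one right answer, cross entropy boils down to the negative log of the probability the net gave to the correct class. A small sketch with made-up probabilities over three classes (the variable names are illustrative):

```python
import math

def cross_entropy(probs, target):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probs[target])

target = 0
confident_right = [0.90, 0.05, 0.05]
unsure          = [0.40, 0.30, 0.30]
confident_wrong = [0.05, 0.90, 0.05]

for name, p in [('confident and right', confident_right),
                ('unsure', unsure),
                ('confident and wrong', confident_wrong)]:
    print(name, round(cross_entropy(p, target), 3))
```

The confidently wrong prediction is punished far more heavily than the unsure one, which is exactly the behaviour described above.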

Step 5: Feed data into the network

This is just a standard iterator made from the tokenized training data in step 2. About the only interesting thing here is that X, the training samples being fed into the network, has requires_grad=False. In PyTorch, gradients only flow back to tensors that require them, so you might expect to have to switch that on for the inputs so that everything after them in the network can learn.

But the network’s first layer, the embedding, is a simple lookup on integer token ids, so a gradient on the inputs themselves doesn’t make sense. The embedding’s parameters and output automatically get given a gradient by PyTorch.

Also, since this article is supposed to stop at a working PoC, I’ve omitted batching and CUDA for simplicity. I also haven’t bothered to include test/validation splits.
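As a rough sketch of what such an iterator could look like, the generator below yields (input, target) pairs of consecutive tokens, one sentence at a time. The pairing scheme and the name training_pairs are my assumptions for illustration; the vocabulary is rebuilt the same way as in step 2 so that the block runs on its own.

```python
sentences = ['How may I help you?',
             'Can I be of assistance?',
             'May I help you with something?',
             'May I assist you?']

# Rebuild the word -> id dictionary from step 2
words = {}
for s in sentences:
    for w in s.replace('?', ' <unk>').lower().split():
        words.setdefault(w, len(words))

def training_pairs(sentences, words):
    """Yield (input_id, target_id) for each word and the word that follows it."""
    for s in sentences:
        ids = [words[w] for w in s.replace('?', ' <unk>').lower().split()]
        for x, y in zip(ids[:-1], ids[1:]):
            yield x, y

pairs = list(training_pairs(sentences, words))
print(pairs[:5])    # [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
print(len(pairs))   # 20
```

Each id would then be wrapped in a LongTensor before being fed to the network.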

Step 6: Construct the network according to the design

Building a neural network in PyTorch is very easy:

To clarify some of the numbers:

  • There are len(words) embeddings (=13), each a vector of length 10. Why 10? Pretrained GloVe vectors, included with spaCy, run to length 300. Longer vectors can capture more, but with a vocabulary as tiny as ours they would mostly just add training time.
  • The next layer, the LSTM, takes those 10 inputs and feeds them into two layers each with 20 neurons. There’s a bit of dropout aka regularization added to prevent overfitting.
  • We have to initialize the LSTM layers’ hidden state. I just set it to zeros, with a shape of 2 layers by a batch size of 1 by 20 neurons per layer (matching above). Eagle-eyed readers will note that it’s a tuple. That’s because LSTM layers have a hidden state h and an (also hidden) cell state c. They can be initialized identically.
  • The output layer takes the final LSTM layer’s 20 outputs and connects them to its len(words) outputs: 13 words, 13 categories, 13 output neurons.
  • We need to do a bit of gymnastics with the output, just to get things into the form that PyTorch wants. -1 is often a safe bet as it just means ‘whatever’ provided that everything else matches what you need.
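Putting those bullet points together, a network of this shape might look like the sketch below. It follows the numbers given above (13 words, embeddings of length 10, two LSTM layers of 20 neurons, 13 outputs), but the class name Net, the dropout value of 0.1 and the exact reshaping calls are my assumptions and may differ from the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    """Embedding -> 2-layer LSTM -> linear -> log-softmax."""

    def __init__(self, vocab_size, embed_dim=10, hidden_dim=20,
                 num_layers=2, dropout=0.1):   # dropout value is a guess
        super().__init__()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            num_layers=num_layers, dropout=dropout)
        self.out = nn.Linear(hidden_dim, vocab_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # Tuple (h, c), each zeros of shape (layers, batch of 1, neurons)
        shape = (self.num_layers, 1, self.hidden_dim)
        return (torch.zeros(shape), torch.zeros(shape))

    def forward(self, token_ids):
        # token_ids: LongTensor of shape (seq_len,)
        embedded = self.embedding(token_ids)              # (seq_len, 10)
        lstm_in = embedded.view(len(token_ids), 1, -1)    # (seq_len, 1, 10)
        lstm_out, self.hidden = self.lstm(lstm_in, self.hidden)
        logits = self.out(lstm_out.view(len(token_ids), -1))   # (seq_len, 13)
        return F.log_softmax(logits, dim=1)

m = Net(vocab_size=13)
m.eval()                                    # switch dropout off for a quick check
with torch.no_grad():
    log_probs = m(torch.LongTensor([5]))    # 5 is the id of '<unk>'
print(log_probs.shape)                      # torch.Size([1, 13])
```

Because the hidden state lives on the module, the net remembers earlier words between calls. During training you would want to reset or detach that state between sentences so gradients don’t try to flow back through the whole history.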

Step 7: Make the training loop and get some results!
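A training loop for a network like this might look like the following sketch. It is self-contained, so it rebuilds the layers inline and trains on a single sentence; the Adam optimizer, the learning rate of 0.01 and the fixed seed are my assumptions, not necessarily what the notebook uses. CrossEntropyLoss here is fed raw outputs, since in PyTorch it applies log-softmax internally.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 13

# 'How may I help you <unk>' as token ids from the step 2 dictionary
seq = torch.LongTensor([0, 1, 2, 3, 4, 5])
X, Y = seq[:-1], seq[1:]          # each word is trained to predict the next one

embedding = nn.Embedding(vocab_size, 10)
lstm = nn.LSTM(10, 20, num_layers=2)
linear = nn.Linear(20, vocab_size)
params = (list(embedding.parameters()) + list(lstm.parameters())
          + list(linear.parameters()))

optimizer = torch.optim.Adam(params, lr=0.01)   # optimizer choice is an assumption
loss_fn = nn.CrossEntropyLoss()                 # log-softmax + negative log likelihood

losses = []
for epoch in range(300):
    optimizer.zero_grad()
    # Fresh zeroed (h, c) state at the start of the sentence
    hidden = (torch.zeros(2, 1, 20), torch.zeros(2, 1, 20))
    out, hidden = lstm(embedding(X).view(len(X), 1, -1), hidden)
    logits = linear(out.view(len(X), -1))       # (5, 13)
    loss = loss_fn(logits, Y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The exact numbers depend on the seed and PyTorch version, but the loss should fall steadily as the net memorizes the sentence.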

My model got down to a loss of 0.27 after 300 epochs. (Yours will be different, but by all means run this cell many times, lowering the learning rate over time.) That seems pretty good, but it’s hard to evaluate except by trying it:

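import torch
from torch.autograd import Variable

def get_next(word_):
    # Look up the lower-cased word, run it through the trained model m,
    # and map the highest-scoring output back to a word.
    word = word_.lower()
    out = m(Variable(torch.LongTensor([words[word]])))
    return reverse[int(out.max(dim=1)[1].data)]

def get_next_n(word_, n=3):
    # Feed each prediction back in as the next input, printing as we go.
    print(word_)
    for i in range(0, n):
        word_ = get_next(word_)
        print(word_)

get_next_n('<unk>', n=12)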

Notice how in using the model, all you have to do is take the max of the outputs, and that reflects the dictionary position of the word. I kind of like the chaotic entropy in the output right now; we could do better for sure, but it’s clear we’re doing this via some kind of AI, rather than a rules engine:

Like DJs deliberately making bad mixes to prove they’re doing it live

The complete notebook for this Part 1 of the series is here. Next time I’ll show how I made this model much better and more interactive, and then, in the final part, how it got to production.

Source: Deep Learning on Medium