Building Jarvis, the Generative Chatbot with an Attitude


I used a Neural Machine Translation model trained on Amazon SageMaker, together with Amazon Polly, the Google Speech API, and an audio routing tool, to build Jarvis, a chatbot who can talk to you in a video conference call.

Carsales.com, the company I work for, holds an annual hackathon event where everyone (tech or non-tech) comes together to form a team and build anything — anything at all. Well, preferably you would build something that has a business purpose, but it is really up to you. The idea for this chatbot actually came from Jason Blackman, our Chief Information Officer at carsales.com.

Carsales Hackathon

Given that our next hackathon would be an online event, thanks to COVID-19, wouldn’t it be cool if we could host a Zoom webinar where any carsales.com employee could jump in to hang out and chat with an AI bot, which we would call Jarvis, who would always be available for a chat?

Brainstorming

After tossing around ideas, I came up with a high-level scope. Jarvis would need to have a visual presence, just as a human webinar participant would. He would need to be able to listen to what you say and respond contextually, with a voice.

I wanted him to be as creative as possible in his replies and to be able to generate a reply on the fly. Most chatbot systems are retrieval based, meaning that they have hundreds or thousands of prepared sentence pairs (source and target) which form their knowledge base. When the bot hears a sentence, it tries to find the most similar source sentence in its knowledge base and simply returns the paired target sentence. Retrieval-based bots such as Amazon Alexa and Google Home are a lot easier to build and work very well for specific tasks like booking a restaurant or turning the lights on or off, although their conversation scope is limited. For entertainment purposes like casual chatting, however, they lack creativity in their replies when compared to their generative counterparts.

For that reason, I wanted a generative-based system for Jarvis. I was fully aware that I would likely not achieve a good result, but I really wanted to know how far current generative chatbot technology has come and what it can do.

Architecture

Ok, so I knew what I wanted. Now it was time to really contemplate how on earth I was going to build this bot.

The first component we needed was a mechanism to route audio and video. Our bot needed to be able to hear conversations on Zoom, so we needed a way to route the audio from Zoom into our bot. This audio would then need to be passed into a speech recognition module, which would give us the conversation as text. We would then need to pass this text into our generative AI model to get a reply, which would be turned into speech by using text-to-speech tech. While the audio reply is being played, we would need an animated avatar, which, apart from fidgeting, could also move his lips in sync with the audio playback. The avatar animation and audio playback needed to be sent back to Zoom for all meeting participants to hear and see. Wow! It was indeed a pretty complex system.

Jarvis’ architecture diagram

To summarise, we needed the following components:

  • Audio/video routing
  • Speech recognition
  • Generative AI model
  • Text to Speech
  • Animated avatar
  • Controller

Audio/video routing

I love it when someone else has done the hard work for me. Loopback is an audio tool that allows you to redirect audio from any application into a virtual microphone. All I needed were two audio routings. The first one was to route the audio from the Zoom app into a virtual microphone, from which my bot would listen.

Audio routing 1 diagram

The second routing was to route the chatbot’s audio output into yet another virtual microphone, which both the Zoom app and our avatar tool would listen to. It is obvious why Zoom would need to listen to this audio. However, why would our avatar tool need it? For lip-syncing, so that our avatar could move his lips in time with the audio playback. You will see more details on this in later sections of this blog.

Audio routing 2 diagram

Speech Recognition

This module is responsible for processing incoming audio from Zoom via a virtual microphone and turning it into text. There were a couple of offline and online speech recognition frameworks to choose from. The one I ended up using was the Google Speech API. It is an online API with an awesome Python interface that delivers superb accuracy and, more importantly, allows you to stream and recognise audio in chunks, which minimises the processing time significantly. I would like to emphasise that latency (how long it takes for the bot to respond to a query) is critical for a chatbot. A slow-responding bot can look very robotic and unrealistic.

Most of the time, the Google Speech API returns a response in less than a second after a sentence is fully heard.
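For illustration, here is a minimal sketch of streaming recognition with the google-cloud-speech Python client. It is not the exact code behind Jarvis; in particular, audio_chunks is assumed to be a generator that yields raw PCM audio read from the virtual microphone.

from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)
streaming_config = speech.StreamingRecognitionConfig(config=config, interim_results=False)

# audio_chunks: assumed generator yielding raw audio bytes from the virtual microphone
requests = (speech.StreamingRecognizeRequest(audio_content=chunk) for chunk in audio_chunks)
for response in client.streaming_recognize(streaming_config, requests):
    for result in response.results:
        if result.is_final:
            print(result.alternatives[0].transcript)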

Generative AI Model

This is the part that I spent most of my time on. After spending a day or two catching up with recent developments in generative chatbot techniques, I found that Neural Machine Translation models have become quite popular for this task.

The concept is to feed an encoder-decoder LSTM model with word embeddings from an input sentence and have it generate a contextual output sentence. This technique is normally used for language translation. However, given that the job is simply mapping one sentence to another, it can (in theory) also be used to generate a reply to a sentence.

Neural Machine Translation Model Architecture

In layman’s terms, an input sentence is broken up into words. Each word is then mapped to an integer id, which is passed into an embedding layer. During training, the embedding layer learns to turn this list of ids into a list of embedding vectors of some fixed dimension. These vectors are constructed in such a way that words with similar meanings yield similar vectors, which provides much richer information than a single integer value. The vectors are then passed into an LSTM encoder layer that turns them into a thought vector (some call it a latent vector), which contains information about the whole input sentence. Please note that there is a popular misconception that there are many LSTM layers or blocks, when in fact there is only one. The many blocks in the diagram above show the same LSTM block being called one time step after another, processing the sentence word by word.

The decoder on the right-hand side of the model is responsible for turning this thought vector into an output sentence. A special beginning-of-sentence <BOS> token is passed as the initial input to the LSTM layer, together with the thought vector, to generate the first word, which is then fed back into the same LSTM layer as an input to generate the next word, and so on.

Decoder Output Softmax

Going slightly deeper into technical realms, the output of an LSTM decoder unit is actually a vector that is passed into a softmax (classification) layer, which returns the probability of each possible word in our vocabulary. The word with the highest probability (in the case above it is ‘I’) is picked as the output word, and is also passed on as an input to the LSTM decoder layer to generate the next word.
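If you want to see the idea in code, here is a minimal training-time sketch of the encoder-decoder described above, written in Keras. It is illustrative only; the project itself uses SageMaker’s built-in seq2seq (covered next), and the sizes here simply mirror the hyper-parameters discussed later.

import tensorflow as tf
from tensorflow.keras import layers, Model, Input

vocab_size = 15000   # pruned vocabulary size (see the data preparation section)
embed_dim = 512      # word embedding size
hidden_units = 2048  # LSTM hidden cells

# Encoder: token ids -> embeddings -> a single LSTM -> thought vector (h, c)
encoder_inputs = Input(shape=(None,), name="source_tokens")
enc_emb = layers.Embedding(vocab_size, embed_dim)(encoder_inputs)
_, state_h, state_c = layers.LSTM(hidden_units, return_state=True)(enc_emb)

# Decoder: previous target token + thought vector -> probability of the next word
decoder_inputs = Input(shape=(None,), name="target_tokens")
dec_emb = layers.Embedding(vocab_size, embed_dim)(decoder_inputs)
dec_out, _, _ = layers.LSTM(hidden_units, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
next_word_probs = layers.Dense(vocab_size, activation="softmax")(dec_out)

model = Model([encoder_inputs, decoder_inputs], next_word_probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")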

There are a few examples online on how to build this model architecture. However, why build one if someone else has already done the hard work for you? Introducing Amazon SageMaker! Amazon SageMaker is a collection of tools and pipelines to expedite building ML models, and it comes with a vast array of amazing built-in algorithms such as image classification, object detection, neural style transfer and seq2seq, which is a close variant of Neural Machine Translation with an extra attention mechanism.

Amazon SageMaker

The Amazon SageMaker seq2seq model is highly customisable. You can choose how many LSTM units are used, the number of hidden cells, the embedding vector dimensions, the number of LSTM layers, etc., which gave me more than enough flexibility to experiment with different parameters to achieve better results.

Getting the Training Set

The selection of the training set is crucial if your chatbot is to respond with contextual and meaningful replies. The training set needs to be a collection of conversation exchanges between two parties. Specifically, we needed to construct a pair of sentences, a source and a target, for each entry. For example: Source: ‘How are you?’ Target: ‘I am fine.’ Source: ‘Where do you live?’ Target: ‘I live in Australia.’ This type of training set is very hard to get without a significant manual clean-up effort. The most popular dataset that people use is the Cornell Movie Dialogue Corpus, which is not great (you will see why a little later), but is the best you can get right now.

Cornell Movie Dialog Corpus

This dataset consists of 220k lines of conversations taken from movie dialogues. Each line comprises dialogue from one or more persons. The three example lines below show some good examples, where the conversation starts with a question from one person and is followed by an answer from another.

You know French? Sure do … my Mom’s from Canada

And where’re you going? If you must know, we were attempting to go to a small study group of friends.

How many people go here? Couple thousand. Most of them evil

However, there are many more bad examples, where the follow-up sentence does not make sense without the context of the prior conversation or a visual reference. For example, see the following:

What’s the worst? You get the girl.

No kidding. He’s a criminal.

It’s a lung cancer issue. Her favorite uncle

There was no easy way to remove the bad lines without putting in manual labour, which I was not prepared to do. And even if I had, I would have ended up with a much smaller dataset; not enough to train my AI model with. Hence, I decided to proceed regardless and see how far I could get.

I generated multiple pairs of source and target sentences from each conversation line by pairing every two consecutive sentences, regardless of who said them.

For example, ‘What’s the worst? You are broke? What to do next?’ is turned into two pairs of conversation lines:

What’s the worst? You are broke?

You are broke? What to do next?

This way, I managed to enlarge my dataset. I was fully aware that I could mistakenly pair a source and target sentence spoken by the same person. However, half of the time the target sentence would still make sense as a reply from another person, as shown in the example below.

What’s the worst? You are broke? What to do next?

Tokenising/splitting sentences can be done in two lines of code using the nltk library.

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = tokenizer.tokenize(text)
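Putting the splitting and pairing together, a minimal sketch of the pair-generation step might look like the following. The conversations list and the pair format here are my own illustration, not the exact code used in the project.

import nltk

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

conversations = ["What's the worst? You are broke? What to do next?"]
pairs = []
for line in conversations:
    sentences = tokenizer.tokenize(line)
    # Pair every sentence with the one that follows it, regardless of speaker
    for source, target in zip(sentences, sentences[1:]):
        pairs.append((source, target))

# [("What's the worst?", "You are broke?"), ("You are broke?", "What to do next?")]
print(pairs)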

I also trimmed sentences longer than 20 words, as my model could only read an input of up to 20 words and produce an output of up to 20 words. Besides this, longer sentences mean greater context and higher variation, which is a lot harder for an AI model to learn given that our training set is not that big.

With the above methods, I managed to get 441k pairs of conversation lines.

Pre-processing the Training Set

The next step was to pre-process this training set further, which involved several steps.

The first step was to remove and replace unwanted strings in the pairs, such as XML and HTML tags and runs of dots and dashes.
The next step was to expand contractions, for example, ‘you’ll’ was expanded to ‘you will’, ‘I’m’ to ‘I am’, etc., which increased my AI model’s accuracy. The reason for this is that, in later steps, sentences are turned into lists of words by splitting on delimiter characters like spaces, new lines or tabs, and all unique words form our vocabulary. A contraction like ‘I’m’ would be treated as a new word in the vocabulary, which would increase our vocabulary size unnecessarily and reduce the effectiveness of our training set due to fragmentation, making it harder for our AI to learn.

I used a very handy Python library called ‘contractions’, which expands contractions in a sentence with one line of code.

text = contractions.fix(text)

Punctuation was the next thing to tackle. I separated punctuation from sentences (e.g., ‘How are you?’ was turned into ‘How are you ?’). This was done for a similar reason to the contractions: to make sure that the punctuation itself would be treated as a word, rather than being merged with the previous word into a new word, such as ‘you?’ in the example above.
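A condensed sketch of these two clean-up steps could look like this; the helper name and the exact regular expressions are illustrative only.

import re
import contractions

def clean_sentence(text):
    # Expand contractions: "I'm" -> "I am", "you'll" -> "you will"
    text = contractions.fix(text)
    # Separate punctuation so it becomes its own token: "How are you?" -> "How are you ?"
    text = re.sub(r"([?.!,])", r" \1 ", text)
    # Collapse the extra whitespace introduced above
    return re.sub(r"\s+", " ", text).strip()

print(clean_sentence("I'm fine, how are you?"))  # I am fine , how are you ?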

By going through all the steps above, I ended up with about 441k training pairs and a vocabulary of 56k words.

As a final step, I added vocabulary pruning so that I could control the maximum number of words to support in the vocabulary. The pruning was easily done by removing the less frequently used words. A smaller vocabulary size and a larger training set are more favourable. This makes sense: imagine the difficulty of teaching your kids five new words by giving them 100 sample sentences, as opposed to 100 new words with the same 100 sample sentences. Using fewer words more frequently across more sentences will of course help you learn better.
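As a rough illustration, frequency-based pruning can be as simple as the following. The <unk> placeholder and the helper names are my own; SageMaker’s example preprocessing script handles vocabulary building in its own way.

from collections import Counter

def build_vocab(pairs, max_vocab=15000):
    # Count word frequencies across all source and target sentences
    counts = Counter(word for src, tgt in pairs for word in (src + " " + tgt).split())
    # Keep only the most frequent max_vocab words
    return {word for word, _ in counts.most_common(max_vocab)}

def prune_sentence(sentence, vocab):
    # Replace rare words with an unknown-word placeholder
    return " ".join(w if w in vocab else "<unk>" for w in sentence.split())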

Training Data Preparation Pipeline

From my many experiments, I found that a vocabulary of 15k words gave me the best results, which yielded 346k training pairs, a pairs-to-vocabulary ratio of 346k/15k ≈ 23, as opposed to the original 441k/56k ≈ 7.9.

Training the Seq2Seq Model

Kick-starting the training was super easy thanks to Amazon SageMaker, which already provides an example Jupyter notebook on how to train a seq2seq model.

I just needed to customise the S3 bucket where my training file was located and add my data pre-processing code. Next, I will show you how easy it is to train a seq2seq model in SageMaker, with code snippets here and there. Check out my Jupyter notebook if you want to see the complete source code.

First, you need to let SageMaker know which built-in algorithm you want to use. Each algorithm is containerised and available from a specific URL.

import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri
region_name = boto3.Session().region_name
container = get_image_uri(region_name, 'seq2seq')

Next, you need to construct the training job description, which provides a few important pieces of information about the training job. First, you set the location of the training set and where to store the final model.

"InputDataConfig": [
{
"ChannelName": "train",
"DataSource": {
"S3DataSource": {
"S3DataType": "S3Prefix",
"S3Uri": "s3://{}/{}/train/".format(bucket, prefix),
"S3DataDistributionType": "FullyReplicated"
}
},
},
....
"OutputDataConfig": {
"S3OutputPath": "s3://{}/{}/".format(bucket, prefix)
},

After this, you need to choose what machine or instance you want to run this training on. Seq2seq is quite a heavy model, so you need a GPU machine. My recommendation is ml.p3.8xlarge, which has four NVIDIA V100 GPUs.

"ResourceConfig": {
"InstanceCount": 1,
"InstanceType": "ml.p3.8xlarge",
"VolumeSizeInGB": 50
}

Finally, you need to set the hyper-parameters. This is where I spent 30% of my time, a close second to experimenting with data preparation strategy. I built various models under different settings and compared their performances to come up with the best configuration.

Remember, I chose to limit my sentences to 20 words. The first two lines below are the reason why. My LSTM model was only trained to recognise an input of up to 20 words and an output of up to 20 words.

"HyperParameters": {
"max_seq_len_source": "20",
"max_seq_len_target": "20",
"optimized_metric": "bleu",
"bleu_sample_size": "1000",
"batch_size": "512",
"checkpoint_frequency_num_batches": "1000",
"rnn_num_hidden": "2048",
"num_layers_encoder": "1",
"num_layers_decoder": "1",
"num_embed_source": "512",
"num_embed_target": "512",
"max_num_batches": "40100",
"checkpoint_threshold": "3"
},

Normally, the larger your batch size (if your GPU RAM can handle it), the better, as you can train on more data in one go and speed up the training process. 512 seems to be the maximum for p3.8xlarge. Some people may argue that different batch sizes produce slightly different accuracies; however, I was not aiming to win a Nobel prize here, so small accuracy differences did not really matter much to me.

I used one layer each for the encoder and decoder, each with 2,048 hidden cells and a word embedding size of 512. A checkpoint and evaluation were performed every 1,000 batches, which, at 512 samples per batch, is 512k sample pairs, roughly 1.5x the overall training set (346k pairs) between evaluations. At each checkpoint, the best evaluated model was kept and, finally, saved into our output S3 bucket. SageMaker also supports early termination: if the model does not improve after three consecutive checkpoints (‘checkpoint_threshold’), the training stops.

Also, ‘max_num_batches’ is a safety net that can stop the training regardless. In case your model keeps improving forever (which is a good thing), this protects you so that the training cost won’t break the bank (which is not a good thing).

Training the Model

It only takes two lines of code to kick-start the training. Then you just have to wait patiently for a few hours, depending on the instance type you use. It took me one and a half hours using a p3.8xlarge instance.

sagemaker_client = boto3.Session().client(service_name='sagemaker')
sagemaker_client.create_training_job(**create_training_params)

As you can see below, the validation metrics found the best performing model at checkpoint number eight.

Bleu evaluation metrics

Model Evaluation

I intentionally skipped the discussion of the ‘optimized_metric’ earlier, as it deserves its own section to be explained properly. Evaluating a generative AI model is not a straightforward task. Often, there isn’t a one-to-one relationship between a question and an answer; ten different answers can be equally good for one question, which leads to a very broad scope of mapping. For a language translation task, the mapping is much narrower. For a chatbot, however, the scope increases dramatically, especially when you use movie dialogue as a training set. An answer to ‘Where are you going?’ could be any of the following:

  • Going to hell
  • Why do you ask?
  • I am not going to tell you.
  • What?
  • Can you say again?

There are two popular evaluation metrics to choose from for the seq2seq algorithm in Amazon SageMaker: perplexity and BLEU. Perplexity evaluates the generated sentences by taking random samples word by word using the probability distribution learned from our training set, which is very well explained in this article.

BLEU evaluates the generated sentence against the target sentence by scoring the word n-gram matches and penalising the generated sentence if it is shorter than the target. This article explains how it works and advises against using it, with excellent justification. Despite that, I found that it worked better than perplexity for me, because the quality of the generated sentences correlated strongly with the BLEU score upon manual inspection.
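For intuition, BLEU can also be computed locally with nltk. This small example is purely illustrative and separate from the SageMaker training pipeline; the two sentences are made up.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "i am not going to tell you".split()
candidate = "i am not telling you".split()

# Smoothing avoids a zero score when some higher-order n-grams have no match
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))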

Testing the Models

When the training is completed, you can create an endpoint for inference. From there, generating text can be done with a few lines of code.

# Assumption: the request body follows the SageMaker seq2seq example format,
# i.e. {"instances": [{"data": sentence}, ...]}
sentences = ["who is your best friend ?", "where are you going ?"]
payload = {"instances": [{"data": sentence} for sentence in sentences]}

response = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                   ContentType='application/json',
                                   Body=json.dumps(payload))
response = response["Body"].read().decode("utf-8")
response = json.loads(response)
for i, pred in enumerate(response['predictions']):
    print(f"Human: {sentences[i]}\nJarvis: {pred['target']}\n")

The results would not win me a medal (roughly 60% of the responses were out of context and 20% were passable). However, what excited me is that the other 20% were surprisingly good: the bot answered with the correct context and grammar. It showed me that the model could learn the English language to a certain extent and could even swear like us. It’s a good reminder, too, of what your kids can learn from movies.

One of my favourite examples is when the bot was asked, ‘Who is your best friend?’, and he answered, ‘My wife’. I checked, and most of these responses were not even in the training set, so the AI model did indeed learn a little bit of creativity and did not just memorise the training data.

Jarvis chat log (good examples)

Here are some of the bad ones.

Jarvis chat log (bad examples)

From experimenting with several different hyper-parameters, I found that:

  • Adding more encoder and decoder layers made it worse. Interestingly, the generated sentence structure became more complex, but the relevance of the answers to the questions was poor.
  • Reducing the word embedding size also dropped the generation quality.
  • Reducing the vocabulary size down to 15k words increased the quality, while reducing it further degraded the quality again.
  • Expanding word contractions and separating punctuation definitely increased the quality of responses.
  • Though it is probably more of a pre-processing step than a hyper-parameter setting, it is worth mentioning that adding Byte Pair Encoding (as suggested by the SageMaker notebook) dropped the quality of responses.

Speech Generation and Animated Avatar

The next modules I needed were text-to-speech and an animated avatar. I used the awesome Amazon Polly for text-to-speech generation. Again, it is an online API. However, it has a super fast response time (less than 300 ms most of the time) and high-quality speech that sounds natural.

Amazon Polly text to speech
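Calling Polly from Python via boto3 takes only a few lines; the voice choice and output file below are just an example, not the exact settings used for Jarvis.

import boto3

polly = boto3.client("polly")
result = polly.synthesize_speech(
    Text="Hello, I am Jarvis. What would you like to talk about?",
    OutputFormat="mp3",
    VoiceId="Matthew",  # assumption: any of Polly's English voices would do
)
# The audio can then be played back through the second virtual microphone
with open("jarvis_reply.mp3", "wb") as f:
    f.write(result["AudioStream"].read())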

Given my previous work as a Special Effects and Motion Capture Software Engineer, I was very tempted to build the animated avatar myself. I did actually build a simple avatar system for a separate project: Building a Bot That Plays Videos for my Toddler. Thankfully, the better part of me realised how long this would take if I pursued that journey, rather than using an awesome piece of 3D avatar software, Loom.ai, which comes with audio lip-sync capability! All you need to do is send an audio clip to the Loom.ai app and it will animate a 3D avatar to lip-sync to the provided audio. How awesome is that? The app also comes with a fake video camera driver, which streams the rendered output. I just needed to select this fake video camera in the Zoom app settings to include the animation in the Zoom meeting.

Jarvis animated avatar

Results

With all the modules completed, all I needed to do was build a controller system to combine everything. The first test ran quite well, except that Jarvis stole the whole conversation. He impolitely interjected and replied to every single sentence spoken by anyone in the video conference, which annoyed everyone. Well, that’s one thing I forgot to train Jarvis in — manners and social skills!

Video conference chat with Jarvis

Rather than building another social skills module, which I had no clue how to start, an easier fix was to teach Jarvis to start and stop talking on voice command. When he heard ‘Jarvis, stop talking’ he stopped responding until he heard ‘Jarvis, start talking’. With this new and important skill, everyone started to like Jarvis more.
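Inside the controller, this gate can be as simple as a flag checked before every reply. The sketch below is illustrative only, with generate_reply and speak standing in for the generative model and the Polly/avatar pipeline described above.

muted = False

def handle_utterance(text, generate_reply, speak):
    """Handle one recognised sentence from the speech recognition module."""
    global muted
    lowered = text.lower()
    if "jarvis, stop talking" in lowered:
        muted = True
    elif "jarvis, start talking" in lowered:
        muted = False
    elif not muted:
        # Only reply when Jarvis is allowed to talk
        speak(generate_reply(text))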

This generative model was fun to talk to for the first few exchanges, as some of his answers amazed us with their creativity and rudeness. However, the novelty wore off quickly because the longer the conversation went on, the more out-of-context responses we heard, until eventually we came to the realisation that we were not engaged in a meaningful conversation at all. For this reason, I ended up using a pattern-based chatbot driven by the AIML language, which surprisingly offered much better creativity than a normal retrieval-based model and could recall context from past information! This article explains clearly what it is, and perhaps it will be a story for me to tell another day.

With the help of my super awesome hackathon team members, we even extended Jarvis’ capabilities to be an on-demand meeting assistant who can help take down notes and action points assigned to relevant individuals, email them to meeting participants, and schedule a follow-up meeting.

Conclusion

Sometimes the quality of the generated text from the bot was good. However, after just a couple of conversation exchanges (even the good ones), you could clearly tell that something was off and unnatural.

  • It did not have past context. Each response was generated only from the context of the question just asked; it did not consider prior conversation, which most humans can do very easily. The technology to build a model that considers past conversation history is not there yet, or at least I have not heard of one.
  • More than half of the time, it gave an irrelevant response. A better model architecture may be needed, but my biggest suspect is the training set. Movie dialogue covers broad topics, which makes it very hard for the AI model to learn, especially when the source and target sentences sometimes do not make sense (e.g., no prior context, requiring a visual reference, or being incorrectly paired). A training set like restaurant-booking dialogue might work better, as its scope is very limited. However, the conversation exchanges would likely be less entertaining.

Given the above, I believe a fully practical generative chatbot is still years, if not decades, away. I can now totally understand why it takes us years to learn a language.

Future Generative Chatbot

Nevertheless, this was a very fun project and I learned a lot from it.

The code for this project is available on GitHub.