Original article can be found here (source): Artificial Intelligence on Medium
Building an AI to talk to people for me
Building a neural network consists of three phases: preprocessing, training, and finally sampling. To keep this article short I’m not going to go into detail here, but if you’d like to learn how this all works there’s a series of articles entitled Machine Learning is Fun that does a much better job of explaining things than I can. There’s also another very important step before preprocessing: before we can start training a neural network, we need some data to train it on.
What we’re essentially trying to do is imitate what I say in online messages. For the best results, we’ll need as many examples as possible of conversations I’ve had online. While it’s possible to extract this data from Slack, I’m also quite a heavy WhatsApp user, and there happen to be several tools out there for extracting your messages from it. My genius plan is as follows:
1. Dump my WhatsApp db
2. Extract the conversation data
3. Feed it to Torch-rnn
Well, here goes nothing.
Dumping my WhatsApp database
Turns out this is fairly straightforward if you have a rooted Android device.
The database, encryption key, and contacts storage all live under WhatsApp’s private app data directories. For convenience I used Total Commander to copy them to a more accessible folder, and then copied them over to my machine from there.
Now that’s sorted, on to getting some useful data out of it!
Decrypting my WhatsApp database
This is also fairly straightforward, if I’m willing to trust an online tool to do it for me. This one was made by the XDA user Xorphex.
It’s a risk I’m willing to take. I don’t see any data in my browser’s debug network tab, and I haven’t seen my chat logs appear for sale on any darknets, so I’m going to naively assume this was safe to do.
At the end of the process, I’m left with a SQLite database dump.
Extract everything I’ve ever said
To build a proper conversational chatbot, I’d ideally feed it full conversations. Seeing as the data’s from WhatsApp as opposed to a publicly accessible chat, it’s not really ethical to use what other people have said to me in private conversations without their consent. It’s also a GDPR violation, and we wouldn’t want that.
I’ll settle for the next best thing, and stick with everything I’ve ever said:
sqlite> .output ./msg_dump
sqlite> select data from messages where key_from_me=1 order by timestamp desc;
~19,000 lines dumped
The more data we get, the better. 19k lines should be sufficient.
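For anyone who’d rather script this than type at the sqlite prompt, the same query can be run from Python. This is a sketch against an in-memory stand-in that mimics the relevant bits of the `messages` table, since the real decrypted database stays on my machine:

```python
import sqlite3

# Build a tiny in-memory database that mimics the relevant columns of
# WhatsApp's `messages` table (data, key_from_me, timestamp).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (data TEXT, key_from_me INTEGER, timestamp INTEGER)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [("hi!", 1, 2), ("hello", 0, 1), ("on my way", 1, 3)],
)

# key_from_me=1 keeps only messages I sent, newest first -- the same
# filter as the sqlite shell command above.
rows = conn.execute(
    "SELECT data FROM messages WHERE key_from_me=1 ORDER BY timestamp DESC"
).fetchall()
lines = [row[0] for row in rows]
```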
Preprocess the data
Inspired by the blog AI Weirdness, I’m going to use torch-rnn for this because once again, it’s fairly straightforward. There’s also a Dockerised version of it that makes setup a lot easier.
# docker run --rm -v ~/dev/HACKS/whatsapp-bot:/data -ti crisbal/torch-rnn:base bash
# python scripts/preprocess.py \
# --input_txt /data/msg_dump \
# --output_h5 /data/msg_dump.h5 \
# --output_json /data/msg_dump.json
This process is known as “tokenisation”. It takes our conversational data and converts it into a sequence of numeric tokens ready to use as input to a neural network. In this case, every distinct character that appears in my messages gets its own integer ID, and the text becomes a sequence of those IDs (which the network later expands into one-hot vectors).
TL;DR: We convert the letters into numbers, so that our neural network can read them.
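A rough sketch of what’s happening under the hood (the real preprocess.py also writes the HDF5/JSON files and splits the data into train/validation sets; this is purely illustrative):

```python
# Minimal character-level tokenisation, as used for char-RNN training.

def build_vocab(text):
    # Assign each distinct character a small integer ID.
    return {ch: i for i, ch in enumerate(sorted(set(text)))}

def encode(text, vocab):
    # Replace every character with its ID.
    return [vocab[ch] for ch in text]

vocab = build_vocab("hello bot")   # {' ': 0, 'b': 1, 'e': 2, 'h': 3, ...}
tokens = encode("hello", vocab)    # [3, 2, 4, 4, 5]
```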
Train the neural network
Just like the biological brains it was inspired by, a neural network needs time to learn by trial and improvement. We call this “training”, and it’s often the most lengthy part of the process.
As a test run, I used the default network structure of 2 layers of size 128 with a word vector of size 64, trained for 500 iterations. From an initial validation loss of 5, it got down to 1.89. We’ll use those loss scores for comparison later.
# th train.lua -input_h5 /data/msg_dump.h5 -input_json /data/msg_dump.json -gpu -1 -checkpoint_every 100 -rnn_size 512 -num_layers 3 -checkpoint_name /data/cv/checkpoint_512_3
It took about an hour and made my laptop fan go nuts.
Saying our first words
Now that the model is trained, let’s get it to spit out some sentences at us. This is known as “sampling” and is reasonably fast, especially compared to the time and effort it took to train. With any luck it will sound exactly like me!
# th sample.lua -checkpoint /data/cv/checkpoint_128_2_500.t7 -length 2000 -gpu -1
Aw, a. Gad the, G harks to couplich maket
Jucgle, goo sved ardis jo got igh, toall! Shigkirsle
Tkick anmit.) it go aldetting thes there. Cow’s coout, Frepal(/my time wesn Sywest of a kod wike latgne day erring!
osifrecough/thars for llerlight Wike sbeitherbart the tilbed raay nater you on mree.!
Still, it seems to have got the concept of words, sentences, and punctuation. Not bad for a test run. With more time and neurons, it might be able to learn English.
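Sampling, at its core, means repeatedly asking the network for a probability distribution over the next character and picking from it; a “temperature” knob controls how adventurous the picks are (sample.lua exposes this via its -temperature flag). A minimal sketch with made-up probabilities:

```python
import math
import random

def apply_temperature(probs, temperature):
    # Rescale a probability distribution: low temperature sharpens it
    # (favouring the likeliest character), high temperature flattens it.
    exps = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_char(chars, probs, temperature=1.0):
    # Pick one character according to the rescaled distribution.
    return random.choices(chars, weights=apply_temperature(probs, temperature))[0]

# A made-up next-character distribution after some prefix:
chars, probs = ["e", "a", "o"], [0.7, 0.2, 0.1]
cold = apply_temperature(probs, 0.5)  # sharper: "e" dominates even more
hot = apply_temperature(probs, 2.0)   # flatter: rarer characters show up more often
```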
Throwing lots of time and neurons at it
This really strained my laptop. I left it running over a long weekend and prayed it didn’t catch fire.
# th train.lua -input_h5 /data/msg_dump.h5 -input_json /data/msg_dump.json -gpu -1 -checkpoint_every 200 -rnn_size 512 -num_layers 4 -wordvec_size 128 -dropout 0.01 -checkpoint_name …
It’s quite an upgrade from earlier: 4 layers of size 512, with a word vector of size 128 and a 1% dropout rate, trained over 15,450 iterations. Compared to our test run’s loss of 1.89, we got down to 1.39. It doesn’t sound like much, but as we’ll see it gives us significantly better results.
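One way to make those loss numbers tangible: assuming the loss is the average per-character cross-entropy (in nats), exp(loss) gives the model’s perplexity, roughly how many next characters it’s torn between at each step:

```python
import math

def perplexity(loss):
    # exp of the average per-character cross-entropy (natural log base).
    return math.exp(loss)

test_run = perplexity(1.89)  # ~6.6 plausible next characters
big_run = perplexity(1.39)   # ~4.0, noticeably less confused
```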
The crawl sausages will possibly cut in a pig quick. The things not as to hang and it died
This time around, it’s been able to learn English. You can actually see this in the loss curve: during training it plateaued around the 1.8 mark for a while, before making rapid improvements down to 1.5 and eventually levelling off around 1.4. My speculation is that the plateau corresponds to the gap between learning words and general sentence structure, and learning to make those words sound like English, which brings with it an improvement in accuracy.
With this network trained, let’s generate some amusing samples!
I thinks anneal more isn’t the chances
I just drinked trip for somempemper as Block like ahtwith filming to get void are you for guidhing thoill nitery fams it!
There are plenty of short sentences that almost sound like something I’d say.
It’s even learnt to use emojis.
Some sentences are short, but sound exactly like something I’d say.
Some of them don’t.
Uur and talk you still got half that later.?
Anyone tickets, but it appoy?
Longer sentences seem to be a lot harder.
Aye – exit on their day. Its an elploded sound like is $2 is giving me a bit better together
Dammit, I forgot that one guys going to burn actually a lot of use of the outphour hard and got some you guys up to CUrom
Argh – I can’t imagine a lot of making deprived, I went at 4, but gloser to staying a bit stuff if it makes water in the attack when I’m back. This might cake week?
We should be home soon – not sure you are would be a bit of if there’s speculation attack
I’m not sure what to make of some of it.
Next time, I don’t think I’m going to try!
So, is it possible to train an AI to talk to people for me?
No. Or at least, not with the technology I’m using.
While it’s entertaining to try and train a neural network to imitate myself, imitation is all it is. It’s like a parrot — it can talk, but it doesn’t understand what it’s saying.
There are uses for something like this, though. Being able to predict what letter comes next in a sequence underpins a lot of predictive typing systems; your phone keyboard likely makes use of a (somewhat better trained) one every day.
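As a toy illustration of that idea (a sketch, not how any real keyboard works), even a simple frequency table of which character follows which can make next-letter predictions:

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    # Count how often each character follows each other character.
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, char):
    # Suggest the most frequent follower of `char`.
    return counts[char].most_common(1)[0][0]

model = train_bigrams("the cat sat on the mat")
# "a" is always followed by "t" in this text, so that's the prediction.
```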