How to teach your laptop to speak if you don’t have a parakeet

Source: Deep Learning on Medium

I had some executive time and decided to see if I could teach a computer to lead the free world. Or at least sound like it was leading the free world. Or at least infer a generative probabilistic model through deep recurrent neural network training which would allow consecutive character pattern sampling in the style of the pair of thumbs belonging to the current leader of the free world.

I wanted to teach my laptop to tweet like Donald Trump.

Can my laptop speak?

Natural Language Processing is a tricky thing for machines. Computers are very good at getting from number A to number B. But as someone who yells at their computer on a semi-regular basis, I can tell you they’re not so great at understanding words. Words pose several problems when it comes to machine learning.

First, computers are designed to work with numbers, and machine learning algorithms are designed to work on computers. They rely on estimating probability distributions, multiplying matrices and exploring multi-dimensional geometric spaces, all of which are represented with numbers. Numbers are stored in machines more or less as they are, and you can perform standard operations on them in a straightforward way. Words are harder to add and subtract. Can you add the words “Crooked” and “Hillary” together? It turns out you can, sort of; models that can do that are commonly referred to as “word2vec” models. But you need to transform the words into numbers first, which can be done in several different ways, and this choice will lead to different behavior in the algorithms down the line.
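To see what “adding words” can look like, here is a toy sketch with made-up 3-dimensional vectors. Real word2vec embeddings have hundreds of learned dimensions; the numbers below are invented purely for illustration:

```python
import numpy as np

# Invented 3-dimensional "word vectors" (real ones are learned from text).
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.1]),
    "queen": np.array([0.1, 0.8, 0.1]),
}

def closest(v, skip=()):
    """Return the vocabulary word whose vector is nearest to v."""
    return min((w for w in vectors if w not in skip),
               key=lambda w: np.linalg.norm(vectors[w] - v))

# The classic word2vec party trick: king - man + woman lands near queen.
result = closest(vectors["king"] - vectors["man"] + vectors["woman"],
                 skip=("king", "man", "woman"))
```

With these toy numbers the arithmetic works out exactly; in a real embedding space it only works approximately, which is part of what makes the choice of representation matter.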

The second problem is defining what we want the computer to “understand”. How do you not only capture the meaning of a word, but capture it with a number? What would this meaning actually be? In practice much of machine learning pretty much ignores the philosophical questions here (to my knowledge) in order to find a functional workaround, but these questions are worth considering every so often. A number itself is a relatively straightforward object, at least where computers are concerned, and people and computers can both agree on the meaning of the number 5. Its interaction with other numbers is defined at a fundamental level by basic operations, such as addition and multiplication. Words, on the other hand, take on radically different meanings in different contexts. “New” in “My new car” and “I live in New York” is the same word, but in the first case it specifies that I am referring to my Lotus and not my Ferrari (I can dream about problems to have, right?), and in the second it refers to the best place to find a rat chasing a slice of pizza. Even more confusing for machines is punctuation. Most of us are aware that there is a strong distinction in meaning between “Let’s eat, Grandma” and “Let’s eat Grandma”, but the word ordering is the same.

Machine learning gets around the philosophical quagmire by designing algorithms that recognize statistical patterns. Basically, each word is assigned a number, and we look for consistent patterns in these numbers. If the models are designed to be flexible enough, and they’re shown enough examples, they can derive patterns that appear consistently in the examples they were shown. In the case of language modeling, the pattern we are trying to identify is the probability distribution over words as used by a certain person. Once we know how probable a word or a sentence is, we can randomly generate new samples that follow this same distribution, and if done properly this should lead us to sound like the person who generated the original examples. This is easier said than done. How do you model such a distribution? What is the probability of “covfefe”?

First give it a brain

There are plenty of ways to at least estimate this distribution which would fall under the category of AI. I could count all the word frequencies in public records of Trump’s discourse, and sample random words at these frequencies one after the other. I could develop a human-level general artificial intelligence, an animatronic robot and a time machine, and send the resulting entity back to post-war Queens to be raised by a real-estate magnate.
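That word-counting baseline is only a few lines of Python, by the way. Here is a sketch, where the corpus is a made-up scrap rather than the real tweet archive:

```python
import random
from collections import Counter

# A hypothetical scrap of corpus; the real thing would be the full tweet record.
corpus = "make america great again we will make america win again".split()

counts = Counter(corpus)           # how often each word appears
words = list(counts)
weights = [counts[w] for w in words]

random.seed(0)
# Sample words independently at their observed frequencies: no grammar,
# no memory, just the right word histogram.
babble = " ".join(random.choices(words, weights=weights, k=8))
```

The output has the right vocabulary statistics and zero coherence, which is exactly why counting words is boring.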

But counting words is boring, and the only place I have known both a general AI in an animatronic body and a time machine to come from is Tony Stark’s garage. So I settled for the next best thing: Recurrent Neural Networks, or RNNs. Neural networks are cool, but when you make them recurrent you make them cooler. Because now there is an extra letter in the acronym.

RNNs are built of neurons specifically designed for time-series data: data that comes in a given order, where you only see one piece at a time. This makes them particularly suited to looking at sentences, as they can “read” the words in order, one by one. The graph below illustrates what this looks like: each x is a word, and they are introduced one by one into the network. h represents the internal computation of the network at each step, and o is the output.

Unrolled recurrent network — Chapter 10

However, especially when it comes to language, memory is essential. Not only for knowing vocabulary, but for remembering what the last few words you read were, which helps make sense of the word currently being read. So I used a special variety of recurrent neurons in the network known as Long Short-Term Memory cells, or LSTMs. These neurons are designed to not only look at and try to “understand” a word, but also update an “internal state”, which acts like a memory of past events.

Here is what the standard LSTM cell looks like. Notice that it now takes not only a word x and previous computation h, but also stores a “memory” in c that is passed on from one cell to another.
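For the curious, one step of the standard LSTM cell can be sketched in plain NumPy. This is the textbook cell rather than my exact Keras layer, but the gates match the description above: the memory c is selectively erased, written to, and then partially exposed as the new h:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the four gates' parameters stacked."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b        # all four gate pre-activations at once
    f = sigmoid(z[0 * n:1 * n])       # forget gate: what to erase from memory
    i = sigmoid(z[1 * n:2 * n])       # input gate: what to write
    o = sigmoid(z[2 * n:3 * n])       # output gate: what to expose
    g = np.tanh(z[3 * n:4 * n])       # candidate memory contents
    c = f * c_prev + i * g            # updated memory, passed to the next step
    h = o * np.tanh(c)                # new hidden state
    return h, c

# One step on random parameters, just to show the shapes involved.
rng = np.random.default_rng(0)
n, m = 4, 3                           # hidden size, input size (illustrative)
W = rng.normal(size=(4 * n, m))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
h, c = lstm_step(rng.normal(size=m), np.zeros(n), np.zeros(n), W, U, b)
```

The key design choice is that c flows from step to step through addition rather than repeated multiplication, which is what lets the memory persist over long sequences.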

Speak, my child

Setting up a neural network to do something useful (or in this case at least mildly amusing) takes more than just a model with fun acronyms. You also need a machine learning task with fun acronyms. Once the network is set up, it is trained to act as a Generative Language Model. The network reads each word, and at every word is asked to predict the one after it. For example, the model will be given “The Failing New York Times”. It reads “The”, and is asked to predict “Failing”. It then sees “The Failing”, and is asked to predict “New”. It then sees “The Failing New” and should predict “York”. It continues this for every sentence it is given. Once it has been trained to do this, all you have to do is give it the beginning of a sentence and ask it to complete it. If you want it to come up with something on its own, give it nothing and ask it to fill in the blanks.
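The training setup above is easy to write down. The helper name here is mine, but the (context, target) pairs are exactly the ones just described:

```python
def next_word_pairs(sentence):
    """Turn one sentence into (context, target) training examples."""
    words = sentence.split()
    # At each position, the context is everything read so far and the
    # target is the very next word.
    return [(words[:i], words[i]) for i in range(1, len(words))]

pairs = next_word_pairs("The Failing New York Times")
# First pair: context ["The"], target "Failing"
```

At generation time the procedure runs in reverse: feed the model a context, sample a word from its predicted distribution, append it, and repeat.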

These networks take a bit of computational power. The most successful language models are trained with millions of parameters over billions of words, quantities of data not yet reached by Donald Trump. In theory, models like these can fit almost any pattern, given enough computing resources and enough data. In practice, however, this requires vast amounts of computing power and even larger amounts of data. For this project I was really too lazy to get more data than Trump’s tweet record, and that amount can be handled pretty well on a laptop.

An arguably better way would be to get a huge dataset of tweets, say 1 million or so, and train a model with maybe 10 million parameters on those. Afterwards I could tweak the model using only the person I wanted to imitate. Think of this like first teaching the model to speak English (or whatever people speak on twitter), before telling it how to imitate someone. It’s easier to learn nuances of style once you understand what phrases constitute nuance and what is generic English.

For context, previous state-of-the-art generative language models used from 13 million to 90 million parameters, and the latest one by OpenAI, called GPT-2, has 1.5 billion. With a ‘b’. It’s powerful enough that they decided not to publish the full model, because the text it generated sounded too plausible; imagine a world where fake news could be automatically generated by a computer. Their post about the model and the policies around it is a good read. Their examples are also a lot of fun, my personal favorite being a convincing report of scientists discovering a valley of unicorns in South America.

My model is fairly small, only around 1 million parameters, and I could train it on my laptop. It looks like this:

Layers in language model — Keras output

Here is what those layers do, one by one.

  1. The input layer takes a series of words and turns it into a series of numbers.
  2. The embedding layer turns each series of numbers into a series of vectors, which allows the model to be more flexible in its internal representation of data, as each word now has several numbers instead of just one. My model represents words as vectors with 256 distinct coordinates, so each word gets its own unique set of 256 numbers. Because the model gets to “learn” how to solve the task, it now has 256 different numbers it can tweak for each word. We allow it to define the best way to represent the words, or “imagine” them if you will, giving it greater flexibility before passing them into the language model.
  3. The stacked LSTM layers are the cool sequential neurons I mentioned earlier. This is the part where the model starts taking inputs and predicting outputs. The LSTMs are essentially the “comparison” part, where the network evaluates elements in the sequence.
  4. The last Time Distributed layer is a sort of shortcut in Keras, the deep learning framework I used, where I get to tell the model to apply the same transformation across all steps in time. This is used for the prediction step. I define the operation for predicting the next word, and it automatically distributes this across the entire sequence. It helps to define things this way because of how neural networks are represented on the computer: each network is built as a “computational graph”, where neurons representing operations are linked to one another.
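Wiring those four layers together in Keras looks roughly like this. The vocabulary size and layer widths below are illustrative guesses, not my exact configuration:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 5000   # hypothetical vocabulary size, not the real count
embed_dim = 256     # each word becomes a vector of 256 numbers

model = keras.Sequential([
    layers.Embedding(vocab_size, embed_dim),              # words -> vectors
    layers.LSTM(128, return_sequences=True),              # stacked LSTMs read
    layers.LSTM(128, return_sequences=True),              #   the sequence
    layers.TimeDistributed(                               # predict the next
        layers.Dense(vocab_size, activation="softmax")),  #   word at each step
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# One forward pass on a dummy batch: 1 sequence of 4 word indices in,
# a next-word distribution out at every position.
out = model(np.zeros((1, 4), dtype="int32"))
```

Note that `return_sequences=True` is what lets each LSTM hand a full sequence of hidden states to the layer above, so the Time Distributed prediction can happen at every step.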

What does my laptop have to say?

Honestly, not much. Not incredibly surprising, as this model is very simple compared to state-of-the-art approaches to the problem, and I didn’t expect anything earth-shattering. But a few of its tweets did make me chuckle a bit:

  • “Geraldo is just a low big threat for the future. He lost the raise you need. You will be going to soon!”
  • “@realDonaldTrump Who is the best films for me? “ Can be a great show.”
  • “@NeAQ8xZkgy We need you — you talk more than have no clue.”

Predicting one character at a time is actually not bad either:

  • “Met’s said that visit in Canada Repain man are on years & Republicans in Donny, who many military next very election, in its. First every idea, we will #MyLain”
  • “Few must used, that is the day — frust right to do it being many havings and losiques to the 201 terrific speech, making the whouse!”
  • “@Reperbeck was making my canced representers, is an immaining a very impicial who he’ll has at coming his families and restrating event that hates on Email Cimenicina, and thank you. She is in that it would go win — its #IPine just build the best history). I will #MakeAmericaGreatAgain”

I was particularly happy to see that the model learned words from characters for the most part, as the second model only ever saw one letter at a time. It even learned to predict the phrase “I will #MakeAmericaGreatAgain”, character by character, which I thought was really cool. But the tweets don’t really make sense as a whole.

My guess is that it has many more examples of words than of entire tweets, so it learns to be very good on those. It also gets less confused when it’s learning, because it only has to choose between 317 distinct characters at each step instead of 52,347 distinct words.
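The character-by-character generation loop itself is simple. Here is a sketch where `fake_next_char_probs` is a stand-in for the trained network's softmax output (the real model would return its predicted next-character distribution):

```python
import numpy as np

rng = np.random.default_rng(42)
chars = list("abcdefghijklmnopqrstuvwxyz ")   # illustrative tiny alphabet

def fake_next_char_probs(text):
    """Stand-in for the trained network: a uniform next-char distribution."""
    return np.full(len(chars), 1.0 / len(chars))

def generate(seed, length, next_char_probs):
    """Grow text one character at a time by sampling from the model's
    distribution (sampling, not argmax, is what keeps the output varied)."""
    text = seed
    for _ in range(length):
        p = next_char_probs(text)
        text += chars[rng.choice(len(chars), p=p)]
    return text

sample = generate("covfefe", 20, fake_next_char_probs)
```

Swap the stand-in for the real model's prediction function and the same loop produces the tweets above.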

I’m working on a more technical version of this post, and in the meantime, if you’re interested, you can find the code on GitHub. There are people who have done this more seriously, as well as more recent language-generation models using more sophisticated techniques. The OpenAI post I mentioned earlier is quite good in this regard.

In any case I had fun. Who knows, maybe my computer will actually make sense someday.