Original article was published on Artificial Intelligence on Medium

# The relationship between Perplexity and Entropy in NLP

## Use Information Theory to understand NLP Metrics

Perplexity is a common metric to use when evaluating language models. For example, scikit-learn’s implementation of Latent Dirichlet Allocation (a topic-modeling algorithm) includes perplexity as a built-in metric.

In this post, I will define perplexity and then discuss entropy, the relation between the two, and how it arises naturally in natural language processing applications.

# Context

A quite general setup in many Natural Language tasks is that you have a language *L* and want to build a model *M* for the language. The “language” could be a specific genre or corpus like “English Wikipedia”, “Nigerian Twitter”, or “Shakespeare”, or (conceptually at least) just a generic language like “French.”

Specifically, by a language *L*, we mean a process for generating text. For clarity, we will consider the case where we are modeling sentences and the text consists of a sequence of words ending with an end-of-sentence “word.” But you can replace “word” with “token” and “sentence” with “document” to generalize to any context.

What is a “process”? For our purposes, we can think of a process as a collection of probability distributions. Given a history *h* consisting of the previous words in a sentence, the language *L* gives the probability that the next word is *w*:
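Writing the history as *h* and the candidate next word as *w*, this is:

```latex
L(w \mid h) = \Pr(\text{next word is } w \mid \text{history } h)
```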

For example, I am willing to wager that if L is “English”:

- L(dog | The quick brown fox jumps over the lazy) ≈ 1
- L(ipsum | Lorem) ≈ 1
- L(wings | Buffalo buffalo buffalo Buffalo buffalo) ≈ 0

Similarly, given an entire sentence *s*, we can evaluate L(*s*), the probability of the sentence occurring. If we include a special beginning-of-sentence “word” w₀ and let the n-th “word” be the end-of-sentence “word”, we get
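the sentence probability as a product of conditional next-word probabilities:

```latex
L(s) = \prod_{i=1}^{n} L(w_i \mid w_0, w_1, \ldots, w_{i-1})
```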

However, it is common to leave out the first term in the product, or sometimes to work with an even longer starting context.

It is surprisingly easy to get a perfect replica of *L* for (say) spoken American English: just flag down any native English speaker walking down the street. Of course, we are usually interested in teaching a computer the model (hence, Machine Learning). So we will let *M* be whatever language model we have managed to build on a computer.

This setup, with a language *L* and a model *M*, is quite general and plays a role in a variety of Natural Language tasks: speech-to-text, autocorrect, autocomplete, machine translation – the list goes on. Autocomplete is the most obvious example: given the words someone has typed so far, try to guess what they might type next by picking the highest-probability completion.¹
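As a minimal sketch of that autocomplete idea, here is a toy bigram model (the corpus and function names are hypothetical, purely for illustration, not anything from a real system):

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count, for each word, how often each next word follows it."""
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1
    return model

def autocomplete(model, word):
    """Return the highest-probability next word, or None if unseen."""
    candidates = model.get(word)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps",
]
model = train_bigram_model(corpus)
print(autocomplete(model, "lazy"))  # "dog": it follows "lazy" in both sentences
```

Real systems condition on much longer histories than a single word, but the principle, pick the completion the model assigns the highest probability, is the same.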

# Perplexity

Given a language model M, we can use a held-out dev (validation) set to compute the perplexity of a sentence. The perplexity of a sentence *s* is defined as:
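using the factorization of L(*s*) above:

```latex
PP(s) = L(s)^{-1/n}
      = \sqrt[n]{\prod_{i=1}^{n} \frac{1}{L(w_i \mid w_0, \ldots, w_{i-1})}}
```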

You will notice from the second line that this is the inverse of the *geometric mean* of the terms in the product’s denominator. Since each word has its probability (conditional on the history) computed once, we can interpret this as a *per-word* metric. This means that, all else being equal, the perplexity is not affected by sentence length.

In general, we want our probabilities to be high, which means the perplexity is low. If all the probabilities were 1, then the perplexity would be 1 and the model would perfectly predict the text. Conversely, for poorer language models, the perplexity will be higher.
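To make the definition concrete, here is a small sketch that computes perplexity from a list of per-word probabilities (the function name is mine, not a standard API; the sum is done in log space to avoid underflow on long sentences):

```python
import math

def perplexity(word_probs):
    """Perplexity = inverse geometric mean of the per-word probabilities."""
    n = len(word_probs)
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / n)

print(perplexity([1.0, 1.0, 1.0]))  # 1.0 - a perfect model
print(perplexity([0.25] * 8))       # ~4.0 - like guessing among 4 equally likely words
```

Note that the second result does not depend on the sentence being 8 words long: any length of words each at probability 0.25 gives the same per-word perplexity.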

It’s hard to provide a benchmark for perplexity because, like most Natural Language tasks, the metric is highly dependent on the vocabulary size. Given a corpus, a smaller vocabulary means that other words will all be replaced with an <oov> (out-of-vocabulary) token, instantly increasing the apparent quality of any language model trained on it.

Here are some benchmarks:

- **State of the art.** For WikiText-103, a selection of ~28,000 high-quality Wikipedia articles and a large (0.4% OOV rate) vocabulary, the state-of-the-art perplexity for a language model (as of this writing) is **10.8**.
- **Worst-case scenario.** On any dataset, the baseline model is to just guess a word in the vocabulary randomly, with equal probability for each. In this case, the perplexity is just the vocabulary size: **267,735** for WikiText-103, but substantially smaller (**33,278**) for WikiText-2. 30,000, in general, is a pretty reasonable size for a language model’s vocabulary.
- **Best-case scenario.** I said above that the “best” possible perplexity is 1. But for that to be true, there could only be one possible sentence in a language, which is quite boring.² A recent paper exploring text generation uses OpenAI’s GPT-2 (the large version, with perplexity 22.1 on WikiText-103). On their dataset of choice (WebText, which GPT-2 was trained on), they find a perplexity of **12.4**. But, crucially, they find that, while their model is capable of generating text with much lower perplexity (1.5!), the generated text is either repetitive or incoherent. Staying closer to human perplexity is better!

This last point is very important. **There is a lower bound** on perplexity fixed by the language itself. We will see this mathematically below. But this points to a general feature of metrics in NLP: an easy-to-evaluate metric like perplexity is not necessarily the best predictor of the true performance of a model. Perplexity is good for development (validation) but not necessarily for evaluation. The gold standard for evaluation remains human evaluation.

# Entropy

Entropy is a slippery concept in physics, but it is quite straightforward in information theory. Suppose you have a process (like a language *L* that generates words). At each step in the process, there is some probability *p* that the event that occurred was going to happen. The amount of **surprisal** is –log(*p*), where the logarithm is taken in any base you want (equivalent to changing units). Low-probability events have high surprisal. Events that were certain to happen (*p* = 1) have zero surprisal. Events that are impossible (*p* = 0) have infinite surprisal.

The entropy is the expected value of the surprisal across all possible events indexed by *i*:
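that is, the probability-weighted sum of surprisals:

```latex
H = \mathbb{E}\bigl[-\log p\bigr] = -\sum_i p_i \log p_i
```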

So, the entropy is the average amount of surprise when something happens.

Entropy in base 2 is also the optimal number of bits it takes to store the information about what happened, by Claude Shannon’s source coding theorem. For example, if I told you that a full-length tweet of 280 characters had an entropy of 1 bit per character, that means that, by the laws of mathematics, no matter what Twitter does, they will always need at least 280 bits (35 bytes) of storage for that tweet in their database. (In practice, of course, they need quite a bit more.)
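A quick sketch of entropy in bits, computed directly from the surprisal definition (the function name is mine; only the standard library is assumed):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: the expected surprisal -log2(p)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))  # 1.0 - a fair coin flip carries exactly one bit
print(entropy_bits([0.9, 0.1]))  # ~0.469 - a biased coin is more predictable, so fewer bits
```

The biased-coin case is the whole point of the source coding theorem: more predictable processes can be compressed into fewer bits.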

In the context of our language model, we’ll have to make one tweak. Given that we are interested in sentences *s* (sequences of events) of length *n*, we’ll define the entropy rate per word (event) as:
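one standard way to write this averages the surprisal over the *n* words of the sentence:

```latex
H(s) = -\frac{1}{n} \log L(s) = -\frac{1}{n} \sum_{i=1}^{n} \log L(w_i \mid w_0, \ldots, w_{i-1})
```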