Source: Machine Learning Mastery

Information theory is a subfield of mathematics concerned with transmitting data across a noisy channel.

A cornerstone of information theory is the idea of quantifying how much information there is in a message. More generally, the same idea can be used to quantify the information in an event and in a random variable; the latter is called entropy, and both are calculated using probability.

Calculating information and entropy is a useful tool in machine learning and is used as the basis for techniques such as feature selection, building decision trees, and, more generally, fitting classification models. As such, a machine learning practitioner requires a strong understanding and intuition for information and entropy.

In this post, you will discover a gentle introduction to information entropy.

After reading this post, you will know:

- Information theory is concerned with data compression and transmission; it builds upon probability and supports machine learning.
- Information provides a way to quantify the amount of surprise for an event measured in bits.
- Entropy provides a measure of the average amount of information needed to represent an event drawn from a probability distribution for a random variable.

Let’s get started.

## Overview

This tutorial is divided into three parts; they are:

- What Is Information Theory?
- Calculate the Information for an Event
- Calculate the Information for a Random Variable

## What Is Information Theory?

Information theory is a field of study concerned with quantifying information for communication.

It is a subfield of mathematics and is concerned with topics like data compression and the limits of signal processing. The field was proposed and developed by Claude Shannon while working at the US telephone company Bell Labs.

Information theory is concerned with representing data in a compact fashion (a task known as data compression or source coding), as well as with transmitting and storing it in a way that is robust to errors (a task known as error correction or channel coding).

— Page 56, Machine Learning: A Probabilistic Perspective, 2012.

A foundational concept from information theory is the quantification of the amount of information in things like events, random variables, and distributions.

Quantifying the amount of information requires the use of probabilities, hence the relationship of information theory to probability.

Measurements of information are widely used in artificial intelligence and machine learning, such as in the construction of decision trees and the optimization of classifier models.

As such, there is an important relationship between information theory and machine learning, and a practitioner must be familiar with some of the basic concepts from the field.

Why unify information theory and machine learning? Because they are two sides of the same coin. […] Information theory and machine learning still belong together. Brains are the ultimate compression and communication systems. And the state-of-the-art algorithms for both data compression and error-correcting codes use the same tools as machine learning.

— Page v, Information Theory, Inference, and Learning Algorithms, 2003.

## Calculate the Information for an Event

Quantifying information is the foundation of the field of information theory.

The intuition behind quantifying information is the idea of measuring how much surprise there is in an event. Those events that are rare (low probability) are more surprising and therefore have more information than those events that are common (high probability).

- **Low Probability Event**: High Information (*surprising*).
- **High Probability Event**: Low Information (*unsurprising*).

The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred.

— Page 73, Deep Learning, 2016.

Rare events are more uncertain or more surprising and require more information to represent them than common events.

We can calculate the amount of information there is in an event using the probability of the event. This is called “*Shannon information*,” “*self-information*,” or simply the “*information*,” and can be calculated for a discrete event *x* as follows:

- information(x) = -log( p(x) )

Where *log()* is the base-2 logarithm and *p(x)* is the probability of the event *x*.

The choice of the base-2 logarithm means that the units of the information measure are bits (binary digits). This can be directly interpreted in the information processing sense as the number of bits required to represent the event.

The calculation of information is often written as *h()*; for example:

- h(x) = -log( p(x) )

The negative sign ensures that the result is always positive or zero.

Information will be zero when the probability of an event is 1.0 or a certainty, e.g. there is no surprise.
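As a rough sketch of the points above, the snippet below wraps the calculation in a small helper and evaluates it for a few probabilities. The name `information` is an illustrative choice, not something defined elsewhere in this post. It computes log2(1/p), which is algebraically the same as -log2(p) but avoids printing a negative zero when the probability is 1.0.

```python
# sketch: information for events of decreasing probability
from math import log2

def information(p):
    """Return the Shannon information, in bits, of an event with probability p."""
    # hypothetical helper; only probabilities in (0, 1] are meaningful here
    if not 0.0 < p <= 1.0:
        raise ValueError('p must be in the interval (0, 1]')
    # log2(1/p) equals -log2(p), but yields 0.0 rather than -0.0 for p=1.0
    return log2(1.0 / p)

# a certain event, then progressively rarer events
for p in [1.0, 0.5, 0.25, 0.125]:
    print('p(x)=%.3f, information: %.3f bits' % (p, information(p)))
```

With these inputs, the certain event carries 0 bits and each halving of the probability adds one more bit.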

Let’s make this concrete with some examples.

Consider a flip of a single fair coin. The probability of heads (and tails) is 0.5. We can calculate the information for flipping a head in Python using the log2() function.

```python
# calculate the information for a coin flip
from math import log2
# probability of the event
p = 0.5
# calculate information for event
h = -log2(p)
# print the result
print('p(x)=%.3f, information: %.3f bits' % (p, h))
```

Running the example prints the probability of the event as 50% and the information content for the event as 1 bit.

```
p(x)=0.500, information: 1.000 bits
```

If the same coin was flipped n times, then the information for this sequence of flips would be n bits.
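One way to check this claim is sketched below. It assumes the flips are independent, so the probability of one specific sequence of n flips is the per-flip probability raised to the power n. The function name `sequence_information` is made up for this illustration.

```python
# sketch: information in a specific sequence of n independent fair coin flips
from math import log2

def sequence_information(p, n):
    """Information, in bits, of one specific sequence of n independent events,
    each with probability p (hypothetical helper for illustration)."""
    # independence assumption: probabilities multiply
    p_sequence = p ** n
    return -log2(p_sequence)

for n in [1, 2, 8]:
    print('n=%d, information: %.3f bits' % (n, sequence_information(0.5, n)))
```

Because the probabilities multiply, the logarithm turns the product into a sum, which is why the information of the sequence is the sum of the information of each flip.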

If the coin was not fair and the probability of a head was instead 10% (0.1), then the event would be more rare and would require more than 3 bits of information.

```
p(x)=0.100, information: 3.322 bits
```
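The unfair coin result can be reproduced with a small change to the coin flip example. The sketch below also reports the information for a tail, which is not discussed in the text and is included only for contrast; the variable names are illustrative.

```python
# sketch: information for the two outcomes of an unfair coin
from math import log2

# assumed probabilities: a head occurs 10% of the time, a tail otherwise
p_head = 0.1
p_tail = 1.0 - p_head

# information for each outcome
h_head = -log2(p_head)
h_tail = -log2(p_tail)

print('p(head)=%.3f, information: %.3f bits' % (p_head, h_head))
print('p(tail)=%.3f, information: %.3f bits' % (p_tail, h_tail))
```

The rare outcome (a head) carries far more information than the common outcome (a tail), in line with the intuition about surprise.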

We can also explore the information in a single roll of a fair six-sided die, e.g. the information in rolling a 6.

We know the probability of rolling any number is 1/6, which is a smaller number than 1/2 for a coin flip, therefore we would expect more surprise or a larger amount of information.

```python
# calculate the information for a dice roll
from math import log2
# probability of the event
p = 1.0 / 6.0
# calculate information for event
h = -log2(p)
# print the result
print('p(x)=%.3f, information: %.3f bits' % (p, h))
```

Running the example, we can see that our intuition is correct and that indeed, there is more than 2.5 bits of information in a single roll of a fair die.

```
p(x)=0.167, information: 2.585 bits
```

Other logarithms can be used instead of the base-2. For example, it is also common to use the natural logarithm that uses base-e (Euler’s number) in calculating the information, in which case the units are referred to as “*nats*.”
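As a small illustration of the change of base, the sketch below computes the information for a fair coin flip in both bits and nats using the standard library `log2()` and `log()` functions. The conversion step relies on the change-of-base identity for logarithms and is not taken from the text above.

```python
# sketch: the same event measured in bits and in nats
from math import log, log2

# probability of the event
p = 0.5

# base-2 logarithm gives bits, natural logarithm gives nats
h_bits = -log2(p)
h_nats = -log(p)

print('information: %.3f bits' % h_bits)
print('information: %.3f nats' % h_nats)

# dividing by ln(2) converts nats back to bits
print('nats converted to bits: %.3f' % (h_nats / log(2)))
```

The two measures differ only by a constant factor, so the choice of base changes the units but not the ordering of events by surprise.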

## Calculate the Information for a Random Variable

We can also quantify how much information there is in a random variable.

For example, if we wanted to calculate the information for a random variable *X* with probability distribution *p*, this might be written as a function *H()*; for example:

- H(X)

In effect, calculating the information for a random variable is the same as calculating the information for the probability distribution of the events for the random variable.

Calculating the information for a random variable is called “*information entropy*,” “*Shannon entropy*,” or simply “*entropy*.” It is related to the idea of entropy from physics by analogy, in that both are concerned with uncertainty.

The intuition for entropy is that it is the average number of bits required to represent or transmit an event drawn from the probability distribution for the random variable.

… the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits […] needed on average to encode symbols drawn from a distribution P.

— Page 74, Deep Learning, 2016.

Entropy can be calculated for a random variable *X* with *K* discrete states as follows:

- H(X) = -sum(i=1 to K p(i) * log(p(i)))

That is the negative of the sum of the probability of each event multiplied by the log of the probability of each event.

Like information, the *log()* function uses base-2 and the units are bits. A natural logarithm can be used instead and the units will be *nats*.
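The formula can be sketched as a small general-purpose function. The name `entropy_bits` and the example distribution are assumptions made for illustration. The function skips zero-probability states, following the common convention that they contribute nothing to the sum, and it uses log2(1/p) in place of -log2(p), which is algebraically identical.

```python
# sketch: entropy in bits for an arbitrary discrete distribution
from math import log2

def entropy_bits(probs):
    """Return the entropy, in bits, of a list of probabilities
    (hypothetical helper for illustration)."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError('probabilities must sum to 1')
    # states with zero probability are skipped (treated as contributing 0)
    return sum(p * log2(1.0 / p) for p in probs if p > 0.0)

# an example distribution over four states, chosen for illustration
probs = [0.5, 0.25, 0.125, 0.125]
print('entropy: %.3f bits' % entropy_bits(probs))
```

For this example distribution the result is 1.750 bits, which is the probability-weighted average of 1, 2, 3, and 3 bits of information for the four states.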

The lowest entropy is calculated for a random variable that has a single event with a probability of 1.0, a certainty. The largest entropy for a random variable will be if all events are equally likely.
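To make the two extremes concrete, the sketch below compares three distributions over four states: a certain outcome, a skewed distribution, and a uniform distribution. The distributions and the helper name `entropy_bits` are illustrative assumptions rather than examples from the text.

```python
# sketch: entropy for certain, skewed, and uniform distributions
from math import log2

def entropy_bits(probs):
    """Entropy in bits; zero-probability states are skipped (hypothetical helper)."""
    return sum(p * log2(1.0 / p) for p in probs if p > 0.0)

# three assumed distributions over the same four states
distributions = {
    'certain': [1.0, 0.0, 0.0, 0.0],
    'skewed': [0.7, 0.1, 0.1, 0.1],
    'uniform': [0.25, 0.25, 0.25, 0.25],
}

for name, probs in distributions.items():
    print('%s: %.3f bits' % (name, entropy_bits(probs)))
```

The certain distribution has zero entropy, the uniform distribution reaches the maximum of log2(4) = 2 bits for four states, and the skewed distribution falls in between.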

We can consider a roll of a fair die and calculate the entropy for the variable. Each outcome has the same probability of 1/6, therefore it is a uniform probability distribution. We would therefore expect the average information to be the same as the information for a single event calculated in the previous section.

```python
# calculate the entropy for a dice roll
from math import log2
# the number of events
n = 6
# probability of one event
p = 1.0 / n
# calculate entropy
entropy = -sum([p * log2(p) for _ in range(n)])
# print the result
print('entropy: %.3f bits' % entropy)
```

Running the example calculates the entropy as more than 2.5 bits, which is the same as the information for a single outcome. This makes sense: because all outcomes are equally likely, each one carries the same information, so the average is equal to the information of any single outcome.

```
entropy: 2.585 bits
```

If we know the probability for each event, we can use the entropy() SciPy function to calculate the entropy directly.

For example:

```python
# calculate the entropy for a dice roll
from scipy.stats import entropy
# discrete probabilities
p = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]
# calculate entropy
e = entropy(p, base=2)
# print the result
print('entropy: %.3f bits' % e)
```

Running the example reports the same result that we calculated manually.

```
entropy: 2.585 bits
```

Calculating the entropy for a random variable provides the basis for other measures such as mutual information (information gain).

It also provides the basis for calculating the difference between two probability distributions with cross-entropy and the KL-divergence.
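As a minimal sketch of how these measures relate to entropy, the snippet below uses the standard textbook definitions of cross-entropy and KL divergence, which are not spelled out in this post. The function names and the two example distributions are hypothetical, and the code assumes the second distribution assigns a non-zero probability wherever the first does.

```python
# sketch: cross-entropy and KL divergence between two discrete distributions
from math import log2

def entropy_bits(p):
    """Entropy of p in bits (hypothetical helper)."""
    return sum(pi * log2(1.0 / pi) for pi in p if pi > 0.0)

def cross_entropy_bits(p, q):
    """Cross-entropy H(P, Q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_divergence_bits(p, q):
    """KL divergence KL(P || Q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# two assumed distributions over the same three events
p = [0.1, 0.4, 0.5]
q = [0.8, 0.15, 0.05]

h = entropy_bits(p)
ce = cross_entropy_bits(p, q)
kl = kl_divergence_bits(p, q)

print('H(P): %.3f bits' % h)
print('H(P, Q): %.3f bits' % ce)
print('KL(P || Q): %.3f bits' % kl)
```

Under these definitions, the cross-entropy equals the entropy of P plus the KL divergence from P to Q, so the divergence can be read as the extra bits needed when events from P are encoded using Q.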

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Chapters

- Section 2.8: Information theory, Machine Learning: A Probabilistic Perspective, 2012.
- Section 1.6: Information Theory, Pattern Recognition and Machine Learning, 2006.
- Section 3.13: Information Theory, Deep Learning, 2016.

### Articles

- Entropy (information theory), Wikipedia.
- Information gain in decision trees, Wikipedia.
- Information gain ratio, Wikipedia.

## Summary

In this post, you discovered a gentle introduction to information entropy.

Specifically, you learned:

- Information theory is concerned with data compression and transmission; it builds upon probability and supports machine learning.
- Information provides a way to quantify the amount of surprise for an event measured in bits.
- Entropy provides a measure of the average amount of information needed to represent an event drawn from a probability distribution for a random variable.

The post A Gentle Introduction to Information Entropy appeared first on Machine Learning Mastery.