Source: Deep Learning on Medium
This article covers the content discussed in the Information Theory module of the Deep Learning course and all the images are taken from the same module.
In the previous article, we discussed how in the context of classification we can represent true output and the predicted output as a probability distribution. In this article, we will see a new way to compute the difference between the two distributions or to say compute the loss value for the distributions.
Once again, think of a situation where we have 4 teams A, B, C and D and there is random variable X which denotes the winning team.
And say based on some past experience, we come up with the probability distribution of these teams winning the tournament as below:
Let suppose the output for you associated with the winning of each of the teams is as follows(if team A wins, you get 10k from a bet, if team B wins, you get 2k, if team C wins you lose some 8k, if team D wins, you get 5k):
So, there is some probability associated with each of the random variables and some profit/loss(Gain) associated with each of the possible values of the random variables. The expected return we can compute as a summation of the product of the probability and the associated gain for all the values that the random variable can take:
So, the above value is the Expected Return value as per the probability distribution and the Gain/Loss associated with each of the values that the random variable can take.
The intuition behind Information Content:
Let’s take an event which is where does the sun rises so let’s say the random variable is where does the sunrise? And it can take 4 values: East, West, North, South.
And now if we tell that the Sun rose in the East today, then there is nothing much information gain because this is a certain/sure event that the sun rises in the East which happens with probability 1. So, there is no information that we are gaining from this.
Let’s take another event. Suppose we tell you today there is going to be a Storm, so let’s say the random variable Y can take on the value of Storm or No Storm.
Now if we get to know that today there is going to be a storm, in this case, the information gain would be very high because regularly there is no storm. A storm is something like a very very rare event and if we tell you about the rare event, then the information gain is really high. There is a lot of surprise for you and the surprise leads to information gain.
Based on the above, we could say Information Gain is directly proportional to the Surprise in the event.
Now since we are discussing probability, so the surprise equivalent in probability can be deduced as: A surprising event is the one which has a low probability. So, low probability means high surprise. So, we can say Information Content is inversely proportional to the Probability of the event.
The more surprising the event, the less the probability and the less the probability the more information we gain by knowing about it.
Now let’s consider another scenario in which a tournament is going on and the random variable X tells which team is going to win from A, B, C, or D and there is this another random variable Y which tells whether the A.C in this room is ON or OFF.
Now suppose we tell you that team B won and the A.C in this room is ON. The first thing to note here is that these two random variables are independent, A.C being on or off in this room has no bearing on the match outcome and similarly who is going to win the match has no bearing on whether the A.C in this room is on or off. So, these are completely independent events. So, if we tell about two independent contents, then what is Information Content going to be:
We are gaining some information by knowing that B has won the match and we are also gaining some information by knowing that A.C is on. So, the total information content by knowing both of these(independent events) should just be the summation of the individual information content(we know that the information gain is a function of the probability of the event and is inversely proportional to probability)
So, now we have this interesting situation, we have this Information Content as a function such that it satisfies the below criteria:
The above criteria is satisfied by the Logarithmic family of functions. So, in essence, we have:
which we can re-write as:
Let’s say we have a random variable X which can take on values A, B, C, D.
Now based on the above table, we can say that the Expected Gain is given by:
Entropy(H(x)) is the expected information content of a random variable and is given as:
Relation to Number of Bits required to transmit a message
Suppose there is this X(message) that we would like to transmit and the message can take on 4 values A, B, C, or D.
We can use 2 bits to transmit 4 messages:
So, in all, for every message we are transmitting, we are using 2 bits.
In this case, we are assuming that all of these messages are equally likely that is their probability is 1/4. So, this is the distribution that we are assuming for Random Variable taking on any of the 4 values it can take.
Let’s see the information content for each of the messages:
Now we can make this connection that the number of bits required to transmit a message is equal to the Information Content of that message.
Let’s consider this for 8 messages(A, B, C, D,…., H)
Now if we want to send a continuous stream of messages and we want to minimize the number of bits required to do so. Let’s say we have a different distribution, say message A is the most frequent message and so on.
If we use less no. of bits to send the message A(most frequent) then it’s wiser because on average we will be using fewer bits as A is the most frequent message. The number of bits required to send each of the messages would be(assuming the probability/frequency of each message as in the below image):
The only way we can say this strategy is better is if on average we end up using less no. of bits(even if we are using 3 bits for less frequent messages)
Here, Information Content is same as no. of bits, so if we want the average number of bits it would be:
So, the entropy of a random variable tells us the average number of bits required to send that random variable.
KL Divergence and Cross-Entropy
Let’s say X is the random variable and y is the true distribution of the random variable
Now think of this case where there is some source which is transmitting a lot of messages from one source to one destination and there is some true probability associated with these messages given by y and we don’t know this true distribution in advance and you are looking at some of these messages and you are estimating a y_hat(predicted distribution), so let’s say the predicted distribution looks like:
The true entropy of the random variable would be:
But we have predicted some distribution/probability for the random variable and as per that the information content would look like:
The actual messages at the destination would come as per the true distribution y, so the predicted no. of bits that we end up using is going to be:
the value associated with each of the values that the random variable can take is
because that is what we have estimated but the messages actually are going to come by probability as per true distribution.
So, if we knew the true distribution, then the number of bits that we would require would have been
but we didn’t know the true distribution, so we ended up using an estimated distribution and then the no. of bits required was
The above is the cross-entropy between two distribution y and y_hat.
So, now we have the bits required to transmit a message as per the true distribution and as per the estimated distribution and we can now compute the difference(this difference is known as KL Divergence) between these two as:
And this KL Divergence provides a way of computing the difference between two distributions.