Source: Deep Learning on Medium

Cross entropy loss intuition
Given an input and a set of classes, predict the class to which the input belongs: this is a classification problem. A model for probabilistic classification maps each input to a probability distribution over the classes. For example, if we have 3 classes {(A)irplane, (B)oat, (C)ar}, our model might accept an image as a single observation and return three numbers representing the probabilities of classes A, B and C. During training, we might feed it an image of a car, hoping that the model produces predictions that are close to the true (observed) class probabilities t = [0, 0, 1].

If our model predicts a different distribution, say p = [0.5, 0.3, 0.2], then we would like to modify the model parameters so that the predictions p get closer to the true (observed) probabilities t. This is what training does: it repeatedly adjusts the model's parameters so that the predictions move closer and closer to the true probabilities. To do so, we first have to understand how to quantify the distance between two distributions like p and t.

A possible loss function
Consider a classification problem for which we have some fixed model (the hypothesis) that gives predictions with respect to the disjoint classes

$$\{c_1, c_2, \dots, c_n\}.$$

The model predicts that the probability that the single input belongs to c_1 is p_1, the probability that it belongs to c_2 is p_2, and so on. Therefore, the model returns a distribution

$$p = (p_1, p_2, \dots, p_n), \qquad p_i \ge 0, \quad \sum_{i=1}^{n} p_i = 1.$$

This is our hypothesis.

But what happens in the real world? The observed data (the single instance we feed as input) belongs to some class c_j; hence the observed data determine a one-hot distribution

$$t = (t_1, t_2, \dots, t_n), \qquad t_j = 1, \quad t_i = 0 \ \text{for } i \ne j.$$

This vector has all entries equal to zero except the one corresponding to the correct class (the j-th), which is equal to one (more generally, t_i is the number of instances of class c_i). The likelihood of a model is the probability of the data given the model:

$$\mathcal{L}(\text{model} \mid \text{data}) = P(\text{data} \mid \text{model}) = \prod_{i=1}^{n} p_i^{t_i}.$$

If the likelihood of model 1 is larger than the likelihood of model 2, then model 1 is more plausible than model 2 given the observed data.
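To make the comparison concrete, here is a minimal Python sketch. It assumes the likelihood of a single observation takes the product form ∏ p_i^{t_i}; the two prediction vectors `model_1` and `model_2` are hypothetical numbers chosen for illustration, and the observed instance is assumed to be a car (class C).

```python
from math import prod

def likelihood(p, t):
    """Likelihood of the observed counts t under predictions p: prod_i p_i ** t_i."""
    return prod(p_i ** t_i for p_i, t_i in zip(p, t))

# One observed instance of class C, written as a one-hot vector over (A, B, C).
t = [0, 0, 1]

# Hypothetical predictions of two competing models (illustrative numbers only).
model_1 = [0.1, 0.2, 0.7]
model_2 = [0.5, 0.3, 0.2]

lik_1 = likelihood(model_1, t)
lik_2 = likelihood(model_2, t)
print(lik_1, lik_2)
```

Under these assumed numbers `model_1` has the larger likelihood, because it assigns more probability to the class that was actually observed.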

If (as in our case) t is a one-hot vector, then the likelihood is simply

$$\mathcal{L} = p_s,$$

where s is the index of the correct class, that is, the s-th component of t is 1.

To avoid very small numbers (a product of many probabilities quickly gets too close to zero when there are many observations) and to turn products into sums, we can take the logarithm of the likelihood. We also change the sign to get non-negative values; in this form we seek small values (the lower the better):

$$-\log \mathcal{L} = -\log p_s.$$
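The numerical motivation can be seen in a short sketch. The data here are hypothetical: 2000 independent observations, to each of which the model is assumed to assign probability 0.4 for the correct class.

```python
import math

# Hypothetical setup: 2000 independent observations, each given
# probability 0.4 for its correct class by the model.
probs = [0.4] * 2000

# Direct product of the probabilities: far too small for a 64-bit float.
product = 1.0
for q in probs:
    product *= q

# Negative log-likelihood: the same information as a sum of moderate numbers.
neg_log_likelihood = -sum(math.log(q) for q in probs)

print(product)             # underflows to 0.0
print(neg_log_likelihood)  # a finite, positive number
```

The product underflows to zero, so two models could no longer be compared through it, while the sum of logarithms stays well within floating-point range.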

More generally, when t is not a one-hot vector, this last expression reads

$$-\log \mathcal{L} = -\sum_{i=1}^{n} t_i \log p_i,$$

where t_i and p_i are the i-th components of t and p.

We define the cross entropy loss (also written cross-entropy loss) as

$$H(t, p) = -\sum_{i=1}^{n} t_i \log p_i.$$

Cross entropy is a measure of the difference between two probability distributions.
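A minimal sketch of this definition in Python, assuming the natural logarithm and the usual convention that terms with t_i = 0 contribute nothing. The function name `cross_entropy` and all numbers below are illustrative assumptions, not taken from the article.

```python
import math

def cross_entropy(t, p):
    """H(t, p) = -sum_i t_i * log(p_i), natural log; terms with t_i == 0 are skipped."""
    return -sum(t_i * math.log(p_i) for t_i, p_i in zip(t, p) if t_i > 0)

# Hypothetical prediction over three classes.
p = [0.25, 0.25, 0.5]

# One-hot target: the sum collapses to -log of the probability of the true class.
one_hot = [0, 0, 1]
loss_one_hot = cross_entropy(one_hot, p)

# Hypothetical non-one-hot ("soft") target: every class contributes a term.
soft = [0.1, 0.2, 0.7]
loss_soft = cross_entropy(soft, p)

print(loss_one_hot, loss_soft)
```

Skipping the zero-weight terms also avoids evaluating log(0) when the model assigns zero probability to a class that does not occur in the target.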

Example
In machine learning, the true (observed) distribution, the one that the learning algorithm is trying to match, is frequently expressed as a one-hot distribution. Suppose that, out of the possible labels A, B, and C, the label of a specific training instance is C. The one-hot distribution for this training instance is therefore:

$$t = (t_A, t_B, t_C) = (0, 0, 1).$$

Suppose that your machine learning model predicts the following probability distribution:

How close is the predicted probability distribution p to the true distribution t? That is what the cross-entropy loss measures. Using the formula, we get:
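As a worked illustration with assumed numbers: take t = (0, 0, 1) for label C and, as a stand-in prediction, p = (0.5, 0.3, 0.2), the example prediction from the introduction (an assumption here, not necessarily the distribution intended in this example). With the natural logarithm, the cross entropy formula gives

$$H(t, p) = -\left(0 \cdot \ln 0.5 + 0 \cdot \ln 0.3 + 1 \cdot \ln 0.2\right) = -\ln 0.2 = \ln 5 \approx 1.609.$$

Only the term of the correct class contributes, so the loss depends solely on the probability that the model assigns to C.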

The cross entropy loss is greater than or equal to zero and, for a fixed t, its minimum is achieved when p = t (a simple consequence of Gibbs’ inequality), that is, when the machine learning model exactly predicts the true distribution. In the case of one-hot true distributions (as in the above example), this minimum value is zero. In the general case, the minimum value, attained when p = t, is equal to the entropy of t (with the convention 0 log 0 = 0):

$$H(t) = -\sum_{i=1}^{n} t_i \log t_i.$$
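These properties can be checked numerically on a small example. The sketch below assumes the natural logarithm; the target `t` and the candidate predictions are hypothetical values chosen for illustration.

```python
import math

def cross_entropy(t, p):
    """H(t, p) = -sum_i t_i * log(p_i); terms with t_i == 0 are skipped."""
    return -sum(t_i * math.log(p_i) for t_i, p_i in zip(t, p) if t_i > 0)

def entropy(t):
    """H(t) = H(t, t), the cross entropy of a distribution with itself."""
    return cross_entropy(t, t)

# Hypothetical non-one-hot target and a few candidate predictions;
# the first candidate is equal to the target itself.
t = [0.1, 0.2, 0.7]
candidates = [
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
    [0.5, 0.3, 0.2],
    [1 / 3, 1 / 3, 1 / 3],
]
losses = [cross_entropy(t, p) for p in candidates]

# For a one-hot target, the loss at p = t is zero.
one_hot_min = abs(cross_entropy([0, 0, 1], [0, 0, 1]))

print(entropy(t), losses, one_hot_min)
```

Every candidate has a loss at least as large as the entropy of `t`, and the smallest loss in the list belongs to the candidate that equals `t`.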


Cross entropy is a loss function. Loss functions (usually written as J or L) are minimized with gradient descent, an iterative algorithm that drives the model parameters towards their optimal values.
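A toy sketch of this idea follows. It makes several assumptions that are not in the article: the model's only parameters are three logits `z`, predictions are obtained with a softmax, and the learning rate and number of steps are arbitrary. For this softmax parameterization, the gradient of the cross entropy with respect to the logits is p - t.

```python
import math

def softmax(z):
    """Turn logits into a probability distribution (shifted by max(z) for stability)."""
    m = max(z)
    exps = [math.exp(z_i - m) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(t, p):
    """H(t, p) = -sum_i t_i * log(p_i); terms with t_i == 0 are skipped."""
    return -sum(t_i * math.log(p_i) for t_i, p_i in zip(t, p) if t_i > 0)

t = [0, 0, 1]          # true class is C
z = [0.0, 0.0, 0.0]    # hypothetical parameters: one logit per class
learning_rate = 0.5    # arbitrary choice for this sketch

losses = []
for step in range(100):
    p = softmax(z)
    losses.append(cross_entropy(t, p))
    # Gradient of the loss with respect to the logits is p - t.
    grad = [p_i - t_i for p_i, t_i in zip(p, t)]
    # Gradient descent update: move the parameters against the gradient.
    z = [z_i - learning_rate * g for z_i, g in zip(z, grad)]

print(losses[0], losses[-1])
```

The loss starts at log 3 (a uniform prediction over three classes) and shrinks as the predicted distribution moves towards t.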
