Original article was published on Deep Learning on Medium
The implementation of the softmax differentiation requires us to iterate through the list of neurons and differentiate with respect to each neuron. Hence two loops are involved. Keep in mind that the purpose of these implementations is not to be performant, but rather to explicitly translate the math and arrive at the same results achieved by the built-in methods of Pytorch.
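As a sketch of what those two loops look like, here is a minimal plain-Python version (the function names are my own, not from the article's code) that differentiates softmax with respect to each input neuron, using the identity ds_i/dz_j = s_i * (delta_ij - s_j):

```python
import math

def softmax(z):
    # exponentiate each neuron and normalize by the sum
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    # ds_i/dz_j = s_i * (delta_ij - s_j): one loop over the output
    # neurons, one loop over the neurons we differentiate against
    s = softmax(z)
    n = len(s)
    jac = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            delta = 1.0 if i == j else 0.0
            jac[i][j] = s[i] * (delta - s[j])
    return jac
```

Because softmax outputs sum to 1, each row of this Jacobian sums to 0, which is a handy sanity check against Pytorch's autograd.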
In the sequence of operations involved in a neural network, softmax is generally followed by the cross-entropy loss. In fact, the two functions are so closely connected that in Pytorch the method cross_entropy combines both functions in one.
I remember my first impression when I saw the formula for the cross-entropy loss. It was close to admiring hieroglyphs. After deciphering it, I hope you will share my awe towards how simple ideas can sometimes have the most complex representations.
The variables involved in calculating the cross-entropy loss are Z, p, y, m, and K. Both i and k are used as counters to iterate from 1 to m and from 1 to K respectively.
- Z: is an array where each row represents the output neurons of one instance.
- m: is the number of instances.
- K: is the number of classes.
- p: is the probability, computed by the neural network, that instance i belongs to class k. This is the same probability computed from softmax.
- y: is the label of instance i. It is either 1 or 0 depending on whether instance i belongs to class k or not.
- log: is the natural logarithm.
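Putting these variables together, the cross-entropy loss can be written as (this is the standard formula, reconstructed from the definitions above):

```latex
L = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_{ik} \, \log(p_{ik})
```

Reading it right to left: take the log of each probability, multiply by the label, sum over the classes, sum over the instances, and divide by the number of instances — exactly the steps walked through below.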
Let’s say we are performing a multi-class classification task where the number of possible classes is three (K=3). Each instance can only belong to one class. Therefore each instance is assigned a vector of labels with two zeros and a one. For example y=[0,0,1] means that the instance belongs to class 2. Similarly, y=[1,0,0] means that the instance belongs to class 0. The index of the 1 refers to the class to which the instance belongs. We say that the labels are one-hot encoded.
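As a quick illustration, one-hot encoding fits in a single line of Python (the helper name is my own):

```python
def one_hot(class_index, num_classes):
    # a vector of zeros with a 1 at the position of the target class
    return [1 if k == class_index else 0 for k in range(num_classes)]

# one_hot(2, 3) -> [0, 0, 1]   (instance belongs to class 2)
# one_hot(0, 3) -> [1, 0, 0]   (instance belongs to class 0)
```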
Now let’s take two instances (m=2). We calculate their z values and we find: Z = [[0.1, 0.4, 0.2], [0.3, 0.9, 0.6]]. Then we calculate their softmax probabilities and find: Activations = [[0.29, 0.39, 0.32], [0.24, 0.44, 0.32]]. We know that the first instance belongs to class 2, and the second instance belongs to class 0, because: y = [[0,0,1],[1,0,0]].
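These activations can be reproduced with a few lines of plain Python — rounding to two decimals recovers the values above:

```python
import math

def softmax(row):
    # exponentiate and normalize one row of Z
    exps = [math.exp(v) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

Z = [[0.1, 0.4, 0.2], [0.3, 0.9, 0.6]]
activations = [softmax(row) for row in Z]
# each row is now a probability distribution over the K=3 classes
# rounded: [[0.29, 0.39, 0.32], [0.24, 0.44, 0.32]]
```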
To calculate cross-entropy:
1. We take the log of the softmax activations: log(activations) = [[-1.24, -0.94, -1.14], [-1.43, -0.83, -1.13]].
2. We multiply by -1 to get the negative log: -log(activations) = [[1.24, 0.94, 1.14], [1.43, 0.83, 1.13]].
3. Multiplying -log(activations) by y gives: [[0., 0., 1.14], [1.43, 0., 0.]].
4. The sum over all classes gives: [[0.+0.+1.14], [1.43+0.+0.]] = [[1.14], [1.43]].
5. The sum over all instances gives: [1.14+1.43] = [2.57].
6. The division by the number of instances gives: [2.57 / 2] = [1.285].
- Steps 3 and 4 are equivalent to simply retrieving the negative log of the target class.
- Steps 5 and 6 are equivalent to calculating the mean.
- The loss is equal to 1.14 when the neural network predicted that the instance belongs to the target class with a probability of 0.32.
- The loss is equal to 1.43 when the neural network predicted that the instance belongs to the target class with a probability of 0.24.
- We can see that in both instances the network failed to give the highest probability to the correct class. But the network assigned an even lower probability to the correct class of the second instance (0.24) than to that of the first (0.32). Consequently, the second instance was penalized with the higher loss of 1.43.
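The steps above can be sketched in plain Python, starting from the (rounded) activations of the example:

```python
import math

activations = [[0.289, 0.391, 0.320], [0.240, 0.437, 0.324]]
y = [[0, 0, 1], [1, 0, 0]]  # one-hot labels: class 2, then class 0

# take the negative log of every activation
neg_logs = [[-math.log(p) for p in row] for row in activations]

# multiply by y, then sum over the classes: only the target
# class of each instance survives
per_instance = [sum(yk * nl for yk, nl in zip(yr, nr))
                for yr, nr in zip(y, neg_logs)]
# per_instance rounds to [1.14, 1.43]

# sum over the instances and divide by their number
loss = sum(per_instance) / len(per_instance)
```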
We combine the above steps and observations in our implementation of cross-entropy. As usual, we will also go through the Pytorch equivalent method, before comparing both outputs.
Note: Instead of storing the one-hot encoding of the labels, we simply store the index of the 1. For example, the previous y becomes [2,0]. Notice that at index 0 the value of y is 2, and at index 1 the value of y is 0. Using the indices of y and their values, we can directly retrieve the negative logs for the target classes. This is done by accessing -log(activations) at row 0, column 2, and at row 1, column 0. This allows us to avoid the wasteful multiplications and additions of zeros in steps 3 and 4. This trick is called integer array indexing and is explained by Jeremy Howard in his Deep Learning From The Foundations lecture 9 at 34:57.
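A minimal sketch of this trick in plain Python (the function name is my own; with NumPy or Pytorch tensors the same lookup collapses to a single integer-array indexing expression such as neg_logs[range(m), y]):

```python
import math

activations = [[0.289, 0.391, 0.320], [0.240, 0.437, 0.324]]
y = [2, 0]  # index of the target class for each instance

def cross_entropy(activations, y):
    # pick the activation of the target class directly, instead of
    # multiplying by a one-hot vector and summing away the zeros
    neg_logs = [-math.log(activations[i][y[i]]) for i in range(len(y))]
    # the mean of the per-instance losses
    return sum(neg_logs) / len(neg_logs)
```

Note that Pytorch's built-in cross_entropy goes one step further: it takes the raw Z values (not the softmax activations) together with the integer labels, and applies log-softmax internally for numerical stability.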