Demystifying Focal Loss I: A More Focused Version of Cross Entropy Loss


Focal loss vs. cross entropy loss

I will restrict the discussion to binary classification here for simplicity, because the extension to multiple classes is straightforward.

Figure 2: binary cross entropy loss
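
In equation form, the two cases are:

$$
\mathrm{CE}(p, y) =
\begin{cases}
-\log(p) & \text{if } y = 1 \\
-\log(1 - p) & \text{if } y = 0
\end{cases}
$$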

Figure 2 shows the binary cross entropy loss function, in which p is the predicted probability and y is the label with value 1 or 0. The p value here is the probability of label 1: if p is 0.7, it means the predicted pixel has a 0.7 probability of being 1 and a 0.3 probability of being 0. If we summarize the two equations into one, we get the following equation:

Figure 3: binary cross entropy rewritten in terms of pt
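
$$
\mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t)
$$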

in which p and 1-p are replaced by pt, as defined in Figure 4.

Figure 4: definition of pt
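
$$
p_t =
\begin{cases}
p & \text{if } y = 1 \\
1 - p & \text{otherwise}
\end{cases}
$$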

When analyzed in detail: if the label y is 1, then pt is p, and a bigger pt corresponds to a bigger p, which is nearer to label 1. If the label y is 0, then pt is 1-p, and a bigger pt corresponds to a smaller p, which is nearer to label 0. As a result, pt represents the accuracy of the prediction: the bigger pt is, the better the prediction.

Don't forget our objective: decreasing the loss of a pixel if its prediction is already good. Since pt is a measure of prediction accuracy, why not use it to decrease the loss? This naturally introduces the following focal loss function [Lin et al. 2018]:

Figure 5: focal loss
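
$$
\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)
$$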

As Figure 5 shows, focal loss is just an evolved version of cross entropy loss, obtained by multiplying in the extra factor alpha t · (1-pt)^gamma. Here, 1-pt is used as a factor to decrease the original cross entropy loss, with the help of two additional hyperparameters: alpha t and gamma. As discussed above, if pt gets bigger and approaches 1, then 1-pt approaches 0, and the original cross entropy loss is decreased by a large amount. If pt gets smaller and approaches 0, then 1-pt approaches 1, and the original cross entropy loss is only trivially decreased.
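
To make this concrete, here is a minimal NumPy sketch of a per-pixel binary focal loss. The function name, argument layout, and defaults are my own choices for illustration, not code from the paper:

```python
import numpy as np

def binary_focal_loss(p, y, gamma=2.0, alpha=1.0, eps=1e-7):
    """Per-pixel focal loss: FL(pt) = -alpha * (1 - pt)**gamma * log(pt).

    p:     predicted probability of label 1 (any array shape)
    y:     ground-truth labels (1 or 0), same shape as p
    gamma: focusing parameter; gamma = 0 recovers plain cross entropy
    alpha: the weighting factor alpha t, kept constant here for simplicity
    """
    p = np.clip(p, eps, 1.0 - eps)        # avoid log(0)
    pt = np.where(y == 1, p, 1.0 - p)     # pt = p if y == 1 else 1 - p
    return -alpha * (1.0 - pt) ** gamma * np.log(pt)

# Three pixels with label 1: easy (p = 0.9), medium (p = 0.6), hard (p = 0.1).
p = np.array([0.9, 0.6, 0.1])
y = np.array([1, 1, 1])
print(binary_focal_loss(p, y, gamma=0.0))  # plain CE, roughly [0.105, 0.511, 2.303]
print(binary_focal_loss(p, y, gamma=2.0))  # focal loss, roughly [0.001, 0.082, 1.865]
```

With gamma = 2, the easy pixel's loss shrinks by a factor of 100 while the hard pixel's loss is barely touched, which is exactly the behavior described above.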

Figure 6: experimental results from the original paper

In the experiments of the original paper, the authors tested different gamma values with alpha t set to 1, as shown in Figure 6. We can easily see that the higher gamma is, the lower the training loss becomes, especially for well-predicted examples. By leveraging focal loss, the model can focus on training the pixels that have not yet been well trained, which is more effective and purposeful.
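
To see the effect of gamma numerically, the sketch below (my own illustration, not the paper's experiment) compares how much of the original cross entropy loss survives the (1-pt)^gamma factor for an easy pixel (pt = 0.9) and a hard pixel (pt = 0.3):

```python
# Fraction of the original cross entropy loss kept after
# multiplying by the modulating factor (1 - pt)**gamma.
for gamma in [0.0, 0.5, 1.0, 2.0, 5.0]:
    easy = (1.0 - 0.9) ** gamma   # well-predicted pixel, pt = 0.9
    hard = (1.0 - 0.3) ** gamma   # poorly predicted pixel, pt = 0.3
    print(f"gamma={gamma}: easy keeps {easy:.5f}, hard keeps {hard:.5f}, "
          f"ratio hard/easy = {hard / easy:.1f}")
```

At gamma = 2 the easy pixel keeps only 1% of its loss while the hard pixel keeps 49%, so as gamma grows, the total loss drops and the gradient is increasingly dominated by the poorly predicted pixels.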