# Demystifying Focal Loss I: A More Focused Version of Cross Entropy Loss

Source: Deep Learning on Medium

Focal loss vs. cross entropy loss

I will restrict the discussion to binary classification here for simplicity, because the extension to multiple classes is straightforward.

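Figure 2 shows the binary cross entropy loss, in which p is the predicted probability and y is the label, with value 1 or 0. Here p is the predicted probability of label 1: if p is 0.7, the predicted pixel has probability 0.7 of being 1 and probability 0.3 of being 0.

$$\mathrm{CE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{if } y = 0 \end{cases}$$

If we summarize the two cases into one, we get the following equation:

$$\mathrm{CE}(p_t) = -\log(p_t), \qquad p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{if } y = 0 \end{cases}$$

in which p and 1-p are replaced by pt (written $p_t$ in the equation), as in Figure 4.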

Looking at this in detail: if label y is 1, then pt is p, and a bigger pt corresponds to a bigger p, which is nearer to label 1. If label y is 0, then pt is 1-p, and a bigger pt corresponds to a smaller p, which is nearer to label 0. As a result, pt represents the accuracy of the prediction: the bigger pt is, the better the prediction.
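To make pt concrete, here is a minimal Python sketch of the definition above and of the cross entropy written in terms of it. The function names `p_t` and `binary_cross_entropy` are illustrative choices, not taken from the paper or from any library, and p is assumed to lie strictly between 0 and 1 so that the logarithm is defined.

```python
import math


def p_t(p: float, y: int) -> float:
    """Probability the model assigns to the true label y (0 or 1).

    p is the predicted probability of label 1.
    """
    return p if y == 1 else 1.0 - p


def binary_cross_entropy(p: float, y: int) -> float:
    """Binary cross entropy expressed through pt: -log(pt)."""
    return -math.log(p_t(p, y))


# Same prediction p = 0.7, scored against both possible labels.
for label in (1, 0):
    print(label, p_t(0.7, label), binary_cross_entropy(0.7, label))
```

With p = 0.7, pt is 0.7 when the label is 1 and about 0.3 when the label is 0, so the same prediction gets a small loss in the first case and a larger one in the second.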

Don't forget our objective: decreasing the loss of a pixel whose prediction is already good. Since pt measures prediction accuracy, why not use it to decrease the loss? This naturally introduces the following focal loss function [Lin et al. 2018]:

$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$

As Figure 5 shows, focal loss is just an evolved version of cross entropy loss, obtained by adding the part enclosed in the red box. Here, 1-pt, raised to the power gamma, is used as a factor to decrease the original cross entropy loss, with the help of two hyperparameters: alpha t and gamma. As discussed above, if pt gets bigger and is close to 1, 1-pt gets smaller and is close to 0, so the original cross entropy loss is greatly reduced. If pt gets smaller and is close to 0, 1-pt gets larger and is close to 1, so the original cross entropy loss is barely reduced.
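The down-weighting effect can be checked numerically. The sketch below assumes the standard form of focal loss from Lin et al., that is, cross entropy multiplied by alpha t and by 1-pt raised to the power gamma. The function names and the default values `gamma=2.0` and `alpha_t=1.0` are picked here only for illustration, and p is again assumed to lie strictly between 0 and 1.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha_t: float = 1.0) -> float:
    """Binary focal loss for one prediction: -alpha_t * (1 - pt)**gamma * log(pt)."""
    pt = p if y == 1 else 1.0 - p
    return -alpha_t * (1.0 - pt) ** gamma * math.log(pt)


def cross_entropy(p: float, y: int) -> float:
    """Plain cross entropy, i.e. focal loss with gamma = 0 and alpha_t = 1."""
    return focal_loss(p, y, gamma=0.0, alpha_t=1.0)


# Ratio of focal loss to cross entropy = the modulating factor (1 - pt)**gamma.
easy_ratio = focal_loss(0.9, 1) / cross_entropy(0.9, 1)  # well-predicted pixel, pt = 0.9
hard_ratio = focal_loss(0.1, 1) / cross_entropy(0.1, 1)  # badly predicted pixel, pt = 0.1
print(f"easy pixel keeps {easy_ratio:.2%} of its cross entropy loss")
print(f"hard pixel keeps {hard_ratio:.2%} of its cross entropy loss")
```

With gamma set to 2, the well-predicted pixel keeps roughly 1% of its cross entropy loss (the factor is 0.1 squared), while the badly predicted pixel keeps roughly 81% (0.9 squared).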

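In the experiments of the original paper, the authors tested different gamma values and set alpha t to 1, as shown in Figure 6. We can easily see that the higher gamma is, the lower the loss at any given pt, and that the reduction is strongest for pixels that are already well predicted. With gamma equal to 0 and alpha t equal to 1, focal loss reduces to the original cross entropy loss. By leveraging focal loss, the model can focus on the pixels that have not been well trained yet, which makes training more effective and purposeful.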
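The effect of gamma can be tabulated with a few lines of Python. This is only a sketch: the gamma values and the three pt values below are illustrative picks, not results from the paper, alpha t is fixed to 1 as in the text, and `focal_loss_pt` is a hypothetical helper name.

```python
import math


def focal_loss_pt(pt: float, gamma: float, alpha_t: float = 1.0) -> float:
    """Focal loss as a function of pt directly: -alpha_t * (1 - pt)**gamma * log(pt)."""
    return -alpha_t * (1.0 - pt) ** gamma * math.log(pt)


gammas = [0.0, 0.5, 1.0, 2.0, 5.0]  # gamma = 0 is plain cross entropy
pts = [0.1, 0.5, 0.9]               # poorly, moderately, and well predicted

# One row per gamma, one column per pt.
table = {g: [focal_loss_pt(pt, g) for pt in pts] for g in gammas}

print("gamma   " + "  ".join(f"pt={pt}" for pt in pts))
for g, losses in table.items():
    print(f"{g:<7} " + "  ".join(f"{v:.4f}" for v in losses))
```

Reading down any column, the loss shrinks as gamma grows. Reading across a row, the shrinkage is much stronger at pt = 0.9 than at pt = 0.1, which is what shifts the training signal toward the poorly predicted pixels.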