Understanding Focal Loss for Pixel-level Classification in Convolutional Neural Networks


A distance-aware cross entropy loss

Because of the issues with the standard cross entropy loss discussed above, tasks with highly imbalanced labels, such as facial point detection [Sun 2013] and human pose detection [Newell 2016], adopted the mean squared error (MSE) loss for training. The label images are shown in Figure 3.

Figure 3: facial point detection (above), and human pose detection (below)

MSE works by minimizing the difference between the real and predicted key-point positions in a pixel-level regression framework. The MSE loss function is shown in Figure 4.

Figure 4: MSE loss function
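As a minimal sketch of the idea, the pixel-level MSE loss can be written in a few lines of NumPy. The heatmap shapes and values here are illustrative, not taken from any particular paper:

```python
import numpy as np

def mse_loss(pred, target):
    """Mean squared error between predicted and ground-truth heatmaps.

    pred, target: float arrays of shape (H, W) with values in [0, 1].
    """
    return np.mean((pred - target) ** 2)

# Toy example: a 4x4 label heatmap with a single key point at (1, 2).
target = np.zeros((4, 4))
target[1, 2] = 1.0
pred = np.full((4, 4), 0.1)  # a flat, uninformative prediction
print(mse_loss(pred, target))
```

Note how the loss averages over every pixel equally: the single foreground pixel contributes only one of the 16 squared terms, which is exactly the imbalance problem discussed above.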

The issue with this approach is that the predictions contain only the positions of pixels; their semantic information is lost. If someone wants to use the predicted key points for post-processing, she has to work out the intrinsic relationships among the points in advance, which can be complicated.

Therefore, in such tasks with highly imbalanced labels, can we train a model to learn the positions of foreground labels effectively while preserving their semantic information?

As discussed above, the answer is yes! We can modify the standard cross entropy loss in two steps to obtain such a loss function. First, modify it into a focal loss [Lin 2017]. Second, make the focal loss aware of the distance to foreground labels. I call the result the distance-aware cross entropy loss [Law 2018].

Figure 5: focal loss [Lin 2017]

In the first step, the standard cross entropy loss is modified into the focal loss shown in Figure 5. In the equation, pt measures prediction accuracy: the higher pt is, the more accurate the prediction. Since pt lies between 0 and 1, the modulating factor (1 − pt)^γ shrinks the standard cross entropy loss when the accuracy is already good enough, making the model focus on areas that have not been trained well yet. I recommend reading my last post for more details.
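Here is a minimal NumPy sketch of the binary focal loss from Figure 5 (the function name and the γ = 2 default are my own choices, not fixed by the article):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Per-pixel binary focal loss [Lin 2017].

    p: predicted probability of the foreground class, in (0, 1).
    y: ground-truth label, 0 or 1.
    The modulating factor (1 - pt)**gamma shrinks the loss for
    well-classified pixels (pt close to 1).
    """
    pt = np.where(y == 1, p, 1.0 - p)  # probability of the true class
    return -((1.0 - pt) ** gamma) * np.log(pt + eps)

# A confident correct prediction contributes far less than an uncertain one.
print(focal_loss(np.array(0.9), np.array(1)))  # small loss
print(focal_loss(np.array(0.5), np.array(1)))  # much larger loss
```

Setting gamma=0 recovers the standard cross entropy loss, which makes the role of the modulating factor easy to check.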

In the second step, as discussed above, background pixels within a small radius of a foreground pixel are allowed to be predicted as 1, with the degree of allowance decreasing monotonically with their distance to the foreground pixel. Therefore, for the background pixels with label 0, a distance factor is added that reduces their loss values according to their distance to the nearest foreground pixel with label 1.

Figure 6: distance-aware cross entropy loss [Law 2018]

To obtain that loss function, we split the focal loss into two equations according to the label value (0 or 1), and then add a distance factor ycij as shown in Figure 6. Here, ycij is a Gaussian function centered on a foreground label position in the 2D image space, as shown in Figure 7. Its value at the center is 1 and decreases monotonically with distance from the center. Thus, the focal loss is unchanged for foreground pixels (upper equation in Figure 6). For background pixels, under the influence of the factor 1 − ycij, the focal loss is reduced sharply close to foreground pixels, while remaining almost unchanged far from them (lower equation in Figure 6).
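The two branches of Figure 6 can be sketched as follows. This follows the CornerNet-style formulation [Law 2018], where the ground-truth map y holds the Gaussian values ycij (exactly 1 at foreground pixels); the exponents α = 2 and β = 4 are the values used in that paper, and the function name is my own:

```python
import numpy as np

def distance_aware_focal_loss(p, y, alpha=2.0, beta=4.0, eps=1e-7):
    """Distance-aware focal loss [Law 2018], a sketch.

    p: predicted heatmap, values in (0, 1), shape (H, W).
    y: ground-truth heatmap with Gaussian bumps: exactly 1 at foreground
       pixels, decaying toward 0 with distance (the factor ycij in Figure 6).
    """
    pos = y == 1  # foreground pixels
    # Upper equation: ordinary focal loss at foreground pixels.
    pos_loss = -((1.0 - p) ** alpha) * np.log(p + eps)
    # Lower equation: (1 - y)**beta down-weights background pixels
    # that lie close to a foreground pixel.
    neg_loss = -((1.0 - y) ** beta) * (p ** alpha) * np.log(1.0 - p + eps)
    loss = np.where(pos, pos_loss, neg_loss)
    return loss.sum() / max(pos.sum(), 1)  # normalize by foreground count
```

With a Gaussian-softened ground truth, a background pixel next to a foreground pixel (y close to 1) is penalized far less than one far away (y close to 0), which is exactly the behavior described above.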

Figure 7: Gaussian function in 2D
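The Gaussian factor of Figure 7 is straightforward to generate; a minimal NumPy sketch (the σ = 1 default and function name are illustrative):

```python
import numpy as np

def gaussian_2d(shape, center, sigma=1.0):
    """2D Gaussian map: value 1 at `center`, decaying with distance.

    shape:  (H, W) of the output map.
    center: (row, col) of the foreground pixel.
    """
    h, w = shape
    r, c = center
    ys, xs = np.ogrid[:h, :w]  # broadcastable row/column coordinates
    return np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2.0 * sigma ** 2))

y = gaussian_2d((5, 5), (2, 2), sigma=1.0)
print(y[2, 2])  # 1.0 at the center, smaller everywhere else
```

In practice one such bump is placed at every foreground label, taking the element-wise maximum where bumps overlap.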

As a consequence, with the help of the distance factor, the focal loss can be modified to focus on, and converge the predictions to, the foreground pixels in tasks with highly imbalanced foreground-background labels. The distance factor also refines the traditional pixel-wise discrete cross entropy loss into a continuous, smooth loss, as shown in Figure 7, which converges faster.