Understanding Loss Functions and the Mathematical Intuition Behind Them


A loss function is an essential component of any deep learning problem. First of all, what is a loss function?

A loss function is an evaluation method that tells you how well your model is working. If the model's predictions are completely different from the true values, the loss function outputs a large number. As you make changes to improve your model, the loss function tells you whether the model is actually improving on the given dataset.

A loss function gives an idea of how good our classifier is; it quantifies how happy you are with the existing classifier.

In this discussion we will focus mainly on classification loss functions. Before exploring the different loss functions, we need to understand probability distributions and entropy.

Probability Distribution

A probability distribution is a collection of the probabilities of all the outcomes of an experiment. Each probability in the distribution must be between 0 and 1 (0 and 1 inclusive), and the sum of all the probabilities in the distribution must be equal to 1.

If x, y, z are the probabilities of 3 different outcome classes in an experiment, they must satisfy the following conditions:

x + y + z = 1 and 0≤ x ≤ 1, 0≤ y ≤ 1, 0≤ z ≤ 1

[0.5, 0.89, 0.6] → not a probability distribution (the values sum to 1.99)

[0.05, 0.89, 0.06] → a probability distribution (the values sum to 1)
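
As a quick sanity check, these two conditions are easy to verify with NumPy; the helper name is_probability_distribution below is just illustrative, a minimal sketch rather than a library function.

import numpy as np

def is_probability_distribution(probs, tol=1e-9):
    # every value must lie in [0, 1] and the values must sum to 1
    probs = np.asarray(probs, dtype=float)
    in_range = np.all((probs >= 0) & (probs <= 1))
    sums_to_one = np.isclose(probs.sum(), 1.0, atol=tol)
    return bool(in_range and sums_to_one)

print(is_probability_distribution([0.5, 0.89, 0.6]))   # False, sums to 1.99
print(is_probability_distribution([0.05, 0.89, 0.06])) # True, sums to 1.0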

Entropy

Entropy is a measure of the randomness of information. In statistical terms, entropy measures the disorder or uncertainty of a probability distribution. The higher the entropy, the harder it is to draw any concrete conclusions from that information.

Let’s explore this with an example. Consider the following figure.

Suppose there are three students who have completed their exams and are waiting for the results.

Student A knows that there is a 90% chance he will fail the exam and a 10% chance he will pass.

Student B knows that there is a 70% chance he will fail the exam and a 30% chance he will pass.

But Student C is facing a 50/50 probability of either passing or failing an exam.

Among the 3 students, Student C faces the highest uncertainty, which means the entropy for Student C is higher than for the other two. Let's prove this with a mathematical approach.

[figure-1: outcome sequences and probabilities for Students A, B, and C]

Consider Student A

Probability of pass → 10% and Probability of fail → 90%

You can think of it as an experiment with 10 outcomes, of which 9 are F (fail) and 1 is P (pass).

Student A → FFFFFFFFFP

P(F) = 9/10, P(P) = 1/10

Outcomes of A = [ F, F, F, F, F, F, F, F, F, P]

A similar approach is followed for Student B and Student C (figure-1).

First, take the product of the probabilities of all outcomes for each student.

[figure-2: product of the outcome probabilities for each student]

Now imagine we have many more outcomes: the product of the probabilities becomes a very small number. Most people are more comfortable dealing with sums than with products, and one way to turn a product into a sum is to use the log.

log(ab) = log(a) + log(b)

The log of the product of the probabilities of all the outcomes for each student:

[figure-3: log of the product of the outcome probabilities for each student]

The mean of the resulting log of the product of the outcome probabilities for each student:

[figure-4: mean of the log of the outcome probabilities for each student]

Entropy is the negative of this mean.

[figure-5: entropy (negative of the mean) for each student]
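
To make those figure steps concrete, here is the calculation written out for Student A, using the natural log; the same steps give approximately 0.611 for Student B and 0.693 for Student C, matching the scipy values below.

Product of outcome probabilities for A = (9/10)^9 * (1/10)^1
Log of the product = 9*log(9/10) + 1*log(1/10)
Mean over the 10 outcomes = (9*log(0.9) + log(0.1))/10 = 0.9*log(0.9) + 0.1*log(0.1)
Entropy of A = -(0.9*log(0.9) + 0.1*log(0.1)) ≈ 0.325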

The entropy for Student C is the highest, which means Student C faces the greatest uncertainty among all the students.

Calculating entropy with the help of the scipy module in Python:

from scipy.stats import entropy

entropy_st_a = entropy([9/10, 1/10])
#entropy_st_a = 0.3250829733914482
entropy_st_b = entropy([7/10, 3/10])
#entropy_st_b = 0.6108643020548935
entropy_st_c = entropy([5/10, 5/10])
#entropy_st_c = 0.6931471805599453

What is Cross-Entropy?

Cross-entropy is used as a loss function in deep learning. With entropy, we deal with only one probability distribution. With cross-entropy, we deal with two different probability distributions.

The first distribution consists of true values. The second distribution consists of estimated values.

Cross-entropy is a measure of the difference between two probability distributions.

Let’s walk through Cross-Entropy with an example

Consider a single record which consists of 4 different outcomes

The true probabilities of the 4 outcomes of that record are [0.2,0.3,0.4,0.1]

You estimate the probabilities of the 4 outcomes of that record with the help of some classifier or model.

Your estimated probabilities are [0.4, 0.4, 0.1, 0.1]

True Values t1, t2, t3, t4 = 0.2, 0.3, 0.4, 0.1

Estimated Values p1, p2, p3, p4 = 0.4, 0.4, 0.1, 0.1

Entropy (for a true distribution t) = -Σ t*log(t) → entropy needs only one distribution

Cross-Entropy (for a true distribution t and an estimated distribution p) = -Σ t*log(p) → cross-entropy needs two distributions to compare.

The higher the cross-entropy, the higher the dissimilarity between the two probability distributions.
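
For the record above, a minimal NumPy sketch of these two formulas (the variable names t and p are just illustrative) looks like this:

import numpy as np

t = np.array([0.2, 0.3, 0.4, 0.1])   # true distribution
p = np.array([0.4, 0.4, 0.1, 0.1])   # estimated distribution

entropy_t = -np.sum(t * np.log(t))          # ≈ 1.28
cross_entropy_tp = -np.sum(t * np.log(p))   # ≈ 1.61

The cross-entropy (≈ 1.61) is larger than the entropy of the true distribution (≈ 1.28); the two are equal only when the estimated distribution matches the true one exactly.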

Cross-entropy loss is the most commonly used loss function for classification problems in both machine learning and deep learning.

In classical machine learning, cross-entropy often goes by a different name: log loss.
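
As an illustration (assuming scikit-learn is available; its log_loss function computes exactly this quantity), the log loss on a small, made-up binary example is just the average negative log-probability assigned to the true class:

import numpy as np
from sklearn.metrics import log_loss

y_true = [1, 0, 1]        # true class labels (hypothetical example)
y_prob = [0.9, 0.2, 0.8]  # predicted probability of class 1

loss = log_loss(y_true, y_prob)
# same as np.mean([-np.log(0.9), -np.log(1 - 0.2), -np.log(0.8)]) ≈ 0.1839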

In Deep Learning, there are 3 different types of cross-entropy loss

  1. Binary Cross-Entropy Loss → the cross-entropy used when the number of classes or outcomes in the target is 2
  2. Categorical Cross-Entropy Loss → the cross-entropy used when the number of classes or outcomes in the target is more than 2 and the true values of the outcomes are one-hot encoded
  3. Sparse Categorical Cross-Entropy Loss → the cross-entropy used when the number of classes or outcomes in the target is more than 2 and the true values of the outcomes are not one-hot encoded (they are integer class labels)

We will go through all three of these loss functions.

BINARY CROSS-ENTROPY LOSS

Consider you are dealing with a classification problem involving only two classes and three records

The true classes of those three records are [[1. , 0.], [1. , 0.], [0. , 1.]]

The prediction probabilities of those records are [[.9, .1], [.7 , .3], [.4 , .6]]

Importing NumPy for efficient numerical calculations

import numpy as np

true_values = np.array([[1., 0.], [1., 0.], [0., 1.]])
predictions = np.array([[.9, .1], [.7, .3], [.4, .6]])

# Record 1: true_values[0] = [1., 0.], predictions[0] = [.9, .1]
bce_loss_first = -(1 * np.log(0.9))
#bce_loss_first = 0.10536051565782628

# Record 2: true_values[1] = [1., 0.], predictions[1] = [.7, .3]
bce_loss_second = -(1 * np.log(0.7))
#bce_loss_second = 0.35667494393873245

# Record 3: true_values[2] = [0., 1.], predictions[2] = [.4, .6]
bce_loss_third = -(1 * np.log(0.6))
#bce_loss_third = 0.5108256237659907

We have calculated all the individual losses for the respective records

The final loss, or the resulting cost, is calculated by taking the mean of all the individual losses.

loss = (bce_loss_first + bce_loss_second + bce_loss_third)/3
#loss = 0.3242870277875165

Now we use the same records and the same predictions and compute the cost with the built-in binary cross-entropy loss function in Keras.

import tensorflow as tf
#tensorflow is imported to convert records into tensors
from tensorflow.keras.losses import BinaryCrossentropy
#importing the Keras Binary Cross-Entropy function

bce_loss = BinaryCrossentropy()
m = tf.cast([[1., 0.], [1., 0.], [0., 1.]], tf.float32)
n = tf.cast([[.9, .1], [.7, .3], [.4, .6]], tf.float32)
loss = bce_loss(m, n).numpy()
#loss = 0.32428685

With and without the high-level Keras loss function, we achieved the same result.

Now, based on the intuition we have built, we are going to write our own binary cross-entropy loss function. This is the gist of what we have discussed so far, wrapped in a function.

def binary_cross_entropy(true_values, predictions):
    y_true = tf.cast(true_values, dtype=tf.float32)
    y_pred = tf.cast(predictions, dtype=tf.float32)
    X = tf.multiply(y_true, tf.math.log(y_pred))
    return (-tf.reduce_sum(X) / len(y_true)).numpy()

true_values = [[1., 0.], [1., 0.], [0., 1.]]
predictions = [[.9, .1], [.7, .3], [.4, .6]]
loss = binary_cross_entropy(true_values, predictions)
#loss = 0.32428703

CATEGORICAL CROSS-ENTROPY LOSS

Binary Cross-Entropy is a special case of Categorical Cross-Entropy
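
A quick way to see this (a hedged check, reusing the two-class example from the previous section) is to feed the same records to Keras's CategoricalCrossentropy; it should produce essentially the same loss as BinaryCrossentropy did:

import tensorflow as tf
from tensorflow.keras.losses import CategoricalCrossentropy

true_values = [[1., 0.], [1., 0.], [0., 1.]]
predictions = [[.9, .1], [.7, .3], [.4, .6]]
cce = CategoricalCrossentropy()
loss = cce(true_values, predictions).numpy()
# ≈ 0.3243, matching the binary cross-entropy result above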

Consider you are dealing with a classification problem involving only 3 classes/outcomes and 3 records.

The true outcomes are one hot encoded

The true classes or outcomes of those records are [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]]

The predicted probabilities of those records are [[.9, .05, .05], [.05, .89, .06], [.05, .01, .94]]

import numpy as np

true_values = [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]]
predictions = [[.9, .05, .05], [.05, .89, .06], [.05, .01, .94]]

# Record 1: true_values[0] = [1., 0., 0.], predictions[0] = [.9, .05, .05]
cce_loss_first = -(1 * np.log(0.9))
#cce_loss_first = 0.10536051565782628

# Record 2: true_values[1] = [0., 1., 0.], predictions[1] = [.05, .89, .06]
cce_loss_second = -(1 * np.log(0.89))
#cce_loss_second = 0.11653381625595151

# Record 3: true_values[2] = [0., 0., 1.], predictions[2] = [.05, .01, .94]
cce_loss_third = -(1 * np.log(0.94))
#cce_loss_third = 0.06187540371808753

The final loss, or the resulting cost, is calculated by taking the mean of all the individual losses.

loss = (cce_loss_first + cce_loss_second + cce_loss_third)/3
#loss = 0.09458991187728844

Using the same records and the same predictions, we compute the cost with the built-in categorical cross-entropy loss function in Keras.

from tensorflow.keras.losses import CategoricalCrossentropy

true_values = [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]]
predictions = [[.9, .05, .05], [.05, .89, .06], [.05, .01, .94]]
cce = CategoricalCrossentropy()
loss = cce(true_values, predictions).numpy()
#loss = 0.09458993

Now, based on the intuition we have built, we are going to write our own categorical cross-entropy loss function. This is the gist of what we have discussed so far, wrapped in a function.

def categorical_cross_entropy(true_values, predictions):
    y_true = tf.cast(true_values, dtype=tf.float32)
    y_pred = tf.cast(predictions, dtype=tf.float32)
    X = tf.multiply(y_true, tf.math.log(y_pred))
    return (-tf.reduce_sum(X) / len(y_true)).numpy()
true_values = [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]]
predictions = [[.9, .05, .05], [.05, .89, .06], [.05, .01, .94]]
loss = categorical_cross_entropy(true_values,predictions)
#loss = 0.09458993

SPARSE CATEGORICAL CROSS-ENTROPY

The true outcomes are not one hot encoded

Class or outcome labels start from 0. If there are 4 outcomes or 4 classes, the class labels are 0, 1, 2, 3.
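
For intuition, each integer label simply picks out one row of an identity matrix, i.e. its one-hot counterpart (a small illustrative sketch):

import numpy as np

labels = [0, 1, 2, 3]        # sparse (integer) class labels
one_hot = np.eye(4)[labels]  # the equivalent one-hot rows
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]]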

Consider an example consisting of 6 records, where each record has 3 possible outcomes.

The true outcomes are [1, 0, 2, 1, 2, 0]

The estimated outcome probabilities of all the records are listed in the code below.

We can follow two approaches for calculating the loss

One approach is to convert the true labels or true outcome classes into one-hot labels and follow the same procedure as in the categorical cross-entropy loss (a sketch of this approach appears at the end of this section).

We will explore the second approach with the following example.

true_values = [1, 0, 2, 1, 2, 0]
predictions = [
[.05, .9, .05],
[.89, .05, .06],
[.05, .01, .94],
[.1, .8, .1],
[.7, .2, .1],
[.08, .05, .87]
]

Now we have the predictions and the true_values, and the number of predictions equals the number of true values.

Consider each item in true_values and the corresponding item in predictions at the same index.

Extract the estimated probability of the true class from the predictions.

The probability the true distribution assigns to the true class is always 1, while the probability the model assigns to the true class is predictions[true_class].

For each record, treat these two values as the two distributions and calculate the cross-entropy, which reduces to -1*log(predictions[true_class]).

The cross-entropy loss for this dataset is the mean of all the individual per-record cross-entropies, which is equal to 0.8892045040413961.

Calculation of individual losses

individual_ce_losses = [-np.log(predictions[i][true_values[i]]) 
for i in range(len(true_values))]

Calculation of final loss by taking the mean of the individual losses

loss = np.mean(individual_ce_losses)
#loss = 0.8892045040413961

Using the same records and the same outcome estimates, we calculate the sparse categorical cross-entropy with the SparseCategoricalCrossentropy function in Keras.

cce = tf.keras.losses.SparseCategoricalCrossentropy()
loss = cce(
tf.cast(true_values,dtype=tf.float32),
tf.cast(predictions, dtype=tf.float32)
).numpy()
#loss = 0.8892045

Based on the intuition we have built, we are going to write our own sparse categorical cross-entropy loss function.

def sparse_categorical_cross_entropy(true_values, predictions):
    y_t = tf.cast(true_values, dtype=tf.int32)
    y_p = tf.cast(predictions, dtype=tf.float32)
    losses = [-np.log(y_p[i][y_t[i]]) for i in range(len(y_t))]
    return np.mean(losses)
loss = sparse_categorical_cross_entropy(true_values, predictions)
#loss = 0.8892045
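
For completeness, here is a sketch of the first approach mentioned earlier: one-hot encode the sparse labels with tf.one_hot and reuse the categorical_cross_entropy function we wrote in the previous section. Under these assumptions it should produce the same loss of roughly 0.8892.

# Approach 1 (sketch): convert sparse labels to one-hot, then apply categorical cross-entropy
one_hot_true = tf.one_hot(true_values, depth=3)  # shape (6, 3)
loss = categorical_cross_entropy(one_hot_true, predictions)
#loss ≈ 0.8892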

Conclusion

This was a brief discussion of how some of the classification loss functions work. We discussed the math behind them, built them from scratch, and showed how to use them through a high-level API.

You will find the complete code and data files associated with our discussion on GitHub.