So, the above is a distribution for a random variable Grade, which can take one of the values A, B or C. The distribution table tells us, for every value that the random variable can take, the probability of the random variable taking on that value. And of course, all these probabilities sum to 1.

So, when we take the output of a Sigmoid Neuron, say 0.7, we can think of it as a distribution: the probability of the output being ‘1’ (belonging to class 1) is 0.7 and the probability of the output being ‘0’ (belonging to class 0) is 0.3. This is the distribution that the Sigmoid Neuron gives. The true output is also a distribution: suppose the image contains text, then the true output is 1, and the true distribution gives the probability of the image containing text as 1 and the probability of the image not containing text as 0.

We can represent these distributions as vectors, with the entries in the order [class 0, class 1]:

True Distribution: [0 1]

Predicted Distribution: [0.3 0.7]

As these two distributions are just two vectors, the squared error loss can be used in this situation. But by doing so, we ignore the fact that these vectors are probability distributions with certain properties: every value is greater than or equal to 0 and less than or equal to 1, and all the values of a distribution sum to 1. So, we will look at a loss function that can deal with quantities that are probabilities.
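As a quick check of the arithmetic, the squared error between these two vectors can be computed as below. This is a minimal sketch in Python; the function name `squared_error` is just an illustrative choice, and the loss is taken here as the plain sum of squared differences (some texts average or halve it instead).

```python
def squared_error(true_dist, pred_dist):
    """Sum of squared differences between two equal-length vectors."""
    if len(true_dist) != len(pred_dist):
        raise ValueError("both distributions must have the same length")
    return sum((t - p) ** 2 for t, p in zip(true_dist, pred_dist))


true_dist = [0.0, 1.0]  # [P(class 0), P(class 1)]: the image contains text
pred_dist = [0.3, 0.7]  # output of the Sigmoid Neuron read as a distribution

loss = squared_error(true_dist, pred_dist)
print(round(loss, 4))   # (0 - 0.3)^2 + (1 - 0.7)^2 = 0.18
```

Note that nothing in this computation uses the fact that the entries are probabilities; it would work the same way on any two vectors of numbers.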

Certain Events:
Let’s say a tournament is going on with 4 teams A, B, C and D, and at the end team A actually won the tournament. That is a certain event: it has already happened. At this point, there is no question of the probability of team B, C or D winning, because we know that A has actually won the tournament. Even for certain events (sure events, or events that happen with 100% probability), we can still write the outcome as a distribution.

Let’s say we have a random variable X which indicates which team has actually won the tournament, and it can take on 4 values: A, B, C and D. Now we can ask for the probability of a particular team winning. Since the event has already happened, the true distribution puts all the probability mass on the certain or sure event.

So, for any kind of distribution, we need to give values for all possible outcomes and it is possible that some of these outcomes have 0 probability mass.

The true distribution for the above-discussed scenario, with the outcomes in the order A, B, C, D, is:

True Distribution: [1 0 0 0]

Now suppose you ask someone (who watched a few games of this tournament) what he thinks about the outcome of the tournament, and he gives you the distribution shown in the image below (highlighted in yellow):

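He wants to know how close his prediction was to the true output. To find out, we can use the Squared Error Loss, i.e. the sum of the squared differences between the true and the predicted probability of each outcome.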
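The same computation for the tournament can be sketched as follows. The true distribution comes from the scenario above (team A won); the predicted values are hypothetical numbers chosen only for illustration, not the ones from the image, and the helper names (`one_hot`, `is_distribution`, `squared_error`) are assumptions of this sketch.

```python
import math

TEAMS = ["A", "B", "C", "D"]


def one_hot(winner, outcomes):
    """Put all the probability mass on the outcome that actually happened."""
    return [1.0 if o == winner else 0.0 for o in outcomes]


def is_distribution(probs, tol=1e-9):
    """Check that every value lies in [0, 1] and that the values sum to 1."""
    in_range = all(0.0 <= p <= 1.0 for p in probs)
    return in_range and math.isclose(sum(probs), 1.0, abs_tol=tol)


def squared_error(true_dist, pred_dist):
    """Sum of squared differences over all outcomes."""
    return sum((t - p) ** 2 for t, p in zip(true_dist, pred_dist))


true_dist = one_hot("A", TEAMS)      # [1.0, 0.0, 0.0, 0.0]
pred_dist = [0.6, 0.2, 0.15, 0.05]   # hypothetical prediction, for illustration only

assert is_distribution(true_dist) and is_distribution(pred_dist)
print(squared_error(true_dist, pred_dist))
```

With these hypothetical numbers the loss comes out to about 0.225. The `is_distribution` check is kept separate on purpose: the squared error itself never looks at whether its inputs are valid probabilities.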

For sure events as well, we can write the output as a distribution and use the squared error loss function. And as mentioned earlier in this article, we will also discuss another loss function, one that takes into account the fact that these values are probabilities.

Why do we care about distributions?
We will be given an image and our first task is to find out if it contains text or not.

At training time, we can think of a random variable Class associated with the image, which could be 0 (the image does not contain text) or 1 (the image contains text).

So, for the above case (when the text in the input is Mumbai), all the probability mass is on the event that the image contains ‘Text’, and the probability mass on the event ‘No Text’ is 0. This is a certain event: we have already seen the image and it contains text, so there is no probability left for the ‘No Text’ event. We can write this as the distribution y = [0 1]. This is known as the one-hot encoded form, as only one of the entries is 1 and the rest are all 0.

At training time, we use a Sigmoid function (which gives a value between 0 and 1). We want the output of the Sigmoid to be 1 when the image contains text and 0 when the image does not contain text. The model gives us an output between 0 and 1 depending on how confident it is that the input contains text.

Here x in the above equation represents the input image, which could be of size, say, 30 x 30, so we have 900 values in x, correspondingly 900 weights, and a bias term; using all of these we compute the predicted output. Suppose it gives a value of 0.7. Again, we can think of this as a probability distribution: the probability of the image containing text is 0.7, and the remaining 1 − 0.7 = 0.3 is the probability that the image does not contain text. If the model were perfect, it would have given the probability of the image containing text as 1 and the probability of the image not containing text as 0. But the model is not perfect, its parameters are still being trained, and the output we get at this point is 0.7.
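The computation described above can be sketched as follows, assuming the standard sigmoid neuron form, i.e. the sigmoid applied to the weighted sum of the inputs plus the bias. The pixel values and weights here are random placeholders rather than trained parameters, so the printed output will not be 0.7; the names `sigmoid` and `sigmoid_neuron` are illustrative.

```python
import math
import random


def sigmoid(z):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_neuron(x, w, b):
    """Predicted output: sigmoid of the weighted sum of the inputs plus the bias."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


random.seed(0)
n_pixels = 30 * 30  # 900 input values for a 30 x 30 image

# Placeholder inputs and parameters (assumptions, not trained values).
x = [random.random() for _ in range(n_pixels)]
w = [random.uniform(-0.01, 0.01) for _ in range(n_pixels)]
b = 0.0

y_hat = sigmoid_neuron(x, w, b)   # read as P(image contains text)
pred_dist = [1.0 - y_hat, y_hat]  # [P(no text), P(text)]
print(pred_dist)
```

Because the second entry is defined as one minus the first, the two values always form a valid distribution, whatever the weights happen to be.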

Now we are interested in knowing how bad the model is with its current parameter setting, so that we can update the weights accordingly to make it better and gradually reach the distribution where the loss is 0.

We can use the squared error loss for this, and we will also discuss another loss function that takes into account that these values are probabilities.

In the above scenario, we have discussed binary classification, but the same concept holds for multi-class classification as well.