Softmax is one of the most widely used activation functions in deep learning, and in classification it is typically trained with a cross-entropy loss. When I started using it, it was hard for me to get the intuition behind it. After Googling a bit and munching on the concepts I got from different sources, I was able to get a satisfactory intuition, and I would like to share it in this article.
To get the complete intuition, we need to understand the concepts in the following order: surprisal, entropy, cross-entropy, and softmax.
Surprisal is the “degree to which you are surprised to see the result.”
Now it's easy to digest what I mean when I say that I will be more surprised to see an outcome with low probability than one with high probability. If $y_i$ is the probability of the $i$-th outcome, then we could represent its surprisal (s) as:

$$s_i = \log\frac{1}{y_i} = -\log y_i$$
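To make this concrete, here is a minimal sketch of surprisal in Python. It assumes the natural logarithm (a different base only changes the unit), and the function name `surprisal` is my own choice for illustration.

```python
import math

def surprisal(prob):
    """Surprisal of an outcome with probability `prob`, in nats (natural log assumed)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return math.log(1.0 / prob)

rare = surprisal(0.01)     # an unlikely outcome: large surprisal
common = surprisal(0.99)   # a likely outcome: small surprisal
certain = surprisal(1.0)   # a certain outcome: no surprise at all

print(rare, common, certain)
```

The rare outcome gets the largest value, and an outcome that is certain to happen carries zero surprisal.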
Since I know the surprisal of individual outcomes, I would like to know the surprisal of the event as a whole. It would be intuitive to take a weighted average of the surprisals. Now the question is what weight to choose? Hmmm… since I know the probability of each outcome, taking the probability as the weight makes sense, because this is how likely each outcome is to occur. This weighted average of surprisal is nothing but entropy (e), and if there are n outcomes then it could be written as:

$$e = \sum_{i=1}^{n} y_i \log\frac{1}{y_i} = -\sum_{i=1}^{n} y_i \log y_i$$
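The probability-weighted average described above can be sketched in a few lines of Python. This is only an illustration: the natural log is an assumption, and outcomes with zero probability are skipped because they contribute nothing to the average.

```python
import math

def entropy(probs):
    """Probability-weighted average surprisal, in nats (natural log assumed)."""
    # Outcomes with probability 0 never occur, so they add nothing to the average.
    return sum(p * math.log(1.0 / p) for p in probs if p > 0.0)

fair_coin = entropy([0.5, 0.5])     # hardest to predict: highest entropy for two outcomes
biased_coin = entropy([0.9, 0.1])   # easier to predict: lower entropy
sure_thing = entropy([1.0, 0.0])    # nothing to be surprised about

print(fair_coin, biased_coin, sure_thing)
```

The more predictable the event, the lower its average surprisal.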
Now, what if each outcome’s actual probability is $p_i$ but someone is estimating it as $q_i$? In this case, each outcome will occur with probability $p_i$, but its surprisal will be computed from $q_i$ (since that person will be surprised as if the probability of the outcome were $q_i$). The weighted average surprisal, in this case, is nothing but cross entropy (c), and it could be scribbled as:

$$c = \sum_{i=1}^{n} p_i \log\frac{1}{q_i} = -\sum_{i=1}^{n} p_i \log q_i$$
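Here is a small sketch of that idea: outcomes are weighted by their actual probabilities, while the surprisal comes from the estimated ones. The distributions below are made-up numbers for illustration, and the function assumes every estimated probability is greater than zero wherever the actual probability is.

```python
import math

def cross_entropy(actual, estimated):
    """Average surprisal under `estimated`, weighted by `actual` (natural log assumed)."""
    # Assumes estimated[i] > 0 wherever actual[i] > 0.
    return sum(p * math.log(1.0 / q)
               for p, q in zip(actual, estimated) if p > 0.0)

actual = [0.7, 0.2, 0.1]        # hypothetical true distribution
good_guess = [0.6, 0.3, 0.1]    # estimate close to the truth
bad_guess = [0.1, 0.2, 0.7]     # estimate far from the truth

print(cross_entropy(actual, good_guess))
print(cross_entropy(actual, bad_guess))
print(cross_entropy(actual, actual))   # same as the entropy of `actual`
```

The estimate that is further from the actual distribution produces the larger value.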
Cross entropy is never smaller than entropy, and the two are equal only when $p_i = q_i$ for every outcome. That last sentence is easier to digest after seeing the really nice plot given by desmos.com
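If the plot is not at hand, the same behaviour can be checked numerically. The sketch below uses a made-up two-outcome distribution, sweeps the estimated probability of the first outcome, and looks for the estimate with the lowest cross entropy. The names and numbers are assumptions chosen for illustration.

```python
import math

def cross_entropy(actual, estimated):
    """Average surprisal under `estimated`, weighted by `actual` (natural log assumed)."""
    return sum(p * math.log(1.0 / q)
               for p, q in zip(actual, estimated) if p > 0.0)

actual = [0.3, 0.7]                          # hypothetical true distribution
entropy_of_actual = cross_entropy(actual, actual)

# Try estimates 0.01, 0.02, ..., 0.99 for the first outcome.
candidates = [k / 100 for k in range(1, 100)]
costs = [cross_entropy(actual, [q1, 1.0 - q1]) for q1 in candidates]

best_q1 = candidates[costs.index(min(costs))]
smallest_gap = min(cost - entropy_of_actual for cost in costs)

print(best_q1, smallest_gap)
```

The lowest cross entropy in the sweep occurs where the estimate matches the actual probability, and no estimate goes below the entropy itself.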
In the plot I mentioned earlier, you will notice that as the estimated probability distribution moves away from the actual/desired probability distribution, cross entropy increases, and vice versa. Hence, we could say that minimizing cross entropy will move us closer to the actual/desired distribution. Ta-da… that is what we want. We try to reduce the cross-entropy loss so that our predicted probability distribution ends up being close to the actual one. Finally, with $p_i$ as the actual/desired probability and $q_i$ as the predicted probability of class $i$, we get the formula of cross-entropy loss as:

$$L = -\sum_{i=1}^{n} p_i \log q_i$$
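As a rough illustration of how this loss is used with softmax, the sketch below turns raw scores (logits) into a predicted distribution and computes the negative sum of target times log-prediction against a one-hot target. The logits are made-up numbers and the function names are my own, so treat it as a sketch rather than a reference implementation.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(target, predicted):
    """-sum(target_i * log(predicted_i)), natural log assumed."""
    return -sum(t * math.log(q) for t, q in zip(target, predicted) if t > 0.0)

target = [0.0, 1.0, 0.0]                      # one-hot: the true class is index 1
confident_right = softmax([0.5, 4.0, 0.1])    # hypothetical logits favouring class 1
confident_wrong = softmax([4.0, 0.5, 0.1])    # hypothetical logits favouring class 0

loss_right = cross_entropy_loss(target, confident_right)
loss_wrong = cross_entropy_loss(target, confident_wrong)

print(loss_right, loss_wrong)
```

With a one-hot target only the true class contributes, so the loss is simply the surprisal of the probability the model assigned to the correct class.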
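And in the case of a binary classification problem, where we have only two classes, the above formula becomes (writing $p$ for the actual probability of the first class and $q$ for its predicted probability):

$$L = -\left[\, p \log q + (1 - p) \log (1 - q) \,\right]$$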
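A minimal sketch of the two-class case follows. The clipping constant `eps` is my own addition to avoid taking the log of zero, and the natural log is assumed.

```python
import math

def binary_cross_entropy(actual, predicted, eps=1e-12):
    """Two-class cross-entropy loss; `actual` and `predicted` refer to the first class."""
    # Clip the prediction away from 0 and 1 so the logs stay finite (assumption).
    predicted = min(max(predicted, eps), 1.0 - eps)
    return -(actual * math.log(predicted)
             + (1.0 - actual) * math.log(1.0 - predicted))

good = binary_cross_entropy(1.0, 0.9)   # true class predicted with high probability
poor = binary_cross_entropy(1.0, 0.1)   # true class predicted with low probability

print(good, poor)
```

Because the second class has probabilities `1 - actual` and `1 - predicted`, this is the general sum over classes written out for two terms.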
Feeling satisfied now?
Source: Deep Learning on Medium