Source: Deep Learning on Medium
demystifying gradients of softmax - part 1
what's inside the derivative of the softmax function
HELLO FRIEND !! If you are here, I'm sure you are as curious or confused as I was when dealing with the gradients of softmax. As Data Scientists, we appreciate the power of softmax in our day-to-day networks, but we often limit ourselves to nn.Softmax, and it's cool to go beyond that. Stay with me, friend !!
!! Disclaimer !! This is my first technical article, and it will come in two stages.
We know that softmax computes the following function on the forward pass: S_i = exp(a_i) / Σ_j exp(a_j), i.e. each input is exponentiated and divided by the sum of the exponentials of all inputs.
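As a quick sanity check, the forward pass can be sketched in plain NumPy (a minimal sketch; the helper name softmax is mine):

```python
import numpy as np

def softmax(a):
    # subtract the max for numerical stability; the result is unchanged
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([0.1526, 1.5598, 0.8768])
print(softmax(a))  # ≈ [0.1399, 0.5715, 0.2886]
```

Note that the outputs are positive and sum to 1, which is why softmax is read as a probability distribution over the inputs.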
But things get interesting in the backward pass. Let's consider an input tensor a whose softmax transform is S, with j elements. Since the computation involves a sum across all j elements of the input tensor, we cannot take an element-wise derivative as we would for sigmoid or ReLU: every other element has an impact on our element of interest (e.g. 0.1526). So it's time for partial derivatives, and here come the Jacobians !!!
import torch

# a --> [0.1526, 1.5598, 0.8768]
a = torch.tensor([0.1526, 1.5598, 0.8768])
s = torch.nn.functional.softmax(a, dim=-1)
# s --> [0.1399, 0.5715, 0.2886]
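To see why an element-wise derivative is not enough, nudge a single input and watch every output move (a small illustrative sketch, not from the original):

```python
import torch

a = torch.tensor([0.1526, 1.5598, 0.8768])
s = torch.nn.functional.softmax(a, dim=-1)

a_bumped = a.clone()
a_bumped[0] += 0.5  # perturb only the first input
s_bumped = torch.nn.functional.softmax(a_bumped, dim=-1)

# all three outputs change, because they share the normalising sum:
# the first output grows, and the other two shrink to compensate
print(s)
print(s_bumped)
```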
Now we have to compute the Jacobian matrix D of all i elements of the softmax output S w.r.t. all j elements of the input a, i.e. for every combination of the N elements of S and a.
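Before deriving it by hand, we can ask autograd for this Jacobian directly; a sketch using torch.autograd.functional.jacobian (the wrapper function is my own naming):

```python
import torch

a = torch.tensor([0.1526, 1.5598, 0.8768])

def softmax_fn(x):
    return torch.nn.functional.softmax(x, dim=-1)

# D[i, j] = dS_i / da_j: one row per output element, one column per input
D = torch.autograd.functional.jacobian(softmax_fn, a)
print(D)
```

Two properties worth noticing in the printed matrix: each row sums to zero (the outputs are constrained to sum to 1), and the matrix is symmetric.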
Don't worry, Friend !! We will frame a couple of hypotheses to understand and compute the above derivative.
Time for hypothesis-2
I feel it's good to take a break here. Catch you on the other side (stage 2). This article is open to constructive criticism. Goodbye, Friend !!