demystifying gradients of softmax

demystifying gradients of softmax (stage 1)

what’s inside the derivative of the softmax function

HELLO FRIEND !! If you are here, I’m sure you are as curious or confused as I was when I first dealt with the gradients of softmax. As data scientists, we appreciate the power of softmax in our day-to-day networks, but we often limit ourselves to calling nn.softmax, and it’s worth going beyond that. Stay with me, friend !!

!! Disclaimer !! This is my first technical article, and it will come in two stages.

We know that softmax computes the function below on the forward pass.
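
With an input tensor a of N elements and softmax output S (the notation used in the rest of this article), this is the standard softmax:

S_i = \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}}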

But things get interesting in the backward pass. Let’s consider an input tensor a whose softmax transform is S, both with j elements. Since the computation involves a sum across all j elements of the input tensor, we cannot calculate the derivative element-wise the way we would for a sigmoid or ReLU: every other element has an impact on our element of interest (e.g. 0.1526). So it’s time for partial derivatives, and here come the Jacobians !!!

import torch
import numpy as np
seed = 7
torch.random.manual_seed(seed)
np.random.seed(seed)
# input tensor a: 3 elements drawn uniformly from [0, 2)
a = torch.from_numpy(np.random.uniform(low=0.0, high=2.0, size=(3,)))
a.requires_grad_(True)
# a --> [0.1526, 1.5598, 0.8768]
# softmax transform S of the input a
s = torch.nn.functional.softmax(a, dim=-1)
# s --> [0.1399, 0.5715, 0.2886]
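
By the way, PyTorch can already hand us the full matrix we are about to derive: torch.autograd.functional.jacobian (available in recent PyTorch versions) computes it through autograd, which makes a nice sanity check for the math later. A minimal sketch, using the a defined above:

# full Jacobian of softmax(a) w.r.t. a, computed by autograd
from torch.autograd.functional import jacobian
D_autograd = jacobian(lambda x: torch.nn.functional.softmax(x, dim=-1), a)
# D_autograd[i][j] = dS_i / da_j  --> a 3x3 matrix here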

Now we have to compute the Jacobian matrix D: the derivative of every element i of the softmax output S with respect to every element j of the input a, i.e. for every combination of the N elements across S and a.

[Figure: the Jacobian matrix D of S w.r.t. a; len(i) and len(j) are equal]
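
Written out, D is the N x N matrix with entries D_{ij} = \partial S_i / \partial a_j:

D = \frac{\partial S}{\partial a} =
\begin{bmatrix}
\frac{\partial S_1}{\partial a_1} & \frac{\partial S_1}{\partial a_2} & \cdots & \frac{\partial S_1}{\partial a_N} \\
\frac{\partial S_2}{\partial a_1} & \frac{\partial S_2}{\partial a_2} & \cdots & \frac{\partial S_2}{\partial a_N} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial S_N}{\partial a_1} & \frac{\partial S_N}{\partial a_2} & \cdots & \frac{\partial S_N}{\partial a_N}
\end{bmatrix}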

Don’t worry, friend !! We will frame a couple of hypotheses to understand and compute the above derivative.

Long Live LaTeX
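
As a sketch (assuming hypothesis-1 refers to the diagonal case i = j, the usual first case of this derivation), the quotient rule applied to S_i gives:

\frac{\partial S_i}{\partial a_j}
= \frac{\partial}{\partial a_j}\left(\frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}}\right)
= \frac{e^{a_i}\sum_{k=1}^{N} e^{a_k} - e^{a_i}\,e^{a_j}}{\left(\sum_{k=1}^{N} e^{a_k}\right)^{2}}
= S_i - S_i S_j
= S_i\,(1 - S_i) \qquad \text{for } i = j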

Time for hypothesis-2

I feel it’s good to take a break here. Catch you on the other side (stage-2). This article is open to constructive criticism. Goodbye, friend !!