How PyTorch's backward() function works

Source: Deep Learning on Medium


It’s been a few months since I started working with the PyTorch framework, and it’s incredibly enjoyable: its dynamic graphs, its just-right level of abstraction and flexibility, and, above all, its shallow learning curve.

Although I found it easy to get familiar with the concepts of dynamic graphs and autograd — if you’re not familiar with them, I recommend the great article “Getting started with Pytorch part 1: understanding how automatic differentiation works” — I found it confusing that PyTorch’s backward() function takes a tensor as an argument. What does it represent, and where does it fit into how autograd works? I found some answers online, but they still didn’t remove the ambiguity of this parameter. In the end I felt pretty dumb when I found out it was a very simple concept, so I decided to write this step-by-step guide to explain it in a simplified way, especially for beginners like me.


Step 1: The gradient of a scalar loss function

When we use gradient descent as the learning algorithm for our model, we need to compute the gradient of the loss w.r.t. the model parameters.
The loss term is usually a scalar value obtained by applying a loss function (criterion) to the model prediction and the true label — in a supervised learning setting — and we usually call loss.item() to get a single Python number out of the loss tensor.

When we start propagating the gradients backward, we begin by computing the derivative of this scalar loss (L) w.r.t. the immediately preceding hidden layer (h), which is a vector (a group of weights). What would the gradient be in this case?
Simply an n×1 Jacobian matrix: a vector containing the derivative of the loss with respect to each of the hidden layer variables (h1, h2, h3 … hn).
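As a concrete sanity check (with a hypothetical loss L(h) = h1² + … + hn², not one from the article), the gradient of a scalar loss w.r.t. a vector can be sketched in plain Python and verified with finite differences:

```python
# Hypothetical scalar loss L(h) = sum of h_i^2; its gradient w.r.t.
# the hidden vector h is the vector of partials (2*h1, ..., 2*hn).

def loss(h):
    return sum(x * x for x in h)

def grad_loss(h):
    # Analytic gradient: dL/dh_i = 2 * h_i
    return [2.0 * x for x in h]

def numerical_grad(f, h, eps=1e-6):
    # Central finite differences, as a check on the analytic form
    g = []
    for i in range(len(h)):
        hp, hm = h[:], h[:]
        hp[i] += eps
        hm[i] -= eps
        g.append((f(hp) - f(hm)) / (2 * eps))
    return g

h = [1.0, -2.0, 0.5]
print(grad_loss(h))               # [2.0, -4.0, 1.0]
print(numerical_grad(loss, h))    # approximately the same values
```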

Step 2: The gradient of a vector loss function

Let’s say we now want to compute the gradient of some loss vector (l) w.r.t. a hidden layer vector; then we need to compute the full Jacobian.

By looking at our gradient descent step

it’s obvious that we’re actually interested in a gradient vector rather than the full Jacobian. The gradient can thus be obtained by summing the gradients of all the variables in the loss vector w.r.t. each variable in the hidden layer, and it has the following formula

Note that when 𝑚 = 1 (a scalar loss) the Jacobian is the same as the gradient, since the Jacobian is a generalization of the gradient.
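To make the summation formula concrete, here is a plain-Python sketch with a hypothetical two-component loss l(h) = (h1 + h2, h1·h2): we build the full Jacobian J, with J[i][j] = ∂l_i/∂h_j, and sum over the loss components for each hidden variable:

```python
# Hypothetical vector loss l(h) = (h1 + h2, h1 * h2).

def jacobian(h):
    h1, h2 = h
    return [[1.0, 1.0],   # dl1/dh1, dl1/dh2
            [h2,  h1]]    # dl2/dh1, dl2/dh2

def grad_from_jacobian(J):
    # g_j = sum over loss components i of dl_i/dh_j (column-wise sum)
    return [sum(row[j] for row in J) for j in range(len(J[0]))]

J = jacobian([3.0, 4.0])
print(grad_from_jacobian(J))   # [5.0, 4.0]
```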

Step 3: the Jacobian-vector product

We can easily show that this gradient can be obtained by multiplying the full Jacobian matrix by a vector of ones, as follows

Awesome! This ones vector is exactly the argument that we pass to the backward() function to compute the gradient, and this expression is called the Jacobian-vector product!
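A minimal sketch of that product in plain Python (with a made-up 2×2 Jacobian): multiplying the Jacobian, transposed, by a vector of ones sums the per-component gradients, which is exactly the summation from Step 2:

```python
def jacobian_vector_product(J, v):
    # Computes J^T @ v: result_j = sum_i v_i * J[i][j]
    return [sum(v[i] * J[i][j] for i in range(len(J)))
            for j in range(len(J[0]))]

# Made-up Jacobian of a 2-component loss w.r.t. a 2-dim hidden layer
J = [[1.0, 1.0],
     [4.0, 3.0]]
ones = [1.0, 1.0]
print(jacobian_vector_product(J, ones))   # [5.0, 4.0]
```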

Step 4: Jacobian-vector product in backpropagation

To see how PyTorch computes the gradients using the Jacobian-vector product, let’s take the following concrete example:
assume we have two transformation functions F1 and F2, and three vectors x, y, z, each of dimension 2.

If we wanted to compute the gradient dz/dx using the chain rule, we would calculate it as dz/dx = dz/dy * dy/dx. Now, since z is the final output of the graph, we can directly compute the gradient dz/dy using the Jacobian-vector product, where the full Jacobian is Jz and the vector is the ones vector.

Now, to get dz/dx, we compute the full Jacobian Jy of y w.r.t. x, but instead of multiplying it by the ones vector we multiply it by the gradient vector dz/dy, as follows

Brilliant! This way we can just keep multiplying the gradient vectors of descendant nodes in our computational graph by the full Jacobians of their ancestors, backward to the leaf nodes.
And we can interpret the ones vector that we provide to the backward() function as an external gradient from which the chain rule starts the multiplication process!
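The chaining described above can be sketched in plain Python with stand-in transformations (the article’s figures define their own F1 and F2; these hypothetical ones are chosen so the Jacobians are easy to write down): F1(x) = (2·x1, 3·x2) and F2(y) = (y1 + y2, y1·y2).

```python
def vjp(J, v):
    # One backward step: propagate the gradient vector v through
    # a single Jacobian J, computing J^T @ v
    return [sum(v[i] * J[i][j] for i in range(len(J)))
            for j in range(len(J[0]))]

x = [1.0, 2.0]
y = [2 * x[0], 3 * x[1]]           # y = F1(x)
Jy = [[2.0, 0.0],                  # dy/dx, Jacobian of F1
      [0.0, 3.0]]
Jz = [[1.0, 1.0],                  # dz/dy, Jacobian of F2 at y
      [y[1], y[0]]]

ones = [1.0, 1.0]                  # the "external gradient"
g_y = vjp(Jz, ones)                # gradient w.r.t. y
g_x = vjp(Jy, g_y)                 # chained back to the leaf x
print(g_x)                         # [14.0, 9.0]
```

Note that at no point do we form the matrix product Jz @ Jy; each backward step only keeps a length-2 gradient vector.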

Why does PyTorch use the Jacobian-vector product?

As we propagate gradients backward, keeping the full Jacobian matrix around is not memory friendly, especially when training a giant model: for a layer with more than 100K parameters, a single full Jacobian would have over 100K × 100K entries. Instead, we only need to keep the most recent gradient vector, which is way more memory efficient.

What if we pass numbers other than ones in the external gradient vector (the backward() argument)?

The values in the external gradient vector serve as weights, or importances, for each loss component. Let’s say we fed the vector [0.2 0.8] in the previous example; what we get is this
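In PyTorch terms, a minimal sketch of this weighting (with a simple y = 3·x in place of the article’s example, so the numbers are easy to check by hand):

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = 3 * x                        # y1 = 3*x1, y2 = 3*x2

# The external gradient weights the two output components:
# x.grad = 0.2 * dy1/dx + 0.8 * dy2/dx
y.backward(torch.tensor([0.2, 0.8]))
print(x.grad)                    # tensor([0.6000, 2.4000])
```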

PyTorch example

# in case of a scalar output
import torch

x = torch.randn(3, requires_grad=True)
y = x.sum()
y.backward()  # equivalent to y.backward(torch.tensor(1.))
print(x.grad)
# out: tensor([1., 1., 1.])

# in case of a vector output
x = torch.randn(3, requires_grad=True)

y = x * 2
while y.data.norm() < 1000:
    y = y * 2

v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(v)

print(x.grad)
# out (the overall scale depends on the random x), e.g.:
#     tensor([1.0240e+02, 1.0240e+03, 1.0240e-01])

References

(1) PyTorch autograd tutorial
(2) “Getting started with Pytorch part 1: understanding how automatic differentiation works”