Gradient of the weights while backpropagating through an affine layer


Hello all,

I’ve been working through the cs231n course assignments for some time. What intrigues me is this:

scores = X.dot(W) + b, where X is an N x D matrix, W is a D x C matrix, and b is the bias of dimension (C,).
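For concreteness, here is a minimal sketch of that forward pass (the sizes are just made-up toy values, not from the assignment):

    import numpy as np

    N, D, C = 4, 5, 3            # toy sizes for illustration

    X = np.random.randn(N, D)    # inputs, N x D
    W = np.random.randn(D, C)    # weights, D x C
    b = np.random.randn(C)       # bias, (C,)

    scores = X.dot(W) + b        # scores, N x C
    print(scores.shape)          # (4, 3)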

Assume dscores => dL/dscores (where L is the loss), and dW => dL/dW, for clarity.

By the same pattern I would have expected dscores = X.dot(dW). But I read some code here,

and it’s given there that:

dW = np.dot(X.T, dscores), and I can't see where this comes from. They say they used the chain rule to get this result, but I don't understand how.
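I did try to sanity-check the formula numerically with a toy loss of my own (not from the assignment), and it does seem to match, so my question is really about the derivation, not whether it's correct:

    import numpy as np

    # Toy loss L = sum(scores), so dscores = dL/dscores is all ones.
    np.random.seed(0)
    N, D, C = 4, 5, 3
    X = np.random.randn(N, D)
    W = np.random.randn(D, C)
    b = np.random.randn(C)

    def loss(W):
        return np.sum(X.dot(W) + b)

    dscores = np.ones((N, C))           # dL/dscores for this toy loss
    dW_analytic = np.dot(X.T, dscores)  # the formula in question

    # Finite-difference gradient of L with respect to W.
    h = 1e-5
    dW_numeric = np.zeros_like(W)
    for i in range(D):
        for j in range(C):
            Wp = W.copy(); Wp[i, j] += h
            Wm = W.copy(); Wm[i, j] -= h
            dW_numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * h)

    print(np.max(np.abs(dW_analytic - dW_numeric)))  # ~0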

Any help is highly appreciated!

submitted by /u/moonkraken