Forward and backward propagations for 2D Convolutional layers


Convolutional layer

The layer transforms the output of the previous layer, A_prev, of height n_H_prev, width n_W_prev and C channels, into the variable Z of height n_H, width n_W and F channels.

Convolutional layer: input and output shapes

The parameters of this layer are:

  • F kernels (or filters), defined by their weights w_{i,j,c}^f and biases b^f
  • The kernel sizes (k1, k2) explained above
  • An activation function g
  • The strides (s1, s2), which define the step with which the kernel slides over the input image
  • The paddings (p1, p2), which define the number of zeros added on the borders of A_prev and, together with the strides, determine the output size, as shown below
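
Although not written out above, these parameters fix the output dimensions through the usual formula:

n_H = ⌊(n_H_prev + 2·p1 − k1) / s1⌋ + 1

n_W = ⌊(n_W_prev + 2·p2 − k2) / s2⌋ + 1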

Forward propagation

The convolutional layer operates on the padded input; we therefore consider A_prev_pad in the convolution.

The equations of forward propagation are then:

z_{i,j,f} = Σ_{c=0}^{C-1} Σ_{m=0}^{k1-1} Σ_{n=0}^{k2-1} w_{m,n,c}^f · a_prev_pad_{i·s1+m, j·s2+n, c} + b^f    [1]

a_{i,j,f} = g(z_{i,j,f})
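
As an illustration, here is a minimal NumPy sketch of this forward pass (naive loops, channels-last layout; the function and argument names are illustrative, not from a particular library):

```python
import numpy as np

def conv_forward(A_prev, W, b, strides=(1, 1), paddings=(0, 0)):
    """Naive forward pass of a 2D convolutional layer, Eq. [1].

    A_prev: (n_H_prev, n_W_prev, C) activations of the previous layer
    W:      (k1, k2, C, F) kernel weights w_{m,n,c}^f
    b:      (F,) biases b^f
    """
    (s1, s2), (p1, p2) = strides, paddings
    k1, k2, C, F = W.shape
    # Zero-pad the borders of A_prev
    A_prev_pad = np.pad(A_prev, ((p1, p1), (p2, p2), (0, 0)))
    n_H = (A_prev.shape[0] + 2 * p1 - k1) // s1 + 1
    n_W = (A_prev.shape[1] + 2 * p2 - k2) // s2 + 1
    Z = np.zeros((n_H, n_W, F))
    for i in range(n_H):
        for j in range(n_W):
            # Patch of A_prev_pad seen by the kernel at output position (i, j)
            patch = A_prev_pad[i * s1:i * s1 + k1, j * s2:j * s2 + k2, :]
            for f in range(F):
                Z[i, j, f] = np.sum(patch * W[..., f]) + b[f]
    return Z, A_prev_pad  # A_prev_pad is kept for the backward pass
```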

Backward propagation

Backward propagation has three goals:

  • Propagate the error from a layer to the previous one
  • Compute the derivative of the error with respect to the weights
  • Compute the derivative of the error with respect to the biases

Notation

For ease of notation, writing E for the error of the network, we define dx := ∂E/∂x for any variable x of the layer; in particular da_{i,j,f} = ∂E/∂a_{i,j,f}, dz_{i,j,f} = ∂E/∂z_{i,j,f}, dw_{i,j,c}^f = ∂E/∂w_{i,j,c}^f and db^f = ∂E/∂b^f.

The maths!

In practice, when performing the backward pass of a layer, we always know either da_{i,j,f} or dz_{i,j,f}; here we assume that da_{i,j,f} is known.

The expression of dz_{i,j,f} is then given by Eq. [2]:

dz_{i,j,f} = da_{i,j,f} · g'(z_{i,j,f})    [2]

where g' is the derivative of g.

Using the chain rule, we can compute dw_{i,j,c}^f:

dw_{i,j,c}^f = Σ_{m,n,k} dz_{m,n,k} · ∂z_{m,n,k}/∂w_{i,j,c}^f    [3]

Recalling that z_{m,n,k} only involves the weights of the kth filter (see Eq. [1]), the weights of the fth kernel are only linked to the fth channel of dz:

dw_{i,j,c}^f = Σ_{m=0}^{n_H-1} Σ_{n=0}^{n_W-1} dz_{m,n,f} · ∂z_{m,n,f}/∂w_{i,j,c}^f    [4]

We can then obtain ∂z_{m,n,f}/∂w_{i,j,c}^f using Eq. [1]:

∂z_{m,n,f}/∂w_{i,j,c}^f = a_prev_pad_{m·s1+i, n·s2+j, c}

Plugging this into Eq. [4], we obtain Eq. [5]:

dw_{i,j,c}^f = Σ_{m=0}^{n_H-1} Σ_{n=0}^{n_W-1} dz_{m,n,f} · a_prev_pad_{m·s1+i, n·s2+j, c}    [5]

One can notice that this is the cross-correlation of A_prev_pad with dZ used as the kernel.
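
A minimal NumPy sketch of Eq. [5], reusing the shapes of the forward sketch above (the names are again illustrative):

```python
def conv_backward_weights(dZ, A_prev_pad, kernel_size, strides=(1, 1)):
    """dW per Eq. [5]: cross-correlation of A_prev_pad with dZ."""
    (k1, k2), (s1, s2) = kernel_size, strides
    n_H, n_W, F = dZ.shape
    C = A_prev_pad.shape[2]
    dW = np.zeros((k1, k2, C, F))
    for m in range(n_H):
        for n in range(n_W):
            # Same patch the kernel saw at output position (m, n)
            patch = A_prev_pad[m * s1:m * s1 + k1, n * s2:n * s2 + k2, :]
            for f in range(F):
                dW[..., f] += dZ[m, n, f] * patch
    return dW
```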

The same procedure is followed for the bias:

db^f = Σ_{m,n} dz_{m,n,f} · ∂z_{m,n,f}/∂b^f    [6]

Since ∂z_{m,n,f}/∂b^f = 1 (from Eq. [1]), we therefore have:

db^f = Σ_{m=0}^{n_H-1} Σ_{n=0}^{n_W-1} dz_{m,n,f}    [7]
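
In NumPy, Eq. [7] is a single reduction over the spatial dimensions:

```python
db = dZ.sum(axis=(0, 1))  # db^f = sum over m, n of dz_{m,n,f}
```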

The last thing to perform is the backpropagation of the error: finding the relation between dA_prev and dZ.

Remembering that Eq. [1] relates Z to the padded version of A_prev, we will first compute da_prev_pad.

Using the chain rule (again!), we have:

da_prev_pad_{i,j,c} = Σ_{m,n,f} ∂E/∂z_{m,n,f} · ∂z_{m,n,f}/∂a_prev_pad_{i,j,c}    [8]

We recognize dz_{m,n,f} as the first factor of the sum, which is good. Let's focus on the second factor; expanding z_{m,n,f} with Eq. [1]:

∂z_{m,n,f}/∂a_prev_pad_{i,j,c} = Σ_{m'=0}^{k1-1} Σ_{n'=0}^{k2-1} Σ_{c'=0}^{C-1} w_{m',n',c'}^f · ∂a_prev_pad_{m·s1+m', n·s2+n', c'}/∂a_prev_pad_{i,j,c}

which is not equal to zero if and only if m·s1 + m' = i, n·s2 + n' = j and c' = c.

Therefore:

∂z_{m,n,f}/∂a_prev_pad_{i,j,c} = w_{i−m·s1, j−n·s2, c}^f

(taking w^f to be zero when these indices fall outside the kernel). And so,

da_prev_pad_{i,j,c} = Σ_{m,n,f} dz_{m,n,f} · w_{i−m·s1, j−n·s2, c}^f    [9]

We notice that Eq. [9] describes a convolution where the layer’s filters are considered to be the image, and where dZ is the kernel.

We finally obtain da_prev_{i,j,c} by selecting da_prev_pad_{i+p1,j+p2,c}, p1 and p2 being the padding values around the first and second dimensions for this layer.
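
To close the loop, here is a sketch of Eqs. [8]-[9] and the final cropping, written as a scatter of each dz_{m,n,f} through the kernel weights (equivalent to the convolution above; the names are illustrative):

```python
def conv_backward_input(dZ, W, A_prev_pad_shape, strides=(1, 1), paddings=(0, 0)):
    """dA_prev per Eq. [9], followed by the cropping of the padding."""
    (s1, s2), (p1, p2) = strides, paddings
    k1, k2, C, F = W.shape
    n_H, n_W, _ = dZ.shape
    dA_prev_pad = np.zeros(A_prev_pad_shape)
    for m in range(n_H):
        for n in range(n_W):
            for f in range(F):
                # da_prev_pad_{m*s1+i, n*s2+j, c} += dz_{m,n,f} * w_{i,j,c}^f
                dA_prev_pad[m * s1:m * s1 + k1,
                            n * s2:n * s2 + k2, :] += dZ[m, n, f] * W[..., f]
    # Select da_prev_{i,j,c} = da_prev_pad_{i+p1, j+p2, c}
    n_H_pad, n_W_pad, _ = A_prev_pad_shape
    return dA_prev_pad[p1:n_H_pad - p1, p2:n_W_pad - p2, :]
```

A quick way to check these sketches is to compare each returned gradient against a finite-difference estimate of the error on a small random input.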