The Bread and Butter from Deep Learning by Andrew Ng — Course 4.1: Convolutional Neural Networks

Source: Deep Learning on Medium

Computer Vision

Computer vision is rapidly improving with the help of deep learning. It is applied to things like self-driving cars and face recognition systems, and it can solve problems like image classification, object detection, and style transfer. This revolution in computer vision was only possible with the use of convolution layers. With fully connected layers, taking images as inputs costs far too many parameters. For example, the matrix W connecting a 1000 x 1000 x 3 input (3 million values) to a hidden layer with 1000 neurons has shape (1000, 3 million). Learning 3 billion parameters for just one layer is too computationally expensive. Convolution layers provide a solution to this problem.

Edge Detection

Edge detection is a basic example of the convolution operation, the fundamental element of convolution layers. During convolution, a kernel (filter) moves across the input, and at each stop we compute the sum of the products of the overlapping values. The output of convolution is a matrix/vector collecting the numbers computed at each stop. A vertical edge detector is a kernel with positive numbers on the left and negative numbers on the right. After convolution with an image, high numbers in the resulting matrix tell us which parts of the image have sudden changes in pixel values from left to right. By similar logic, a horizontal edge detector has positive numbers at the top and negative numbers at the bottom. The simplest edge detectors use 1 and -1, but more sophisticated ones like the Sobel filter and the Scharr filter use their own sets of numbers. In deep learning, we learn those numbers by treating them as parameters. Edge detection is usually learned by the earlier layers of the network, because edges are lower-level features.
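The vertical edge detector above can be sketched in a few lines of NumPy. This is a minimal illustration, not an efficient implementation; note that what deep learning calls "convolution" is technically cross-correlation (the kernel is not flipped), which is the convention used here.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution (cross-correlation, as used in deep learning)."""
    n, f = image.shape[0], kernel.shape[0]
    out = n - f + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # Sum of element-wise products of the overlapping region
            result[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return result

# Vertical edge detector: positive numbers on the left, negative on the right
vertical_kernel = np.array([[1, 0, -1],
                            [1, 0, -1],
                            [1, 0, -1]])

# A 6 x 6 image with a sharp vertical edge in the middle
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

print(conv2d(image, vertical_kernel))
```

The output is 4 x 4, with large values in the middle columns, exactly where the pixel values change sharply from left to right.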


Padding

If the convolution kernel is larger than 1 x 1, the resulting matrix has to be smaller than the input. This shrink in size causes information loss, especially at the borders, where pixels are covered by fewer stops of the kernel. Padding the input solves this problem. For the convolution of an (n x n) input, padded with p pixels on each side, with an (f x f) kernel, the output size is (n+2p-f+1) x (n+2p-f+1). There are two typical choices. “Valid” convolution does no padding at all, and “Same” convolution uses just enough padding, p = (f-1)/2 for odd f, so that the output stays the same size as the input.

Strided Convolution

During strided convolution, the step size is larger than 1. This shrinks the output even more, because the kernel makes fewer stops as it moves across the input with a bigger step. With stride s, the size of the output is ⌊(n+2p-f)/s+1⌋ x ⌊(n+2p-f)/s+1⌋, where ⌊·⌋ denotes the floor function.
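The output-size formula, covering both padding and stride, can be checked with a small helper function (the function name is just for illustration):

```python
from math import floor

def conv_output_size(n, f, p=0, s=1):
    """Spatial size of a convolution output: floor((n + 2p - f)/s + 1)."""
    return floor((n + 2 * p - f) / s) + 1

# "Valid" convolution: no padding, the output shrinks
print(conv_output_size(6, 3))            # 4
# "Same" convolution: p = (f - 1) / 2 keeps the size unchanged (odd f, s = 1)
print(conv_output_size(6, 3, p=1))       # 6
# Strided convolution shrinks the output further
print(conv_output_size(7, 3, p=0, s=2))  # 3
```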

Convolution Over Volumes

Most images have channels like RGB, so they have 3-D shapes. Thankfully, the convolution rules extend naturally to this extra dimension. To deal with multiple channels, we use kernels that have the same depth as the image. Because the kernel spans the full depth and slides only along the two spatial dimensions, the convolution of an (n x n x nc) input with a single (f x f x nc) kernel, using padding p and stride s, outputs a 2-D matrix of size ⌊(n+2p-f)/s+1⌋ x ⌊(n+2p-f)/s+1⌋. Intuitively, then, convolution with N kernels outputs a 3-D tensor with N channels.
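A minimal sketch of convolution over volumes, building on the loop-based idea from the edge-detection example (again illustrative, not efficient): each kernel spans all input channels and produces one 2-D output, and stacking N of them gives N output channels.

```python
import numpy as np

def conv3d_single(volume, kernel):
    """Convolve an (n, n, nc) volume with one (f, f, nc) kernel -> 2-D output."""
    n, f = volume.shape[0], kernel.shape[0]
    out = n - f + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # The product runs over the full depth of the volume
            result[i, j] = np.sum(volume[i:i+f, j:j+f, :] * kernel)
    return result

def conv_volume(volume, kernels):
    """Stack the 2-D outputs of N kernels into an (out, out, N) tensor."""
    return np.stack([conv3d_single(volume, k) for k in kernels], axis=-1)

rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6, 3))       # RGB-like 6 x 6 x 3 input
kernels = rng.standard_normal((8, 3, 3, 3))  # 8 kernels, each 3 x 3 x 3

print(conv_volume(image, kernels).shape)     # (4, 4, 8)
```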

Single Convolutional Layer

A Conv layer has two parts. First, the input is convolved with multiple filters and a bias is added to each filter output. Then we pass the result to an activation function like ReLU. This is just like computing Wx+b = Z and then ReLU(Z) in FC layers; we can express the two operations in a Conv layer as W∗x+b = Z and ReLU(Z), where ∗ denotes convolution. For each Conv layer, we can choose the filter size (f), padding (p), stride (s), and number of filters (nc); these are the hyper-parameters of the layer. With nc_prev channels in the input, the size of W is f x f x nc_prev x nc and the size of b is nc x 1. Therefore, the total number of parameters to learn in the layer is (f²·nc_prev + 1)·nc.
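The parameter count works out to a small number regardless of the input's spatial size. A quick check of the formula (function name is illustrative):

```python
def conv_layer_params(f, nc_prev, nc):
    """Parameters of a Conv layer: nc filters of shape f x f x nc_prev, plus nc biases."""
    return (f * f * nc_prev + 1) * nc

# Ten 5x5 filters over a 3-channel input: (5*5*3 + 1) * 10
print(conv_layer_params(5, 3, 10))  # 760
```

Note that the count does not depend on the input's height or width, which is exactly why Conv layers scale to large images where FC layers cannot.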

Convolutional Network Example

In most Conv networks, as we propagate forward, the spatial dimensions of the outputs get smaller while the number of channels gets bigger. Towards the end, for classification purposes, we unfold all the features into a vector and pass it through FC layers. The most common layers used in Conv networks are Conv layers, pooling layers, and FC layers.
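This shape progression can be traced with the output-size formula. Below is a LeNet-5-style walkthrough (the specific layer sizes are illustrative, following the classic example used in the course): spatial size shrinks, channels grow, then everything is flattened for the FC layers.

```python
from math import floor

def conv_out(n, f, p=0, s=1):
    """Spatial size after a conv or pool step: floor((n + 2p - f)/s + 1)."""
    return floor((n + 2 * p - f) / s) + 1

n, nc = 32, 3                  # input image: 32 x 32 x 3
n, nc = conv_out(n, f=5), 6    # CONV1 (six 5x5 filters) -> 28 x 28 x 6
n = conv_out(n, f=2, s=2)      # POOL1 -> 14 x 14 x 6
n, nc = conv_out(n, f=5), 16   # CONV2 (sixteen 5x5 filters) -> 10 x 10 x 16
n = conv_out(n, f=2, s=2)      # POOL2 -> 5 x 5 x 16

print(n, n, nc)                # 5 5 16
print(n * n * nc)              # 400 features flattened for the FC layers
```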

Pooling Layers

A Pool layer reduces the size of its inputs to speed up computation and make features more robust. Pooling a 4 x 4 input with a 2 x 2 filter and a stride of 2 outputs a 2 x 2 matrix. There are two types of pooling operations: max pooling and average pooling. Max pooling, which is used more often, picks the maximum number within the kernel, while average pooling takes the average of the numbers within the kernel. These operations require no parameters; the filter size and stride are the hyper-parameters of each Pool layer.
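Both pooling variants can be sketched with one small function (a minimal loop-based version, for illustration only):

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """Pool an (n, n) input with an f x f window and stride s."""
    n = x.shape[0]
    out = (n - f) // s + 1
    op = np.max if mode == "max" else np.mean
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # Apply max or average over the window; no learned parameters
            result[i, j] = op(x[i*s:i*s+f, j*s:j*s+f])
    return result

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)

print(pool2d(x))                  # max pooling: [[9. 2.] [6. 3.]]
print(pool2d(x, mode="average"))  # average pooling over the same windows
```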

Why Convolutions?

The two main advantages of convolution are parameter sharing and sparsity of connections. A kernel is shared across every section of the input: an edge detector, for example, is useful for detecting edges in any part of the image, with just a few numbers. This is parameter sharing. Sparsity of connections means that each element of the output depends only on a small section of the input. For example, an element in the output of a Conv layer with a 3 x 3 filter depends on only 9 numbers from the input. Together, these properties keep the number of parameters small, which lets us train with fewer samples and helps prevent overfitting. Convolution also makes the network more robust to translations of the input (translation invariance).
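The effect of parameter sharing and sparse connections shows up directly in the parameter counts. A quick comparison on the shapes from the earlier network example (sizes illustrative, following the course's example):

```python
# One Conv layer vs. a fully connected layer mapping between the same shapes
in_size = 32 * 32 * 3      # 3,072 input values
out_size = 28 * 28 * 6     # 4,704 output values (after six 5x5 filters)

# FC: every input connects to every output, plus one bias per output
fc_params = in_size * out_size + out_size
# Conv: six shared filters of size 5 x 5 x 3, plus one bias per filter
conv_params = (5 * 5 * 3 + 1) * 6

print(fc_params)    # over 14 million parameters
print(conv_params)  # 456 parameters
```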