Convolutional Neural Networks — Part 4: The Pooling and Fully Connected Layer




This is the fourth part of my blog post series on convolutional neural networks. Here are the prerequisite parts for this post:

The final part of the series explains why it might be a great idea to use convolutions in a neural network:

1. Pooling Layer

Besides convolutional layers, ConvNets often also use pooling layers to reduce the size of the representation, speed up computation, and make some of the detected features more robust.

1.1 Max Pooling

FIGURE 1: Max pooling input and output image.

Suppose you have a 4 by 4 input and you want to apply a type of pooling called max pooling. The output of this particular implementation of max pooling will be a 2 by 2 output. The way you do that is quite simple. Take your 4 by 4 input and break it into separate regions (colored as the four regions shown in figure 1). Then each element of the 2 by 2 output is just the max of the corresponding shaded region. Notice that in the lower-left green region the biggest number is 6, and in the lower-right red region the biggest number is 3. So to compute each of the numbers in the output on the right, we take the max over a 2 by 2 region. This is as if you applied a filter of size 2 (f = 2), because you take 2 by 2 regions, with a stride of 2 (s = 2), because you take 2 steps to move the filter to the next colored region.
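(A quick note on the sizes, which the figure doesn't spell out: with no padding, the output height and width of pooling follow the same formula as a convolution, floor((n - f) / s) + 1. With n = 4, f = 2 and s = 2, that gives floor((4 - 2) / 2) + 1 = 2, which is why the output is 2 by 2.)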

GIF 1: Max pooling illustration.

As we can see from the GIF illustration, the filter size f and the stride s are the hyperparameters of max pooling: we start with the upper-left 2 by 2 window (indigo colored) and get a value of 9. Then we step over by 2 to the upper-right window (light blue) and get 2 (since 2 is the max value in that light blue region). For the next row, we step the window down 2 steps to get 6, and then take 2 steps to the right to get 3.
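If you'd like to see this arithmetic in code, here is a minimal NumPy sketch of 2D max pooling with f = 2 and s = 2. The 4 by 4 input values are made up (only the four quadrant maxima 9, 2, 6 and 3 are taken from the illustration), and max_pool_2d is a hypothetical helper, not a library function:

import numpy as np

def max_pool_2d(x, f=2, s=2):
    """Max pooling on a 2D array with filter size f and stride s (no padding)."""
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Take the max over the f-by-f window at this position.
            region = x[i * s : i * s + f, j * s : j * s + f]
            out[i, j] = region.max()
    return out

# Hypothetical 4x4 input chosen so its quadrant maxima match the illustration.
x = np.array([
    [1, 3, 2, 1],
    [2, 9, 1, 1],
    [1, 3, 2, 3],
    [5, 6, 1, 2],
])
print(max_pool_2d(x, f=2, s=2))
# [[9. 2.]
#  [6. 3.]]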

If you think of the 4 by 4 region as some set of features, the activations in some layer of the neural network, then a large number means that a particular feature has perhaps been detected (if the input image is a cat, a feature could be a cat's eye, a cat's whisker, a cat's nose, etc.). So what the max operation does is say that as long as a feature is detected anywhere in one of the colored quadrants, it is preserved in the output of max pooling. In other words, if a feature is detected anywhere in the window, keep a high number; if a feature is not detected, perhaps because it simply doesn't exist in the upper-right (light blue) quadrant, then the max of those numbers will be small. In GIF 1 the max of the upper-right quadrant is 2, which is much smaller than the 9 in the upper-left quadrant, so if you're detecting a particular feature such as a cat's eye, it's more likely to be in the upper-left quadrant than in the upper-right quadrant.

I'll paraphrase what Andrew Ng said: I think the main reason people use max pooling is that it has been found to work well in a lot of experiments, and as for the intuition (that a much bigger max in a particular quadrant corresponds to a detected feature), despite it being often cited, I don't know if anyone fully knows whether that is the real underlying reason.

So far, we've seen max pooling on 2D inputs. If you have a 3D input, max pooling is applied to each channel independently, so the output has the same number of channels as the input. For example, if you have a 5 by 5 by 2 input and apply max pooling with f = 3 and s = 1, the output will be 3 by 3 by 2.
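Here is a similar hedged sketch for the 3D case (the function name and the random input are just for illustration; the point is that pooling acts on each channel separately, so the channel dimension carries through unchanged). Assuming f = 3 and s = 1, a 5 by 5 by 2 input maps to a 3 by 3 by 2 output:

import numpy as np

def max_pool(x, f, s):
    """Max pooling applied independently to each channel of an H x W x C input."""
    n_h, n_w, n_c = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w, n_c))
    for c in range(n_c):            # same pooling, one channel at a time
        for i in range(out_h):
            for j in range(out_w):
                out[i, j, c] = x[i * s : i * s + f, j * s : j * s + f, c].max()
    return out

x = np.random.rand(5, 5, 2)         # hypothetical 5 x 5 x 2 input
print(max_pool(x, f=3, s=1).shape)  # (3, 3, 2): the channel count is preserved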