12 Main Dropout Methods: Mathematical and Visual Explanation

Original article was published on Deep Learning on Medium

We can, therefore, see that the greater the weight, the greater the probability that the neuron will be omitted. This powerfully limits the high predictive capacity that some neurons may have.

Gaussian Dropout

The list of dropout methods applied to neural networks continues to grow. So before moving on to something other than DNNs, I would like to talk about a category of dropout methods that is certainly the most fascinating.

To take just a few examples, Fast Dropout [4], Variational Dropout [5] and Concrete Dropout [6] are methods that interpret dropout from a Bayesian perspective. Concretely, instead of a Bernoulli mask, we have a mask whose elements are random variables following a Gaussian (normal) distribution. I won’t go into the demonstration of the Central Limit Theorem, which justifies this Gaussian approximation; that’s not the point. So let’s try to understand this intuitively.

Figure: Dropout with p=0.5

Papers [4], [5] and [6] show that we can simulate a Bernoulli mask for our dropouts with a normal law. But what difference does it make? Everything and nothing at the same time. It changes nothing about how well these methods counter overfitting caused by co-adaptation and/or by the excessive predictive capacity of some neurons. But it changes everything in terms of the execution time required for the training phase compared to the methods presented before.

Logically, when neurons are omitted with a dropout, those omitted on a given iteration are not updated during backpropagation: for that iteration, they do not exist. So the training phase is slowed down, since more iterations are needed before every neuron has been trained enough. With a Gaussian Dropout method, on the other hand, all the neurons are exposed at each iteration and for each training sample. This avoids the slowdown.

Mathematically, the activations are multiplied by a Gaussian mask (for example centered at 1, with the variance p(1-p) of Bernoulli’s law; a Bernoulli mask rescaled to have mean 1 would give a variance of p/(1-p) instead). This simulates dropout by randomly weighting the predictive capacity of the neurons while keeping all of them active at each iteration. Another practical advantage of centering the mask at 1: during the testing phase, no modification is needed compared to a model without dropout.
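To make this concrete, here is a minimal NumPy sketch of a multiplicative Gaussian mask. It is only an illustration, not the implementation from papers [4], [5] or [6]: the function name `gaussian_dropout` and its parameters are my own, and I assume a variance of p/(1-p), which is the variance of a Bernoulli mask rescaled to mean 1. Any other variance could be plugged in.

```python
import numpy as np

def gaussian_dropout(x, p=0.5, rng=None, training=True):
    """Multiply activations by Gaussian noise centered at 1.

    Assumption: variance p / (1 - p), i.e. the variance of a Bernoulli
    keep-mask rescaled to have mean 1. Other choices are possible.
    """
    if not training or p == 0.0:
        # The noise has mean 1, so nothing needs to change at test time.
        return x
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(p / (1.0 - p))
    noise = rng.normal(loc=1.0, scale=sigma, size=x.shape)
    return x * noise

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
noisy = gaussian_dropout(activations, p=0.5, rng=rng)
```

Every activation is perturbed but none is forced to zero, and on average the output stays equal to the input, which is why the test phase needs no rescaling.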

Pooling Dropout

The “difficult” part of this article is over. What remains is the more intuitive part, which also gives us better performance.

The issue with images or feature maps is that pixels are very dependent on their neighbors. To put it simply, on a cat picture, if you take a pixel that corresponds to its coat then all the neighboring pixels will correspond to the same coat. There is little or no difference.

So we understand the limits of the Standard Dropout method here. We could even say it is inefficient and that the only change it brings is additional computation time. If we randomly omit pixels in an image, almost no information is removed, because the omitted pixels are nearly the same as their surroundings. As a result, it does a poor job of preventing overfitting.

Why not take advantage of the layers that are specific to CNNs and often used in them? For example, the Max Pooling Layer. For those who don’t know: the Max Pooling Layer is a filter passed over a picture (or feature map) that selects the maximum activation in the region it covers.

Max-Pooling Dropout [7] is a dropout method applied to CNNs proposed by H. Wu and X. Gu. It applies a Bernoulli mask directly to the activations covered by the Max Pooling Layer kernel before performing the pooling operation. Intuitively, this reduces how often the highest activations are the ones that get pooled. It is a very good way to limit the heavy predictive capacity of some neurons. During the test phase, you can then weight by the probability of presence, as for the previous methods.
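The training phase of this idea can be sketched in a few lines of NumPy. This is an illustration under my own assumptions rather than the code from [7]: the function name `max_pool_dropout` is hypothetical, the pooling windows do not overlap (stride equal to the window size), and the activations are assumed non-negative (for example after a ReLU), so that a dropped activation set to 0 never wins the maximum.

```python
import numpy as np

def max_pool_dropout(fmap, p=0.5, size=2, rng=None, training=True):
    """2D max pooling (stride = size) with a Bernoulli mask applied first.

    fmap: 2D array (H, W) with H and W divisible by `size`.
    p:    probability of omitting each activation.
    """
    h, w = fmap.shape
    if training:
        rng = np.random.default_rng() if rng is None else rng
        keep = rng.random(fmap.shape) >= p  # True with probability 1 - p
        fmap = fmap * keep                  # dropped activations become 0
    # Split into non-overlapping size x size windows, take each maximum.
    windows = fmap.reshape(h // size, size, w // size, size)
    return windows.max(axis=(1, 3))

fmap = np.arange(16.0).reshape(4, 4)
plain = max_pool_dropout(fmap, training=False)  # ordinary max pooling
dropped = max_pool_dropout(fmap, p=0.5, rng=np.random.default_rng(0))
```

With the mask on, each window returns the largest activation that survived, so a strong activation is sometimes replaced by a weaker neighbor. The test-phase weighting is deliberately left out of this sketch.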

The Max Pooling Layer has been taken as an example, but the same could be done with other Pooling Layers. For example, with the Average Pooling Layer, we could apply a dropout in the same way during the training phase. Then in the test phase, there would be no change since it is already a weighted average.

Spatial Dropout

For CNNs, we can take advantage of the Pooling Layers. But we can also be smarter by following the Spatial Dropout [8] method proposed by J. Tompson et al. They set out to overcome a weakness of classical dropout methods on images: adjacent pixels are highly correlated.

Instead of randomly applying a dropout to individual pixels, we can think about applying a dropout per feature map. If we take the example of our cat, this is like removing the red channel from the image and forcing the network to generalize from the blue and green channels. Then, on the next iterations, other feature maps are randomly dropped.

I did not find a way to write this in mathematical notation that stays intelligible. But if you understood the previous methods, you won’t have any trouble with it. In the training phase, a Bernoulli mask is applied per feature map with a probability of omission p. Then during the testing phase, there is no dropout but a weighting by the probability of presence 1-p.
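In place of equations, a short NumPy sketch may help. It follows the description above (one Bernoulli draw per feature map, weighting by 1-p at test time); the function name `spatial_dropout` and the channel-first layout (C, H, W) are my own assumptions, not something taken from [8].

```python
import numpy as np

def spatial_dropout(x, p=0.5, rng=None, training=True):
    """Drop entire feature maps.

    x: array of shape (C, H, W), one feature map per channel.
    p: probability of omitting each feature map.
    """
    if not training:
        # Test phase: no dropout, weight by the probability of presence.
        return x * (1.0 - p)
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(x.shape[0]) >= p  # one draw per feature map
    return x * keep[:, None, None]      # broadcast over H and W

x = np.ones((8, 4, 4))
out = spatial_dropout(x, p=0.5, rng=np.random.default_rng(0))
per_map = out.reshape(8, -1).sum(axis=1)  # 0 if dropped, 16 if kept
```

Each feature map is either kept whole or zeroed whole. Note that many frameworks instead divide by 1-p during training ("inverted" dropout) so that the test phase needs no weighting; the two conventions give the same expected activations.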


Cutout

Let’s go deeper into our approach to overcome the fact that adjacent pixels are highly correlated. Instead of applying Bernoulli masks per feature map, we can apply them to areas. This is the Cutout method [9] proposed by T. DeVries and G. W. Taylor.

Taking the example of our cat image one last time: this method makes it possible to generalize, and thus limit overfitting, by hiding areas of the image. We end up with images where the cat’s head is hidden. This forces the CNN to recognize the less obvious attributes describing a cat.

Again, no math in this section. This method depends a lot on our imagination: square areas, rectangles, circles, on all feature maps, on one at a time or possibly on several… It’s up to you. 😃
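As one possible instance among all these variants, here is a minimal NumPy sketch that hides a single square area across all channels of an input image. The function name `cutout`, the default patch size and the (H, W, C) layout are assumptions made for the illustration; the patch is simply clipped when its center falls near a border.

```python
import numpy as np

def cutout(image, size=8, rng=None):
    """Return a copy of `image` (H, W, C) with one square area set to 0."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    # Random center anywhere in the image.
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    # Square around the center, clipped to the image borders.
    y0, y1 = max(cy - size // 2, 0), min(cy + size // 2, h)
    x0, x1 = max(cx - size // 2, 0), min(cx + size // 2, w)
    out = image.copy()
    out[y0:y1, x0:x1] = 0  # hide the area on every channel
    return out

image = np.ones((32, 32, 3))
masked = cutout(image, size=8, rng=np.random.default_rng(0))
hidden = int((masked == 0).all(axis=2).sum())  # number of hidden pixels
```

Swapping the square for a rectangle or a circle, or applying the mask to a subset of feature maps, only changes how the zeroed region is selected.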


Max-Drop

Finally, to conclude this section on CNNs, I must point out that obviously several methods can be combined. This is what makes us strong when we know the different methods: we can take advantage of their benefits at the same time. This is what S. Park and N. Kwak propose with their Max-Drop method [10].

This approach is in a way a mixture of Pooling Dropout and Gaussian Dropout. The dropout is performed on the Max Pooling Layer but with a Bayesian approach.

In their paper, they show that this method gives results as good as those of Spatial Dropout. In addition, all the neurons remain activated at each iteration, which limits the slowdown during the training phase.
These results were obtained with µ = 0.02 and σ² = 0.05.