Source: Deep Learning on Medium
Dropout, weight decay and data augmentation have all become part and parcel of every CNN developer’s standard toolkit. The assumption has been that each contributes, in a hopefully synergistic manner, towards producing an optimal CNN. However, recent research by Hernández-García and König (https://arxiv.org/abs/1806.03852v4) tested this assumption and found that a much better development approach is to rely entirely on data augmentation to produce self-regularized CNNs, and to forgo dropout and weight decay.
They find that while weight decay and dropout do help regularization, their average effect is a 3.06% improvement in accuracy, whereas light augmentation *alone* improves accuracy by an average of 8.46%.
Further, when comparing augmentation *alone* against augmentation plus weight decay and dropout (the standard tool set), augmentation alone equals or betters the performance of the combination, with figures of 8.57% and 7.90% on different tests.
In other words, dropout and weight decay are a crutch that ultimately produces less-than-optimal CNNs, and the optimal strategy is to replace both with heavier use of data augmentation.
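To make "light augmentation" concrete, here is a minimal sketch in plain Python. It assumes that light augmentation means a random horizontal flip plus a small random translation; the 10% shift limit, the zero padding, and every function name here are illustrative assumptions, not code from the paper.

```python
import random


def horizontal_flip(image):
    """Mirror a 2D image (a list of rows) left to right."""
    return [row[::-1] for row in image]


def translate(image, dx, dy, fill=0):
    """Shift a 2D image by dx columns and dy rows, padding with `fill`."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx
            # Copy only the pixels that are still inside the frame after the shift.
            if 0 <= src_y < height and 0 <= src_x < width:
                shifted[y][x] = image[src_y][src_x]
    return shifted


def light_augment(image, rng, max_shift_fraction=0.1):
    """Return a randomly flipped and slightly translated copy of `image`.

    max_shift_fraction is an assumed value for illustration.
    """
    height, width = len(image), len(image[0])
    max_dx = int(width * max_shift_fraction)
    max_dy = int(height * max_shift_fraction)
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    dx = rng.randint(-max_dx, max_dx)
    dy = rng.randint(-max_dy, max_dy)
    return translate(image, dx, dy)


# Usage: draw a fresh variant of the same image each time it is sampled.
rng = random.Random(0)
image = [[row * 10 + col for col in range(10)] for row in range(10)]
augmented = light_augment(image, rng)
```

In a real training loop the same idea would be applied on the fly to every batch, so the network rarely sees exactly the same input twice.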
Below is a chart showing the performance comparisons:
Perhaps the most interesting explanation of why augmentation alone proves superior is this quote from the paper:
“recent work has found that models trained with heavier data augmentation learn representations that are more similar to the inferior temporal (IT) cortex, highlighting the biological plausibility of data augmentation.”
In other words, augmentation helps to mold the CNN towards better mimicking the human visual process, and thus towards better generalization.
What’s the problem with dropout (or explicit regularization)? Dropout has been used for some time to improve generalization. However, as Hernández-García and König point out, this comes at the expense of:
1 — Blindly reducing a network’s capacity (dropout randomly dumbs the network down in order to regularize it). Also of note: most people apply dropout right from the start of training. Negative co-adaptations clearly require time and training to form, so the idea that freshly initialized weights are already overfit at the outset is a hard argument to win. Or, as this paper on curriculum dropout puts it, a ‘sub-optimal’ choice (https://arxiv.org/abs/1703.06229).
2 — Introducing model-sensitive hyperparameters. Dropout forces the model to mold around the architecture itself rather than the data per se, and thus creates model-sensitive hyperparameters that make generalization more fragile.
3 — Ironically, the dumbing down then forces deeper and/or wider models to compensate for item 1, the blind reduction in capacity.
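The "blind" part of the first point is easier to see in code. Below is a minimal sketch of standard inverted dropout in plain Python; the function name and the `drop_prob` parameter are illustrative and not taken from the paper. Note that the mask depends only on random draws, never on the data or on how useful a unit is, and that `drop_prob` is exactly the kind of hyperparameter that has to be retuned per architecture.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout over a flat list of activations (illustrative sketch)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        # At inference time the full network is used unchanged.
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Each unit is zeroed independently of its value; survivors are scaled
    # by 1 / keep_prob so the expected activation stays the same.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


# Usage: with drop_prob=0.5 roughly half of the units are silenced per pass.
rng = random.Random(0)
hidden = [1.0] * 1000
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
active_fraction = sum(1 for a in dropped if a != 0.0) / len(dropped)
```

During each training step the network effectively works with only about `1 - drop_prob` of its units, which is the capacity reduction described in the list above.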
However, when leveraging only data augmentation, Hernández-García and König show that you:
1 — Increase generalization as the total number of data points grows, allowing the network to achieve its own regularization.
2 — Avoid model-sensitive hyperparameters.
3 — Do not reduce the working capacity of the CNN.
4 — Most importantly, CNNs trained with augmentation alone match or outperform those trained with the usual dropout/weight decay/augmentation combination.
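As a rough illustration of the first point, the sketch below counts how many distinct variants a single image can yield under a simple augmentation scheme. The scheme itself (an optional horizontal flip plus integer-pixel translations of up to 10% of the image size) and the function name are assumptions made for this example, not figures from the paper.

```python
def count_light_variants(width, height, max_shift_fraction=0.1, allow_flip=True):
    """Upper bound on distinct variants of one image under flips and shifts."""
    max_dx = int(width * max_shift_fraction)
    max_dy = int(height * max_shift_fraction)
    # Shifts range over -max..+max pixels on each axis, including zero.
    translations = (2 * max_dx + 1) * (2 * max_dy + 1)
    flips = 2 if allow_flip else 1
    return flips * translations


# Usage: a 32x32 image with shifts of up to 3 pixels in each direction.
variants_per_image = count_light_variants(32, 32)  # 2 * 7 * 7 = 98
```

Under these assumptions, a hypothetical dataset of 50,000 such images could present up to 4,900,000 different inputs to the network, with no change to the architecture and no extra hyperparameter tied to it.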
Summary: Hernández-García and König elegantly summarize their findings as:
“we have empirically shown that explicit regularization is not only unnecessary, but also that its generalization gain can be achieved by data augmentation alone.”
Reading the paper in its entirety is time well spent, as it is a fundamental rethink of current default practices: https://arxiv.org/abs/1806.03852v4
With this evidence that augmentation is the key to better generalization and better CNNs, the next article(s) will review:
1 — CutMix as possibly the new and optimal default augmentation method (outperforming MixUp and Cutout).
2 — BO-Aug, a framework for selecting optimal data augmentation strategies for a dataset at as low a computational cost as possible (very strong performance in testing).