Original article was published on Deep Learning on Medium
Summarising Research: Understanding Deep Learning Requires Rethinking Generalisation
I wrote this article while applying for the Google Research India AI Summer School 2020, where I was asked to summarise the Best Paper of ICLR (International Conference on Learning Representations) 2017. Here is the summary which I submitted.
The paper: “Understanding deep learning requires rethinking generalization” (Zhang et al., ICLR 2017).
Abstract (as given in the paper)
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice. We interpret our experimental findings by comparison with traditional models.
Consider a typical artificial neural network, e.g. a convolutional neural network: the number of learnable (trainable) parameters is far greater than the number of samples it is trained on. This gives the network enough capacity for brute-force memorization of the entire training set. Despite this, deep neural network models exhibit remarkably small generalization error, while at the same time it is easy to come up with model architectures that generalize poorly. This challenges our understanding of deep neural networks, because we do not know how to distinguish between the two cases. This paper argues for rethinking generalization.
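To make the parameter-versus-sample imbalance concrete, here is a minimal sketch; the layer widths are illustrative (a small MLP on flattened CIFAR-10-sized inputs), not taken from the paper:

```python
# Count the learnable parameters of a fully connected network
# given its layer widths (one weight matrix plus one bias vector per layer).
def mlp_param_count(layer_sizes):
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# A small MLP on flattened 32x32x3 images (CIFAR-10-like input).
params = mlp_param_count([3072, 512, 512, 10])
print(params)           # 1841162 parameters
print(params > 50_000)  # True: far more parameters than CIFAR-10's 50,000 training images
```

Even this modest hypothetical network has roughly 37 parameters per training example, which is the regime in which brute-force memorization becomes possible.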
Why rethink generalisation?
Rethinking generalization can help us increase the interpretability of neural networks, come up with more reliable and better model architectures, and give better theoretical explanations for a model’s performance on training and test sets. In my view, rethinking generalization is a great yet complex undertaking, because it requires us to explore neural networks from a completely different perspective.
Certain questions come up when we do this. They are:
1) What distinguishes neural networks that generalize well from those that don’t?
2) Can generalization be explained by traditional theoretical approaches?
3) How can we understand the effective model capacity of feed-forward neural networks and its effect on generalization ability?
Answers to the above questions, as explained in the paper
In the paper, the researchers problematize the traditional view of generalization by showing that it is not capable of distinguishing between neural networks that have radically different generalization performance. They did this by running randomization tests, examining the role of explicit and implicit regularizers in generalization, and analyzing the expressive power of neural nets on a finite sample. The core idea of their methodology is a variant of the well-known randomization test from non-parametric statistics.
Fitting random labels and pixels
In the first set of experiments, a candidate architecture was trained both on the true data and on a copy of it in which the true labels were replaced by random labels. In the second set of experiments, the images themselves were replaced by random pixels. The researchers observed that with random labels, once fitting starts it converges quickly and fits the training set perfectly, without any change to the learning rate schedule. They also experimented with label corruption, varying the degree of randomization from no noise to full noise. A significant slowdown in training was expected but not observed, and the test (generalization) error converges to 90% as the label corruption approaches one.
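The label-corruption step of this randomization test can be sketched as follows. This is a hypothetical helper, not the authors’ code: for a corruption fraction p, a random subset of p·n labels is replaced with labels drawn uniformly at random over the classes.

```python
import numpy as np

def corrupt_labels(labels, fraction, num_classes, rng):
    """Replace a random `fraction` of labels with uniformly random labels."""
    labels = labels.copy()
    n_corrupt = int(round(fraction * len(labels)))
    idx = rng.choice(len(labels), size=n_corrupt, replace=False)
    labels[idx] = rng.integers(0, num_classes, size=n_corrupt)
    return labels

rng = np.random.default_rng(0)
true = rng.integers(0, 10, size=1000)       # stand-in for CIFAR-10 labels
noisy = corrupt_labels(true, fraction=1.0, num_classes=10, rng=rng)
# With full corruption, agreement with the true labels drops to chance (~1/10),
# which is why test error approaches 90% on a 10-class problem.
print((noisy == true).mean())
```

Sweeping `fraction` from 0 to 1 reproduces the interpolation between the true dataset and a fully random one described above.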
Role of Explicit Regularizers (Dropout, Weight Decay, Data Augmentation)
The researchers tried to decode the role of explicit regularizers in deep learning by testing AlexNet, Inception, and MLP architectures on the CIFAR-10 and ImageNet datasets with explicit regularization turned on and off. The results for both CIFAR-10 and ImageNet show very little difference. To understand the role of implicit regularizers, they ran various tests and found that early stopping could potentially improve generalization on ImageNet but is not necessarily helpful on CIFAR-10, while batch normalization improves generalization. They found that bigger gains could be achieved simply by changing the model architecture than by using any implicit or explicit regularization technique. They concluded that regularizers help to marginally improve generalization performance but are not the fundamental reason for achieving it.
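Of the explicit regularizers mentioned, weight decay is the easiest to state precisely: it adds a term proportional to the current weights to each gradient update, shrinking them toward zero. A minimal sketch of one such update (illustrative only, not the paper’s training code):

```python
import numpy as np

def sgd_step(w, grad, lr, weight_decay=0.0):
    """One SGD update; weight_decay=0.0 turns the regularizer off."""
    return w - lr * (grad + weight_decay * w)

w = np.array([1.0, -2.0])
g = np.zeros(2)
# With a zero gradient, weight decay alone shrinks the weights toward zero.
print(sgd_step(w, g, lr=0.1, weight_decay=0.5))  # [ 0.95 -1.9 ]
```

The paper’s on/off experiments amount to comparing training runs with `weight_decay` (and dropout, data augmentation) enabled versus set to zero, and finding that the gap in generalization is small.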
The conclusion I would like to draw is that statistical learning theory struggles to explain the generalization ability of neural networks. The reasons optimization is easy must be different from the cause of generalization. The findings imply that these models are rich enough to memorize the training data, and that the traditional view is incapable of explaining their generalization ability; SGD, meanwhile, is found to have self-regulating properties for which a precise formal measure is yet to be discovered. This paper is considered important because it shows that deep neural networks learn random datasets by memorizing them, which means zero generalization, and this raises the question of how they learn non-random datasets.

It is well known among experts that a high-capacity parametric model with a well-conditioned optimization setup (ReLU activations, batch normalization, high-dimensional spaces) will absorb the input data as it is. I think of deep neural network optimization as an extremely time-consuming but powerful optimizer: it will discover semantically meaningful feature hierarchies if the right model biases are present and compatible with the input data, but if that solution is not convenient to optimize, the network is perfectly happy to optimize in a way that simply memorizes the data.
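One concrete instance of SGD’s self-regulating behaviour, which the paper discusses for linear models, is that in the underdetermined case (more parameters than samples) SGD started from zero converges to the minimum-ℓ2-norm solution that interpolates the data, i.e. the pseudoinverse solution. A sketch with made-up data:

```python
import numpy as np

# Underdetermined linear system: 2 samples, 3 parameters.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
y = np.array([1.0, 2.0])

# The pseudoinverse gives the minimum-l2-norm interpolating solution,
# which is the one SGD (initialized at zero) converges to for linear models.
w_min = np.linalg.pinv(X) @ y
print(w_min)  # [1. 2. 0.]

# Any other interpolating solution, e.g. one with a null-space component
# added, still fits the data but has a strictly larger norm.
w_other = w_min + np.array([0.0, 0.0, 5.0])
print(np.allclose(X @ w_other, y))                      # True
print(np.linalg.norm(w_other) > np.linalg.norm(w_min))  # True
```

So even without an explicit regularization term, the optimizer itself prefers one particular solution among the infinitely many that fit the training data, which is one candidate mechanism behind the "self-regulating" behaviour noted above.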