Does over-parametrized mean overfitted? Here is the answer you need.


Throughout the development of machine learning and statistical learning, obtaining a model with better generalization has always been the bread and butter. Generalization largely defines the applicability of an intelligent agent in the real world. If an agent is truly artificially intelligent, it should be able to generalize abstract concepts and features the way humans do.

In the classical world, experts proposed the bias-variance trade-off to explain, to a certain extent, how to obtain a well-generalized model. With the advancement of computation technologies, we are now able to train and deploy models with far more parameters and computational complexity at a decent speed. Researchers found that generalization seems to behave differently in those large models: in fact, large models actually generalize better.

This is quite strange by classical wisdom. A large model is usually over-parametrized, especially a deep neural network. The old teaching tells us that an over-parametrized model should overfit the seen data. But empirically, large models not only avoid overfitting, they can even beat the optimal models of the under-parametrized regime. This is one of the mysteries of deep learning.

This article attempts to give an overview and an in-depth explanation that bridges the classical perspective and the new one.

Classical Wisdom — Bias-Variance Tradeoff

In the context of supervised learning, bias is the expected difference between the prediction and the true value, while variance is the expected spread of the predictions.
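For squared-error loss, these two quantities appear in the classical decomposition of the expected test error (a standard result, written here for a fixed input x with irreducible noise variance sigma squared):

```latex
% Bias-variance decomposition of the expected squared error at a fixed x,
% where f is the true function, \hat{f} the learned model, and \sigma^2
% the irreducible noise variance:
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}
  + \sigma^2
```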

Usually, on one hand, a high-bias model suffers from underfitting, as high bias indicates that the model has not yet learned the feature patterns of the data space. On the other hand, a high-variance model suffers from overfitting, as high variance shows that the model is very sensitive to the noise in the seen data.

In the classical perspective, as we increase the model size, bias and variance move in opposite directions: when we try to reduce one of them, the other rises. So there should exist one optimal model size that balances both and yields a model with relatively good generalization.

We therefore need to make a tradeoff between bias and variance to find the model that generalizes well. This phenomenon is well illustrated by the plot below, in which the model is a k-NN regression on some synthetic data. It is pretty clear that the optimal model is the one with k = 7.

A k-NN regression model on some synthetic data. A smaller k means higher model complexity. The yellow curve is the test error, while the green and blue curves are the bias and variance respectively. Figure from ‘The Elements of Statistical Learning: Data Mining, Inference, and Prediction’.
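To make the tradeoff concrete, here is a minimal sketch (not from the original article) that sweeps k on synthetic data with scikit-learn; the data-generating function and the values of k are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Illustrative synthetic data: a noisy sine wave.
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=200)
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

# Smaller k = higher model complexity (lower bias, higher variance).
for k in [1, 3, 7, 15, 31, 63]:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    mse = np.mean((model.predict(X_test) - y_test) ** 2)
    print(f"k={k:>2}  test MSE={mse:.3f}")
```

On data like this, the test error typically dips at an intermediate k and rises on both sides, tracing the U-shaped curve of the figure above.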

Mystery Beyond The Classical Wisdom

According to the bias-variance tradeoff described above, once we pass the optimal model complexity, increasing it further worsens generalization, as the variance starts to grow dramatically and the decrease in bias no longer helps much.

However, the results we have seen from deep neural networks in recent years do not seem to obey this rule. With the advancement of computation technologies, the complexity of state-of-the-art models across various domains keeps increasing. Year after year, one of the most straightforward ways to get a better model has been to increase the model complexity with an appropriate architecture. Although different mechanisms and theories lie behind the effect of model complexity in different domains, this is unintuitive to experts from classical statistical learning who are new to the development of modern AI and data science.

A classical model for MNIST recognition. Figure from “Reconciling modern machine-learning practice and the classical bias-variance trade-off”.

When we go beyond the under-parametrized regime, generalization is still poor at the point where the model complexity has just exceeded the interpolation threshold, but it keeps improving as we keep increasing the complexity, eventually converging toward an optimum. Moreover, this phenomenon does not only happen in deep neural networks; it actually also occurs in classical learning models.

The figures I quoted above show an RFF (random Fourier features) model, which you can treat as a two-layer fully-connected neural network with the parameters of one layer frozen, applied to MNIST recognition. It is not hard to see that “the bigger, the better” also works for this classical model.
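As a rough illustration of that setup (not the exact experiment from the paper), the sketch below builds random Fourier features with a frozen random first layer and trains only a linear readout; the synthetic data, dimensions, and bandwidth are assumptions for demonstration:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic stand-in for flattened images: n samples, d input dims.
n, d = 1000, 64
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)  # noisy target

# Random Fourier features z(x) = sqrt(2/D) * cos(W^T x + b):
# a two-layer network whose first (random) layer is frozen,
# so only the linear readout below is trained.
D, sigma = 2000, 5.0                 # number of random features, bandwidth
W = rng.normal(0.0, 1.0 / sigma, size=(d, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Sweeping D from small to large is what moves this model across
# the interpolation threshold in experiments like the one above.
model = Ridge(alpha=1e-8).fit(Z, y)
print("train MSE:", np.mean((model.predict(Z) - y) ** 2))
```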

The New World — Double Descent

“Double descent” is the name of the phenomenon we observe when we keep increasing model complexity and track generalization ability, and it appears over a wide range of application domains. Across the models we have examined, double descent does not seem to be just a ‘local’ phenomenon; it actually seems to exist universally.

In the figure quoted above, for a typical deep neural network for computer vision, as the model complexity keeps increasing, the test error first decreases as the model parameters adapt to the data features. After the sweet spot proposed by classical wisdom, the test error starts rising and generalization keeps worsening. However, once the complexity exceeds the interpolation threshold, the mystery happens: as long as we keep increasing the model complexity, the test error keeps decreasing, and beyond a certain complexity it even becomes smaller than at the sweet spot we found within the under-parametrized regime.
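If you want to see the curve yourself, here is a minimal, hedged sketch: minimum-norm least squares on frozen random ReLU features over synthetic data often reproduces the double descent shape, with a test-error spike near the interpolation threshold (D == n_train) and a second descent beyond it. All sizes and the data-generating process are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression task: noisy linear target.
n_train, n_test, d = 100, 1000, 20
X = rng.normal(size=(n_train + n_test, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n_train + n_test)
Xtr, ytr = X[:n_train], y[:n_train]
Xte, yte = X[n_train:], y[n_train:]

V = rng.normal(size=(d, 2000))            # one fixed random first layer
for D in [10, 50, 90, 100, 110, 200, 500, 2000]:
    Ztr = np.maximum(Xtr @ V[:, :D], 0)   # frozen random ReLU features
    Zte = np.maximum(Xte @ V[:, :D], 0)
    w = np.linalg.pinv(Ztr) @ ytr         # minimum-norm (interpolating) solution
    print(D, np.mean((Zte @ w - yte) ** 2))
```

The test error typically peaks around D = 100 (where the features just barely interpolate the training set) and falls again as D grows well past it.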

The above example demonstrates nicely why increasing the model complexity almost always works when we want to obtain a better model. The explanation for this phenomenon is not yet confirmed; it needs more time for investigation and for theories to be constructed.

But there is one pretty interesting thing we can see. In the above figure, the training error enters its optimal region after the interpolation threshold, while the test error enters a region where generalization keeps improving. It may be that once the complexity is large enough to fit the seen data, the excess capacity and expressivity brought by over-parametrization enable the model to attend to and generalize the implicit patterns and features of the seen data. Since learning those abstract features is the crucial criterion that determines whether the model can transfer the inductive bias it gains from the training data to unseen data, generalization keeps improving as the model somehow acquires the global ability to learn the task we assign to it across the data space.

One more interesting thing: we can treat model complexity as an analogue of human brain maturity. The under-parametrized regime is like the exploration period when a human learns a new concept. During this period, we stick to generalizing some obvious common patterns about the objects, and that keeps us improving as we learn. But as we grow, the greater maturity of our brains prompts us to think outside the box and to start questioning the conclusions we reached before. Finally, after we fully explore the things we have seen, we are able to see the full picture and generalize the global features that constitute the concept we need to learn. Since the features we derive in this period are global, we perform better and better. This period corresponds to the over-parametrized regime.

Although the above perspective is not yet proven, it is a very interesting link between data-driven models and humans.

Conclusion

So far we have discovered a new shape of generalization as a function of model complexity, and we have seen that the old wisdom describes a narrower, local phenomenon compared with this new one. But there is still a lot of work to do for further investigation.

There are still many discoveries about double descent, along with the practical skills and knowledge they bring to practitioners, that are not discussed here. Those discoveries may well blow your mind and bring you plenty of benefit when you want to build and train a better model. I shall leave them for later writings so as not to make this article too lengthy.

Reference:

Hastie, T., Tibshirani, R., & Friedman, J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction. https://web.stanford.edu/~hastie/Papers/ESLII.pdf

Belkin, M., Hsu, D., Ma, S., & Mitra, S., Reconciling modern machine-learning practice and the classical bias-variance trade-off. https://arxiv.org/abs/1812.11118