Original article was published on Deep Learning on Medium

# Does over-parametrized mean overfitted? Here is the answer you need.

Throughout the development of machine learning and statistical learning, obtaining models that generalize well has always been the bread and butter of the field. Generalization largely determines the applicability of an intelligent agent in the real world. If an agent is truly artificially intelligent, it should be able to generalize abstract concepts and features the way humans do.

In the classical world, experts proposed the bias-variance trade-off to explain, to a certain extent, how to obtain a well-generalized model. With the advancement of computation technology, we can now train and deploy models with far more parameters and computational complexity at a decent speed. Researchers found that generalization seems to behave differently in those large models. In fact, large models actually generalize better.

This is quite weird from the viewpoint of classical wisdom. A large model is usually over-parametrized, especially a deep neural network. The old teaching tells us that an over-parametrized model should overfit the seen data. But empirically, large models are not only able to avoid overfitting, they can even outperform the optimal model of the under-parametrized regime. This is one of the mysteries of deep learning.

This article attempts to give an overview and an in-depth explanation that bridges the classical and the new perspectives.

## Classical Wisdom — Bias-Variance Tradeoff

In the context of supervised learning, bias is the expected difference between the prediction and the true value, while variance is the expected spread of the prediction around its mean.
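In symbols, for a learned estimator $\hat f$ of a true function $f$ at a point $x$ (the standard textbook definitions, written out here for concreteness), with the expectation taken over random training sets:

```latex
\mathrm{Bias}\big[\hat f(x)\big] = \mathbb{E}\big[\hat f(x)\big] - f(x),
\qquad
\mathrm{Var}\big[\hat f(x)\big] = \mathbb{E}\Big[\big(\hat f(x) - \mathbb{E}[\hat f(x)]\big)^2\Big]
```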

Usually, on one hand, a high-bias model suffers from underfitting, as high bias indicates that the model has not yet learned the feature patterns of the data space. On the other hand, a high-variance model suffers from overfitting, as high variance shows that it is very sensitive to the noise in the seen data.

In the classical perspective, as we increase the model size, bias and variance move in opposite directions: reducing one makes the other rise. So there exists an optimal model size that balances both of them and yields a model with relatively good generalization.
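This trade-off shows up directly in the classical decomposition of the expected squared error for $y = f(x) + \varepsilon$ with noise variance $\sigma^2$ (a standard result, stated here to make the "balance" precise):

```latex
\mathbb{E}\Big[\big(y - \hat f(x)\big)^2\Big]
= \mathrm{Bias}\big[\hat f(x)\big]^2 + \mathrm{Var}\big[\hat f(x)\big] + \sigma^2
```

Since the noise term $\sigma^2$ is irreducible, minimizing the error means jointly minimizing the squared bias and the variance, which pull in opposite directions as complexity grows.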

We therefore need to trade off bias against variance to find the model with the best generalization. This phenomenon is well illustrated by the plot below, where the model is a k-NN regression on some synthetic data. It is pretty clear that the optimal model is the one with k = 7.
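The k-NN experiment can be reproduced in a few lines. This is a minimal sketch on my own toy data (a noisy sine wave, not the article's dataset), so the best k it finds need not be 7; the point is only that test error is minimized at some intermediate k, with small k overfitting and large k underfitting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data: y = sin(2*pi*x) + Gaussian noise.
def make_data(n):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_data(200)
x_test, y_test = make_data(500)

def knn_predict(x_query, x_train, y_train, k):
    # For each query point, average the targets of its k nearest neighbors.
    dists = np.abs(x_query[:, None] - x_train[None, :])
    idx = np.argsort(dists, axis=1)[:, :k]
    return y_train[idx].mean(axis=1)

# Test MSE for each k; small k = high variance, large k = high bias.
errors = {k: np.mean((knn_predict(x_test, x_train, y_train, k) - y_test) ** 2)
          for k in range(1, 26)}
best_k = min(errors, key=errors.get)
```

Sweeping `k` and plotting `errors` reproduces the U-shaped test-error curve that the bias-variance tradeoff predicts.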

## Mystery Beyond The Classical Wisdom

According to the bias-variance tradeoff described above, increasing the model complexity past the optimum worsens generalization, as the variance starts to increase dramatically while the decrease in bias no longer helps much.

However, the observations gathered from deep neural networks in recent years do not seem to obey this rule. With the advancement of computation technology, the complexity of state-of-the-art models in various domains keeps increasing. Year after year, one of the most straightforward ways to get a better model has been to increase model complexity with an appropriate architecture. Although different mechanisms and theories explain the effect of increased model complexity in different domains, the trend is unintuitive to experts from classical statistical learning who are new to the development of modern AI and data science.

When we go beyond the under-parametrized regime, generalization is still poor when the model complexity just exceeds the interpolation threshold, but it keeps improving as we keep increasing the complexity, eventually converging toward the optimum. Moreover, this phenomenon does not just happen in deep neural networks; it also occurs in classical learning models.

The figures quoted above show an RFF (random Fourier features) model, which you can treat as a two-layer fully-connected neural network whose first layer is frozen, applied to MNIST digit recognition. It is not hard to see that "the bigger, the better" also works for this classical model.