Original article can be found here (source): Deep Learning on Medium
Beyond overfitting: when more data hurts your AI model.
Double descent & deep learning: why your intuition could be wrong.
When training a deep learning model, you might expect that:
- a bigger dataset,
- more parameters,
- limiting the number of epochs
will improve your model’s score.
These intuitions can fail, as demonstrated in the paper “DEEP DOUBLE DESCENT” released by Preetum Nakkiran et al. last December. The following graphs are from this paper. Link to the original blog post: https://openai.com/blog/deep-double-descent/
What is “double descent”?
It’s when a model’s error decreases in two distinct phases as a given parameter grows (for example, the model size):
In between the two phases appears a critical regime where error increases.
This phenomenon is an empirical observation that can appear when the model complexity goes beyond the “interpolation threshold” (i.e., when the complexity is just barely enough to fit the training dataset).
In this regime, neither the classical statistics paradigm (simpler models are better) nor the modern ML paradigm (bigger models are better) is fully valid.
This has surprising consequences!
#1 A bigger model can be worse
When entering the critical regime, for a given number of epochs, the test error increases with model size until the interpolation threshold.
Beyond the interpolation threshold, the error starts decreasing again (note that the train error becomes close to 0).
#2 More data can hurt
Even more surprising: for a given model size and number of epochs, using a bigger dataset may result in a larger error, because it shifts the position of the critical regime.
#3 Increasing epochs can invert over-fitting (yes!)
Just like with model size, there is an epoch-wise double descent effect. Below is the test error as a function of model size and number of epochs:
As you can see, if the model is big enough, the test error can have a local maximum. Once this maximum is reached, increasing the number of epochs starts reducing the error again.
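To see what such an epoch-wise curve is made of, here is a sketch of the measurement loop on a toy overparameterized linear model (the paper trains deep networks; everything below is an illustrative stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: more random features (p) than training points, so the
# model can interpolate, trained by full-batch gradient descent.
n_train, n_test, p = 30, 200, 120
X = rng.standard_normal((n_train + n_test, p))
true_w = rng.standard_normal(p) / np.sqrt(p)
y = X @ true_w + 0.3 * rng.standard_normal(n_train + n_test)
X_tr, y_tr = X[:n_train], y[:n_train]
X_te, y_te = X[n_train:], y[n_train:]

w = np.zeros(p)
lr, n_epochs = 1e-3, 300
test_errors = []
for epoch in range(n_epochs):
    grad = X_tr.T @ (X_tr @ w - y_tr) / n_train  # gradient of train MSE
    w -= lr * grad
    test_errors.append(np.mean((X_te @ w - y_te) ** 2))
# Plotting test_errors against epoch yields the epoch-wise curve;
# the paper produces the same kind of curve for deep networks of
# varying width, where the double descent shows up.
```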
Now you might wonder: why?
The authors suggest that at the interpolation threshold, only one model fits the training dataset, and all its parameters are tightly tuned. This model is therefore very sensitive to noise and generalizes poorly.
Beyond the threshold, many models can fit the training dataset, and stochastic gradient descent (SGD) tends to find the ones with better generalization ability.
This is only a hypothesis; the authors highlight that it remains an important open question. To be continued, then…
How to address double descent?
A dataset without noise is less likely to cause double descent. So as usual, a clean dataset is a must!
However, noise is not the whole story: the authors show that double descent can also happen with 0% noise in the data. Additionally, you usually don’t get to choose the dataset you’re working with. What to do then?
1. Set up a proper early stopping strategy; in most cases it can prevent the double descent effect:
2. Apply regularization to your model.
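As a sketch of both remedies (the helper names below are illustrative, not from the paper): a minimal patience-based early-stopping rule, plus L2 (ridge) regularization, which shrinks the unstable directions responsible for the error spike at the interpolation threshold:

```python
import numpy as np

def early_stopping_epoch(val_losses, patience=3):
    """Index of the epoch to stop at: the first epoch whose validation
    loss has not improved for `patience` consecutive epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1

# Example: validation loss bottoms out at epoch 1; with patience=3,
# training stops three epochs later instead of running on.
stop = early_stopping_epoch([1.0, 0.8, 0.9, 1.0, 1.1, 1.2], patience=3)
print(stop)

def ridge_fit(X, y, lam=1e-2):
    """L2-regularized least squares: replaces the min-norm fit with
    w = (X'X + lam*I)^-1 X'y, damping near-singular directions."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
```

In a deep learning framework, the equivalent of `ridge_fit`’s penalty is the optimizer’s weight-decay setting.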
If you know other options, leave a comment below!