Original article was published by Piero Paialunga on Artificial Intelligence on Medium
“But my mom says I’m beautiful” : a film written and directed by overfitting
A quantitative description of what overfitting is and how to avoid it, implemented in Python.
When you are a kid, sometimes your overprotective mother makes you feel beautiful, smart and kind. Of course, if you are one of those kids, you are confident that everyone thinks exactly the same thing that your mother thinks about you, but when you grow up and you go to school, sometimes your teacher tells you that you are acting wrong, and you are not so kind or smart or beautiful! In that moment you need to realize that maybe your mother loves you too much and gave you a false impression of yourself: your model is overfitting 🙂
If we think in Machine Learning terms, we find ourselves in an ‘overfitting’ scenario when the capacity of our model is higher than what our specific task requires. In other words, our algorithm performs well on our specific dataset, but it is not able to generalize to a dataset that it has never seen.
If you think about it, our brain is capable of generalizing in a really efficient way. If you see a cat walking down the street, a toy representing a cat, or Doraemon, you recognize that these three different entities all refer to the same thing: a cat.
But if you try to build an artificial intelligence, this cannot be taken for granted.
In this sense, Google made a huge mistake in 2015, when this lack of generalization led its algorithm to tag two black people as gorillas. Another huge bias was reported in 2018, when Amazon’s recruiting algorithm turned out to penalize women candidates. Of course, AI is neither racist nor sexist in itself. The likely problem is that the Google algorithm had been trained with a lot of gorillas but with very few black people. Something similar probably happened to Amazon.
As AI is getting more and more deeply involved in our lives, we need to be careful about the overfitting phenomenon, as its consequences can be serious.
Let’s go deeper.
1. The Problem
Let’s pretend we want to predict some values in the set of real numbers (regression). Of course if we have the analytic expression of the values we want to predict, we are not strictly “predicting” anything, because we have everything we need to get the real number in an analytical manner. For example if we have this function 𝑒𝑔(𝑥)=
Then, if we want to know which real number corresponds to x=4, it is sufficient to substitute x=4 in eg(x):
print ('The real number we want to "predict", for x=4, is %.4f' %(eg(4)))
The real number we want to "predict", for x=4, is 2996.2575
But life is not that easy 🙁 . If you find yourself in a situation where you need to use machine learning, it is because you are confident that this function exists, but you can’t find it in an explicit form. For our specific (and simple) example, let’s suppose that this function is known, and let’s pretend it is actually really simple:
We want to reproduce what happens when we deal with real data, so let’s pretend our data are a little messier. For example, let’s pretend that our data are disturbed by Gaussian noise:
return [stats.norm.rvs(loc=f(x), scale=0.1, size=1).tolist() for x in r]
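As a minimal sketch of this noisy-data step, the snippet below adds zero-mean Gaussian noise with standard deviation 0.1 to a function. The function `f` here (a sine) is only a hypothetical stand-in, since the article's true function is not reproduced; the helper name `noisy_targets`, the range of `x`, and the seed are assumptions too.

```python
import numpy as np

def f(x):
    # Hypothetical stand-in for the "unknown" true function (an assumption).
    return np.sin(x)

def noisy_targets(f, x, scale=0.1, seed=0):
    """Return f(x) plus zero-mean Gaussian noise with standard deviation `scale`."""
    rng = np.random.default_rng(seed)
    return f(x) + rng.normal(loc=0.0, scale=scale, size=x.shape)

x = np.linspace(0.0, 3.0, 50)   # arbitrary input range chosen for illustration
t = noisy_targets(f, x)         # noisy observations of f
print(t[:3])
```

Fixing the seed makes the noisy sample reproducible, which is convenient when comparing models on the same data.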
We would like to have an algorithm that is able to reproduce the following line:
Let’s suppose we want to attack this problem using polynomial regression. This kind of regression can be seen as a linear regression in a modified feature space.
Linear regression is a form of regression (that is, predicting a real number, the target, given a set of other numbers, the data) that uses a linear combination of the input (𝑥) and a set of parameters (𝑤_0 and 𝑤_1):
𝑦(𝑥) = 𝑤_0 + 𝑤_1𝑥
Now, let’s think about another (polynomial) feature space. For example, let’s suppose we are considering a degree-3 feature space:
for r in x:
So we converted a one-dimensional input into a 3-dimensional one.
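The feature map described above could be sketched as follows. The helper name `poly_features` is hypothetical, and this is only one possible way to build the features: each scalar input is turned into the row [x, x², x³].

```python
import numpy as np

def poly_features(x, degree=3):
    """Map each scalar x to the row [x, x^2, ..., x^degree]."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** d for d in range(1, degree + 1)])

Phi = poly_features([1.0, 2.0, 3.0], degree=3)
print(Phi.shape)   # one row per input, one column per power
print(Phi[1])      # features of x=2
```

The constant term 𝑤_0 is not part of this map; it can be handled by adding a column of ones to the features.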
Then, polynomial regression can be expressed by the following set of parameters:
Now, let’s suppose we choose this set of parameters randomly and use it to get our prediction:
for i in range(4):
Not that satisfying, right?! Don’t worry, we can do better than that. Linear regression (or polynomial, whatever) has a closed-form optimum for its coefficients, obtainable, for example, with the maximum likelihood criterion (https://tvml.github.io/ml1920/note/linregr-notes.pdf):
𝑤 = (ΦᵀΦ)⁻¹ Φᵀ 𝑡
- Φ (Phi) stands for the modified data input
- 𝑡 stands for the target vector: the list of values we want to predict
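For illustration, the closed-form optimum can be written explicitly by solving the normal equations (ΦᵀΦ)𝑤 = Φᵀ𝑡. This is a sketch under assumptions: the helper names `design_matrix` and `fit_closed_form` are hypothetical, and the noiseless cubic used as a target is made up just to check that the known coefficients are recovered. The result is also compared with NumPy's least-squares solver.

```python
import numpy as np

def design_matrix(x, degree):
    """Columns [1, x, x^2, ..., x^degree]; the leading 1 carries w_0."""
    return np.vander(np.asarray(x, dtype=float), degree + 1, increasing=True)

def fit_closed_form(Phi, t):
    """Solve the normal equations (Phi^T Phi) w = Phi^T t for w."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

x = np.linspace(-1.0, 1.0, 20)
t = 1.0 + 2.0 * x - 3.0 * x**2 + 0.5 * x**3   # made-up noiseless cubic
Phi = design_matrix(x, degree=3)

w = fit_closed_form(Phi, t)                          # explicit formula
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)    # library solver
print(np.round(w, 6))
```

Solving the linear system is generally preferable to forming the inverse of ΦᵀΦ explicitly, and a least-squares solver is more robust still when the degree is high and the features become nearly collinear.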
And don’t forget to be lazy! You don’t need to implement the formula in an explicit manner: someone else did this for you! Let’s use the optimum formula for our specific case:
Mh, it looks slightly better. If we want to know whether our predictor is doing a good or bad job, we can use the following error estimator, the Mean Squared Error (MSE):
MSE = (1/𝑛) Σ_𝑖 (𝑡_𝑖 − 𝑦_𝑖)²
- 𝑛 is the size of our dataset
- 𝑡_𝑖 is the target of our dataset (the i-th element of the target)
- 𝑦_𝑖 is the prediction of our dataset (the prediction on the i-th element of the dataset)
Geometrically speaking, it is a measure of the mean squared distance between the target and the prediction. In our specific case:
for i in range(len(pred_four)):
print('The MSE using polynomial of degree=3 is the following: %.3f'%(error_four))
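The MSE defined above can be sketched in a few lines. The function name `mse` and the toy numbers are assumptions chosen for illustration, not the article's data.

```python
import numpy as np

def mse(t, y):
    """Mean of the squared differences between targets t and predictions y."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((t - y) ** 2))

# Toy example: only the last prediction is off, by 2, so MSE = (0 + 0 + 4) / 3
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```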
Can we do better than that? Let’s try to increase the degree of our polynomial feature space and see what happens!
for d in DEG:
for i in range(len(pred)):
print ('For degree=19, MSE=%.4f'%(MSE[len(MSE)-1]))
It looks way better! Actually, the error seems to decrease as we increase the degree of the polynomial features:
But as the saying goes, all that glitters is not gold. We used the same dataset to train our model and to test the optimum model. In fact, we used the closed-form expression mentioned before on the whole dataset, and we checked the performance of the model on the whole dataset too: we are playing dirty.
Let’s use the same parameters of the best algorithm, and let’s see how it performs on a part of the dataset it has never seen before:
for i in range(len(pred)):
print('MSE on test set, for degree=19 is the following: MSE=%.2f'%(overfitting_error/len(pred)))
As we can see, the error we make using this model is actually huge. This phenomenon, as we could expect, is known as overfitting. If we train our model with a specific set of data, it may do extremely well on that specific set, but it could perform really badly on a set of points it has never seen before.
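A small experiment along these lines might look as follows. It is a sketch under assumptions: the true function (a sine), the input range, the sample sizes, and the seed are all hypothetical, not the article's. A degree-19 polynomial is fitted on 20 training points and then scored on 20 points it has never seen.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical "true" function (an assumption, not the article's).
    return np.sin(np.pi * x)

def design_matrix(x, degree):
    """Columns [1, x, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def mse(t, y):
    return float(np.mean((t - y) ** 2))

# Training and test points come from the same noisy process
x_train = np.linspace(-1.0, 1.0, 20)
x_test = rng.uniform(-1.0, 1.0, size=20)
t_train = f(x_train) + rng.normal(scale=0.1, size=x_train.shape)
t_test = f(x_test) + rng.normal(scale=0.1, size=x_test.shape)

degree = 19   # as many parameters as training points
w, *_ = np.linalg.lstsq(design_matrix(x_train, degree), t_train, rcond=None)

train_mse = mse(t_train, design_matrix(x_train, degree) @ w)
test_mse = mse(t_test, design_matrix(x_test, degree) @ w)
print(f"degree={degree}: train MSE={train_mse:.2e}, test MSE={test_mse:.2e}")
```

With as many parameters as training points, the polynomial can pass through every training point, noise included, which is why the training error says very little about the error on new points.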
2. The strategy
As we have seen, it is extremely simple to fall into the overfitting trap, and we need to be careful. Let’s show some strategies here:
2.1 Keeping it simple: Comparing the MSE
A really simple strategy is to compare the MSE on the training set and on the test set, and to pick the model with the lowest MSE on the test set rather than on the training set:
for d in DEG:
for i in range(len(pred)):
for i in range(len(pred_test)):
As we can see, the MSE becomes extremely high at a certain point on the test set, even if it keeps decreasing on the training set. When the test MSE starts increasing, we can stop increasing the order of our polynomial. For example, in our case we could think of using 𝐷=1 (don’t get fooled, the MSE starts increasing soon on the test set!):
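The selection rule described above, sweeping the degree and keeping the one with the lowest test MSE, could be sketched like this. The data-generating function, sample sizes, and seed are assumptions, and `fit_and_score` is a hypothetical helper; `DEG` follows the naming of the loops above.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Hypothetical "true" function (an assumption).
    return np.sin(np.pi * x)

def fit_and_score(x_tr, t_tr, x_te, t_te, degree):
    """Fit on the training set; return (train MSE, test MSE)."""
    Phi_tr = np.vander(x_tr, degree + 1, increasing=True)
    Phi_te = np.vander(x_te, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi_tr, t_tr, rcond=None)
    train = float(np.mean((t_tr - Phi_tr @ w) ** 2))
    test = float(np.mean((t_te - Phi_te @ w) ** 2))
    return train, test

x_train = rng.uniform(-1.0, 1.0, 20)
x_test = rng.uniform(-1.0, 1.0, 20)
t_train = f(x_train) + rng.normal(scale=0.1, size=20)
t_test = f(x_test) + rng.normal(scale=0.1, size=20)

DEG = list(range(1, 16))
scores = [fit_and_score(x_train, t_train, x_test, t_test, d) for d in DEG]
test_mse = [s[1] for s in scores]

# Keep the degree whose test MSE is lowest, regardless of the training MSE
best_degree = DEG[int(np.argmin(test_mse))]
print("best degree on the test set:", best_degree)
```

Note that a test set used this way to choose the degree is no longer an unbiased judge of the final model, which is one motivation for keeping a separate external test set.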
Not that satisfying, right? Let’s look for some other options!
2.2 Regularization
A more robust approach consists in changing the loss function, adding a penalty that limits the values of the parameters. With this approach, we may use a higher-degree polynomial, but some 𝑤_𝑖 might be really close (or equal) to zero. There are many kinds of regularization, but here we mention two important types:
- Ridge Regularization
- Lasso Regularization
2.2.1 Ridge Regularization
It adds 𝛼‖𝑤‖² to the loss function (given by the MSE expression). We want a function that minimizes the distance between our prediction and the target but, at the same time, keeps ‖𝑤‖² as small as possible. We do that to decrease the power of the algorithms we are considering, so that they are not influenced too much by the data they have been trained with:
for a in ALPHA:
for d in DEG:
for i in range(len(pred_test)):
NEW_ALPHA=list(itertools.chain.from_iterable(itertools.repeat(x, 20) for x in lst))
for i in range(len(ALPHA)):
It does work better! The MSE is significantly lower (0.0526 vs 0.2479) and the curve manages to capture the ascending part we would like to reproduce, even if it doesn’t manage to capture the descending part.
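To make the Ridge penalty concrete, here is a minimal NumPy sketch that minimizes MSE + 𝛼‖𝑤‖² in closed form. Setting the gradient to zero gives (ΦᵀΦ + 𝑛𝛼I)𝑤 = Φᵀ𝑡. This is not the library estimator used in the article: the helper `ridge_fit`, the sine data, and the 𝛼 values are assumptions, and for simplicity the constant term 𝑤_0 is penalized as well.

```python
import numpy as np

def ridge_fit(Phi, t, alpha):
    """Minimize mean((t - Phi w)^2) + alpha * ||w||^2 in closed form."""
    n, m = Phi.shape
    A = Phi.T @ Phi + n * alpha * np.eye(m)   # the penalty adds n*alpha on the diagonal
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 20)
t = np.sin(np.pi * x) + rng.normal(scale=0.1, size=x.shape)  # hypothetical data
Phi = np.vander(x, 10, increasing=True)   # degree-9 features, bias column included

ALPHA = (1e-3, 1e-1, 10.0)
norms = []
for a in ALPHA:
    w = ridge_fit(Phi, t, a)
    norms.append(float(np.linalg.norm(w)))
    print(f"alpha={a:g}: ||w||={norms[-1]:.3f}")
```

As 𝛼 grows, the norm of the parameter vector shrinks: the polynomial keeps its nominal degree, but it has less freedom to chase the noise.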
2.2.2 Lasso Regularization
It adds 𝛼‖𝑤‖₁ (the sum of the absolute values of the parameters) to the loss function (given by the MSE expression). We want a function that minimizes the distance between our prediction and the target but, at the same time, keeps ‖𝑤‖₁ as small as possible. We do that to decrease the power of the algorithms we are considering, so that they are not influenced too much by the data they have been trained with. Note the difference from Ridge: this kind of regularization is more severe and it gives sparsity to our model. Almost the same lines of code (it works perfectly replacing “Ridge” with “Lasso”) have been used, obtaining the following result:
In this case, we see that our model is too weak and it does not capture the true form we wanted to predict. In fact, even though it is nominally a third-degree polynomial, every other parameter is forced to zero, and the final result is a straight line. This is the dark side of the approach we used so far, and it is called underfitting.
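The sparsity effect can be illustrated with a small coordinate-descent sketch of the Lasso. This is an assumption-laden toy, not the library estimator used in the article: the helpers `soft_threshold` and `lasso_fit`, the sine data, and the 𝛼 values are hypothetical, and the loss is written with a factor 1/2 in front of the MSE for convenience. Features and targets are centered so that no constant term is needed.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def lasso_fit(Phi, t, alpha, n_iter=500):
    """Coordinate descent for (1/(2n)) * ||t - Phi w||^2 + alpha * ||w||_1."""
    n, m = Phi.shape
    w = np.zeros(m)
    col_sq = (Phi ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(m):
            # Residual computed as if feature j were not in the model
            residual_j = t - Phi @ w + Phi[:, j] * w[j]
            rho = Phi[:, j] @ residual_j / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

rng = np.random.default_rng(4)
x = np.linspace(-1.0, 1.0, 30)
Phi = np.column_stack([x, x**2, x**3])
Phi = Phi - Phi.mean(axis=0)                 # center the features
t = np.sin(np.pi * x) + rng.normal(scale=0.1, size=x.shape)  # hypothetical data
t = t - t.mean()                             # center the targets

w_small = lasso_fit(Phi, t, alpha=1e-3)
w_big = lasso_fit(Phi, t, alpha=10.0)
print("alpha=0.001:", np.round(w_small, 3))
print("alpha=10   :", np.round(w_big, 3))
```

With a very large 𝛼, every coefficient is thresholded to exactly zero and the prediction collapses to a constant, an extreme case of the underfitting described above.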
2.3 Cross Validation and other strategies:
Of course regularization is just one way to prevent overfitting. A lot of different strategies have been developed and studied, and a lot of new strategies will come. In these few lines, one of the classics of overfitting prevention will be introduced but not analyzed in detail.
In cross validation, we rotate the training set and the validation set and we compare the mean performance of every model.
The dataset is thus divided into three parts:
- An external test set, used for the final performance evaluation
- A training set (iteratively changed), used to train the algorithm
- A validation set (iteratively changed), used to test the intermediate performance of the algorithm.
Let’s say our dataset is the following:
Then we split it in the following way:
Cross validation is based on iteratively changing (k times) the training set and the validation set within the “data” input. The mean MSE is computed (over all the validation sets), and the best model (minimum mean MSE) is then used and tested on the external test set.
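The procedure above can be sketched with plain NumPy. Everything here is an assumption made for illustration: the sine data, the 48/12 split between “data” and external test set, k=5, the candidate degrees, and the helper names `fit`, `mse` and `cv_mse`.

```python
import numpy as np

def fit(x, t, degree):
    """Least-squares polynomial fit; returns the coefficients w_0..w_degree."""
    Phi = np.vander(x, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def mse(x, t, w):
    """MSE of the polynomial with coefficients w on the points (x, t)."""
    Phi = np.vander(x, len(w), increasing=True)
    return float(np.mean((t - Phi @ w) ** 2))

def cv_mse(x, t, degree, k=5):
    """Mean validation MSE over k folds."""
    indices = np.arange(len(x))
    errors = []
    for val_idx in np.array_split(indices, k):
        train_idx = np.setdiff1d(indices, val_idx)   # everything outside the fold
        w = fit(x[train_idx], t[train_idx], degree)
        errors.append(mse(x[val_idx], t[val_idx], w))
    return float(np.mean(errors))

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, 60)   # drawn in random order, so contiguous folds are random
t = np.sin(np.pi * x) + rng.normal(scale=0.1, size=x.shape)  # hypothetical data

# External test set: touched only once, at the very end
x_data, t_data = x[:48], t[:48]
x_ext, t_ext = x[48:], t[48:]

DEG = [1, 3, 5, 9]
cv_scores = {d: cv_mse(x_data, t_data, d) for d in DEG}
best_degree = min(cv_scores, key=cv_scores.get)

w_best = fit(x_data, t_data, best_degree)   # refit on all of "data"
external_mse = mse(x_ext, t_ext, w_best)
print("best degree:", best_degree, "external test MSE:", round(external_mse, 4))
```

Refitting the chosen model on all of “data” before the final evaluation is one common choice; the important point is that the external test set plays no role in selecting the degree.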
3. Conclusions
Overfitting has been intuitively and mathematically described through a really simple example: polynomial regression on a one-dimensional input.
As overfitting could have serious consequences in our lives (if AI tools are overfitted), it is really important to check whether our model is robust and whether it performs as we expect even when it is applied to data it has never seen before. In other words, we need to be sure that our model is able to generalize its task to a new set of input data.
Some strategies to prevent overfitting, such as regularization and cross validation, have been shown and explained.
Of course, another good strategy is to use more and more data to train our model. In fact, if our data are representative enough, overfitting becomes much less likely, because our specific data will be a good sample of all the other data that the algorithm hasn’t seen yet.