Original article can be found here (source): Artificial Intelligence on Medium
Problem: Overfitting, Solution: Regularization
What makes a model overfit and how we can solve this issue
We all have those friends who tell stories in excruciating detail. When you ask them about a movie they saw recently, you may end up learning not only about the movie but also the watch of the guy selling popcorn at movie theater. On the other hand, we have uncommunicative friends who would just tell you that the movie was “good” or “bad”.
Overfitting and underfitting are similar to those different types of friends in terms of the amount of detail. An overfit machine learning model tries to pick up every detail in the training data, whereas an underfit one is too general and tends to miss important trends in the training data.
A more machine-learning-like example would be predicting the object in an image. Assume we are trying to build a model that predicts tomatoes in images:
- Model A: Red, circle, green star shape on top, a few water droplets
- Model B: Red, circle
The problem with model A is that not all tomatoes have water droplets on them. This model is too specific and is likely to pick only wet tomatoes. It does not generalize well to all tomatoes. Because it looks for water droplets, it cannot recognize dry tomatoes in an image. It is overfitting.
On the other hand, model B thinks everything that is red and has a circular shape is a tomato, which is not true. This model is too general and is not able to detect the critical features of tomatoes. It is underfitting.
These examples are not exactly how a machine learning model learns and predicts but give an overview of overfitting and underfitting. In this post, I will cover the following concepts in detail:
- Bias and variance
- Overfitting and underfitting
- Regularization
Bias and Variance
Bias and variance are essential to understanding overfitting and underfitting.
Bias is a measure of how far the average prediction is from the real values. Bias arises if we try to model a complex relation with a very simple model. The predictions of a model with high bias are very similar to each other. Since the model is not sensitive to the variations within the data, its accuracy tends to be low on both training data and test data (previously unseen data).
The blue dots are the observations in the training set and the red line is our biased model, which does not care about the fluctuations within the two features of the observations and makes the same prediction every time.
Variance is the opposite of bias in terms of sensitivity to changes within the data. A model with high variance is highly sensitive to even small changes within the training data. It tries to pick up every small detail, so very small changes in the training data also change the model. Models with high variance also tend to capture noise in the data. Even an outlier would fall within the scope of the model.
As you can see, the model tries to adjust to all the variations within the data. Predictions of a model with high variance are widely spread out. It is clear that this model has a very high accuracy on the training set. However, it will perform poorly on new, previously unseen observations.
Machine learning models are built to work on previously unseen observations, so models with high variance are not acceptable. We also cannot rely on models with high bias. Therefore, it is crucial to find the balance between bias and variance. There is always a trade-off between the two. We can easily find a model with high bias and low variance, or one with low bias and high variance. However, the success of your model depends on finding the optimal point in between. For example, the model below seems like a good fit.
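To make the trade-off concrete, here is a minimal sketch on made-up data (the data-generating line, the noise level, and all three toy models are assumptions chosen for illustration, not part of the original post). A constant predictor stands in for a high-bias model, a nearest-neighbor "memorizer" stands in for a high-variance model, and a least-squares line sits in between.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def true_signal(x):
    # Assumed underlying relation for this toy example
    return 2.0 * x + 1.0

# Evenly spaced inputs; even positions go to training, odd positions to test
xs = [i / 10 for i in range(40)]
data = [(x, true_signal(x) + random.gauss(0.0, 0.5)) for x in xs]
train, test = data[0::2], data[1::2]

def mse(model, points):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

mean_x = sum(x for x, _ in train) / len(train)
mean_y = sum(y for _, y in train) / len(train)

# High bias: ignore x and always predict the mean training target
def constant_model(x):
    return mean_y

# High variance: memorize the training set and echo the nearest training target
def memorizer(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

# In between: ordinary least-squares line fitted on the training set
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

def linear_model(x):
    return slope * x + intercept

for name, model in [("constant", constant_model),
                    ("memorizer", memorizer),
                    ("linear", linear_model)]:
    print(f"{name:10s} train MSE={mse(model, train):.3f}  test MSE={mse(model, test):.3f}")
```

The memorizer reproduces every training target exactly, so its training error is zero by construction, while the constant model cannot beat the fitted line even on the training set. How the three compare on the test set depends on the noise, which is the point of checking unseen data.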
Overfitting and Underfitting
- A model with high bias tends to underfit.
- A model with high variance tends to overfit.
Overfitting arises when a model tries to fit the training data so well that it cannot generalize to new observations. Well-generalized models perform better on new observations. If a model is more complex than necessary, it is highly likely that we end up with overfitting. Underfit models perform poorly on both training and test data sets.
In a supervised learning task, we can detect overfitting by comparing the model accuracy on training and test datasets. If the accuracy on the training dataset (observations that the model has seen) is much higher than the accuracy on the test dataset (unseen observations), then the model is overfitting.
The loss measures the difference between the actual target value and the predicted value. A supervised learning model performs several iterations to minimize this loss by updating feature weights. However, after some point, the model behaves differently on test and training data. The loss keeps decreasing on training data but starts to increase on test data. It is crucial to detect this point to create an outstanding machine learning model.
Overfitting is a serious issue for machine learning models, but how do we prevent a model from overfitting? The answer is regularization.
The main cause of overfitting is making a model more complex than necessary. If we find a way to reduce the complexity, we reduce the risk of overfitting.
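The comparison described above can be sketched in a few lines. The helper names, the made-up predictions, and the 0.10 gap threshold are all assumptions for illustration; what counts as "much higher" depends on the problem.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return correct / len(labels)

def looks_overfit(train_acc, test_acc, max_gap=0.10):
    """Flag a model whose training accuracy exceeds test accuracy by more than max_gap."""
    return (train_acc - test_acc) > max_gap

# Made-up predictions and labels, for illustration only
train_acc = accuracy([1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
                     [1, 0, 1, 1, 0, 1, 0, 0, 1, 1])   # 10 of 10 correct
test_acc = accuracy([1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
                    [1, 1, 0, 1, 0, 0, 0, 1, 1, 1])    # 6 of 10 correct

print(train_acc, test_acc, looks_overfit(train_acc, test_acc))
```

Here the model is perfect on the data it has seen but right only 60% of the time on unseen data, so the check flags it.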
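One simple way to locate that turning point is to track the loss on held-out data after every iteration and remember where it was lowest. The sketch below assumes made-up loss values and a hypothetical `patience` parameter (how many non-improving epochs to tolerate before giving up); in practice a separate validation set is typically used for this rather than the final test set.

```python
def best_stopping_epoch(val_losses, patience=2):
    """Return the index of the lowest held-out loss seen before the loss
    fails to improve for `patience` epochs in a row."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # held-out loss has stopped improving
    return best_epoch

# Made-up loss curves: training loss keeps falling, held-out loss turns upward
train_losses = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18, 0.15]
val_losses = [0.95, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66, 0.75]

print(best_stopping_epoch(val_losses))  # -> 3
```

The training loss alone would suggest training forever; the held-out curve shows that anything after epoch 3 is fitting noise.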
Regularization penalizes complex models.
Regularization adds a penalty for higher terms in the model and thus controls the model complexity. If a regularization term is added, the model tries to minimize both the loss and the complexity of the model.
With a suitable penalty strength, regularization reduces the variance without causing a substantial increase in the bias.
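Written as a formula, the idea is roughly the following. The symbols are chosen here for illustration and do not come from the original post: L(w) is the training loss for weights w, Ω(w) is some measure of model complexity, and λ ≥ 0 sets how strongly complexity is penalized.

```latex
\min_{w} \; L(w) + \lambda \, \Omega(w)
```

With λ = 0 the objective reduces to the plain training loss; larger values of λ trade training fit for a simpler model.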
Two common methods of regularization are L1 and L2 regularization. The complexity of a model depends on:
- Total number of features (handled by L1 regularization), or
- The weights of features (handled by L2 regularization)
L1 regularization is also called regularization for sparsity. As the name suggests, it is used to handle sparse vectors. If we have a high-dimensional feature vector space, the model becomes very difficult to handle.
L1 regularization forces the weights of uninformative features to be zero. It acts like a force that subtracts a small amount from the weight at each iteration, eventually making the weight zero.
L1 regularization penalizes |weight|.
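The "subtract a small amount each iteration" behavior can be sketched as follows. This isolates the effect of the L1 penalty and ignores the data-loss gradient that a real training step would also apply; the function name and the step size are assumptions for illustration.

```python
def l1_step(weight, shrink):
    """Move the weight toward zero by a fixed amount, stopping at exactly zero."""
    if weight > shrink:
        return weight - shrink
    if weight < -shrink:
        return weight + shrink
    return 0.0  # within one step of zero: snap to zero instead of overshooting

w = 0.5
history = []
for _ in range(6):
    w = l1_step(w, shrink=0.125)
    history.append(w)

print(history)  # -> [0.375, 0.25, 0.125, 0.0, 0.0, 0.0]
```

Because the amount removed does not depend on the size of the weight, a weight that the data does not support reaches zero after a finite number of steps and stays there.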
L2 regularization is also called regularization for simplicity. If we take model complexity as a function of the weights, a feature's contribution to complexity grows with the absolute value of its weight.
L2 regularization pushes weights toward zero but does not make them exactly zero. It acts like a force that removes a small percentage of each weight at each iteration, so the weights shrink but never reach zero.
L2 regularization penalizes (weight)²
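The "remove a small percentage each iteration" behavior falls out of the gradient of the squared penalty. The sketch below takes gradient steps on the penalty term lam * weight² alone, ignoring the data loss; the function name, learning rate, and lam value are assumptions for illustration.

```python
def l2_step(weight, learning_rate, lam):
    """Gradient step on lam * weight**2 alone.
    The gradient is 2 * lam * weight, so each step removes a fixed fraction of the weight."""
    return weight - learning_rate * 2.0 * lam * weight

w = 0.5
history = []
for _ in range(6):
    w = l2_step(w, learning_rate=0.5, lam=0.25)  # each step keeps 75% of the weight
    history.append(w)

print(history)
```

Since the amount removed is proportional to the weight itself, the steps get smaller as the weight shrinks, which is why the weight approaches zero without landing on it.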
An additional parameter, the regularization rate (lambda), tunes the L2 regularization term. The regularization rate is a scalar that is multiplied by the L2 regularization term.
Note: Choosing an optimal value for lambda is important. The goal of L2 regularization is simplicity. However, if lambda is too high, the model becomes too simple and is likely to underfit. On the other hand, if lambda is too low, the effect of regularization becomes negligible and the model is likely to overfit. If lambda is set to zero, then regularization is removed completely (high risk of overfitting!).
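The effect of lambda is easy to see in the simplest possible case: a single weight, no intercept, and squared-error loss. Under those assumptions (chosen here for illustration, along with the made-up data), minimizing sum((y - w*x)²) + lambda * w² has the closed-form solution w = sum(x*y) / (sum(x²) + lambda).

```python
def ridge_weight(xs, ys, lam):
    """Closed-form minimizer of sum((y - w*x)**2) + lam * w**2 for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Made-up data that follows y = 2x exactly
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(f"lambda={lam:6.1f}  weight={ridge_weight(xs, ys, lam):.4f}")
```

With lambda at zero the unregularized weight of 2.0 is recovered; at lambda = 10 the weight is already down to 1.5, and at 100 it is pulled well below the true slope, which is the underfitting end of the trade-off.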
Note: Ridge regression uses L2 regularization whereas Lasso regression uses L1 regularization. Elastic net regression combines L1 and L2 regularization.
Overfitting is a crucial issue for machine learning models and needs to be handled carefully. We build machine learning models using the data we already know, but we test them on new, previously unseen data. We want the model to learn the trends in the training data but, at the same time, we do not want it to focus too much on the training data. The success of a model depends on finding the optimum point between overfitting and underfitting.