Original article was published on Deep Learning on Medium
Over-fitted and Under-fitted models.
In statistics and machine learning, one of the most common tasks is to fit a model to a set of training data so that it can make reliable predictions on unseen (test) data.
So, what is Over-fitting?
- An over-fitted statistical model describes random error or noise instead of the underlying relationship.
- Over-fitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
- An over-fitted model has poor predictive performance because it overreacts to minor fluctuations in the training data.
- Over-fitting is a modeling error that occurs when a function is fit too closely to a limited set of data points.
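To make the points above concrete, here is a minimal sketch in plain Python. The data set is hypothetical: the true relationship is assumed to be y = x, and the training targets carry hand-picked noise of +/-0.5. A two-parameter line is compared with a degree-4 polynomial that has as many parameters as there are training points, so it can pass through every one of them.

```python
# Hypothetical toy data: the true relationship is y = x, and the training
# targets carry hand-picked noise of +/-0.5 (an assumption for illustration).
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.5, 0.5, 2.5, 2.5, 4.5]
test_x = [0.5, 1.5, 2.5, 3.5]
test_y = [0.5, 1.5, 2.5, 3.5]  # noise-free values of the true relationship


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (2 parameters)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_interpolating_polynomial(xs, ys):
    """Polynomial of degree len(xs) - 1 through every point (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


simple = fit_line(train_x, train_y)
flexible = fit_interpolating_polynomial(train_x, train_y)

for name, model in [("line", simple), ("degree-4 polynomial", flexible)]:
    print(f"{name}: train MSE = {mse(model, train_x, train_y):.3f}, "
          f"test MSE = {mse(model, test_x, test_y):.3f}")
```

On this toy data the polynomial reaches zero training error because it reproduces the noise exactly, yet its test error is larger than that of the simple line. That gap between training and test error is the usual symptom of over-fitting.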
And, what is Under-fitting?
- Under-fitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data.
- Under-fitting would occur, for example, when fitting a linear model to non-linear data. Such a model would also have poor predictive performance.
- Intuitively, under-fitting occurs when the model or the algorithm does not fit the data well enough.
- Under-fitting occurs if the model or algorithm shows low variance but high bias.
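The linear-model example above can be sketched in a few lines of plain Python. The data is hypothetical and noise-free, generated from y = x**2, and the helper names `fit_line` and `mse` are made up for this illustration.

```python
# Hypothetical noise-free data from the non-linear relationship y = x**2.
train_x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
train_y = [x ** 2 for x in train_x]
test_x = [-2.5, -1.5, 1.5, 2.5]
test_y = [x ** 2 for x in test_x]


def fit_line(features, targets):
    """Least-squares fit of target = slope * feature + intercept."""
    mean_f = sum(features) / len(features)
    mean_t = sum(targets) / len(targets)
    sff = sum((f - mean_f) ** 2 for f in features)
    sft = sum((f - mean_f) * (t - mean_t) for f, t in zip(features, targets))
    slope = sft / sff
    return slope, mean_t - slope * mean_f


def mse(slope, intercept, features, targets):
    """Mean squared error of the fitted line on the given points."""
    errors = [(slope * f + intercept - t) ** 2 for f, t in zip(features, targets)]
    return sum(errors) / len(errors)


# Under-fitted model: a straight line in x cannot follow the curve.
line = fit_line(train_x, train_y)
line_train = mse(*line, train_x, train_y)
line_test = mse(*line, test_x, test_y)

# More expressive model: the same least-squares fit, but on the feature x**2.
squared_train = [x ** 2 for x in train_x]
squared_test = [x ** 2 for x in test_x]
curve = fit_line(squared_train, train_y)
curve_train = mse(*curve, squared_train, train_y)
curve_test = mse(*curve, squared_test, test_y)

print(f"line in x:    train MSE = {line_train:.3f}, test MSE = {line_test:.3f}")
print(f"line in x**2: train MSE = {curve_train:.3f}, test MSE = {curve_test:.3f}")
```

The straight line has a large error on both the training and the test points, which is what high bias looks like: the model is wrong in the same way everywhere, no matter which data it is evaluated on.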
How to combat Over-fitting and Under-fitting?
To combat over-fitting:
- Use resampling (k-fold cross-validation) to estimate the model accuracy, and keep a separate validation dataset to evaluate the model. These techniques do not remove over-fitting by themselves, but they reveal it, because an over-fitted model scores well on the training data and poorly on held-out data.
- Early stopping is a form of regularization used to avoid over-fitting when training a model with an iterative method, such as gradient descent. However, a small drawback of early stopping is that it couples two goals, optimizing the cost function and not over-fitting the model, so the cost function ends up less optimized because training stopped early. (L2 regularization is a common alternative that avoids this, because the model can still be trained to convergence.)
- Pruning is used extensively when building decision tree models. It removes the nodes that add little predictive power for the problem at hand. However, it is usually not needed in the Random Forest algorithm, because each tree is trained on a random sample of the data and considers random subsets of the features, so the individual trees are strong but only weakly correlated with each other.
- Regularization introduces a penalty term into the objective function that grows with the size of the coefficients, so bringing in more features has a cost. It therefore shrinks the coefficients toward zero (L1 regularization can push many of them exactly to zero), which reduces the penalty term and the effective complexity of the model.
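Two of the ideas above, k-fold cross-validation and an L2 penalty, can be sketched together in plain Python. This is only an illustration under stated assumptions: the data is hypothetical (roughly y = 2x with hand-picked noise), the model is a one-parameter line y = w * x without an intercept, and the function names are made up for this example.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k interleaved folds (no shuffling)."""
    return [list(range(i, n_samples, k)) for i in range(k)]


def fit_ridge_slope(xs, ys, penalty):
    """Minimise sum((w * x - y)**2) + penalty * w**2 for the model y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


def cross_validated_mse(xs, ys, penalty, k=3):
    """Average held-out MSE over k folds."""
    fold_errors = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        fit_x = [x for i, x in enumerate(xs) if i not in held_out]
        fit_y = [y for i, y in enumerate(ys) if i not in held_out]
        w = fit_ridge_slope(fit_x, fit_y, penalty)
        fold_mse = sum((w * xs[i] - ys[i]) ** 2 for i in fold) / len(fold)
        fold_errors.append(fold_mse)
    return sum(fold_errors) / k


# Hypothetical data, roughly y = 2x with hand-picked noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.3, 3.8, 6.4, 7.7, 10.2, 11.9]

for penalty in [0.0, 1.0, 10.0, 100.0]:
    w = fit_ridge_slope(xs, ys, penalty)
    score = cross_validated_mse(xs, ys, penalty)
    print(f"penalty={penalty:>5}: slope={w:.3f}, 3-fold CV MSE={score:.3f}")
```

The slope shrinks toward zero as the penalty grows, and the cross-validated error gives a way to choose the penalty: a penalty that is too large makes the model too simple and the held-out error rises again.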
To combat under-fitting:
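- Under-fitting can be reduced by giving the model more informative features (feature engineering). More data alone rarely fixes it, and reducing the features by feature selection makes the model simpler, which is a remedy for over-fitting rather than under-fitting.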
- Increase the size of the ML model, that is, the number of its parameters.
- Increase the complexity of the model, or switch to a more expressive type of model.
- Increase the training time until the cost function of the model is minimized.
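The last point, training for longer, can be illustrated with a small gradient descent loop in plain Python. This is a sketch under assumptions: the data is hypothetical and lies exactly on y = 3x, the model is the one-parameter line y = w * x, and the learning rate of 0.01 is an arbitrary choice.

```python
# Hypothetical data lying exactly on y = 3x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]


def cost(w):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(epochs, learning_rate=0.01):
    """Run batch gradient descent from w = 0 for a fixed number of epochs."""
    w = 0.0
    for _ in range(epochs):
        gradient = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * gradient
    return w


for epochs in [1, 5, 50, 500]:
    w = train(epochs)
    print(f"{epochs:>3} epochs: w = {w:.4f}, cost = {cost(w):.6f}")
```

A model stopped after only a few epochs is under-fitted simply because it has not finished learning. Letting it run longer moves w toward 3 and the cost toward its minimum.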
Good Fit in a Statistical Model:
- Ideally, a model that makes its predictions with 0 error is said to have a good fit on the data.
- This situation is achievable at a spot between over-fitting and under-fitting. To understand it, we have to look at the performance of our model over time, while it is learning from the training dataset.
Over time, our model keeps learning, so its error on both the training and testing data keeps decreasing. If it learns for too long, the model becomes more prone to over-fitting, because it starts to learn the noise and the less useful details in the training data. From that point on, the error on the testing data starts to increase even though the error on the training data keeps falling, so the performance of our model on unseen data decreases.
- To get a good fit, we stop at the point just before the error on the testing data starts increasing. At this point the model is said to have good skill on the training dataset as well as on our unseen testing dataset, as shown in the picture below.
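The stopping rule described above can be sketched in plain Python. Everything here is an assumption made for illustration: the data is hypothetical and hand-picked so that the slope that fits the training set best (about 1.43) overshoots the slope that fits the held-out set best (1.0), the model is y = w * x, and `patience` is the number of epochs we are willing to wait for the held-out error to improve.

```python
# Hypothetical data: the training optimum (w of about 1.43) overshoots the
# slope that fits the held-out validation points best (w = 1.0).
train_x, train_y = [1.0, 2.0, 3.0], [1.5, 2.5, 4.5]
val_x, val_y = [1.5, 2.5], [1.5, 2.5]


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train_with_early_stopping(learning_rate=0.01, max_epochs=200, patience=5):
    """Gradient descent that stops once the validation error stops improving."""
    w = 0.0
    best_w, best_val, best_epoch = w, mse(w, val_x, val_y), 0
    history = []
    for epoch in range(1, max_epochs + 1):
        gradient = sum(
            2 * x * (w * x - y) for x, y in zip(train_x, train_y)
        ) / len(train_x)
        w -= learning_rate * gradient
        train_error = mse(w, train_x, train_y)
        val_error = mse(w, val_x, val_y)
        history.append((epoch, train_error, val_error))
        if val_error < best_val:
            best_w, best_val, best_epoch = w, val_error, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement on held-out data for `patience` epochs
    return best_w, best_epoch, history


best_w, best_epoch, history = train_with_early_stopping()
print(f"best epoch = {best_epoch}, w = {best_w:.3f}")
for epoch, train_error, val_error in history[-6:]:
    print(f"epoch {epoch:>2}: train MSE = {train_error:.4f}, "
          f"val MSE = {val_error:.4f}")
```

In the last few epochs the training error is still falling while the error on the held-out data is rising, so the loop keeps the weights from the best epoch instead of the final ones.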