One of the most common problems faced by machine learning and deep learning practitioners while building an ML model is “Overfitting”.

A Machine Learning model is said to be “overfitting” when it performs well on the training dataset, but the performance is comparatively poor on the test/unseen dataset.

Consider two students who are preparing for an exam by studying from the same book. Student 1 memorizes the questions and answers in the book without trying to understand the underlying concepts of the different topics. Student 2, in contrast, tries to grasp the concepts behind each topic rather than memorizing them.

Half of the questions in the exam paper are taken directly from the book, whereas the remaining questions are similar but are made tricky to test the understanding of the students.

In the above scenario, student 1 does not perform very well in the exam: he can answer the straightforward questions taken from the book, but he struggles with the remaining ones. In comparison, student 2 performs well on every question as he has a better understanding of the concepts that he learned from the book. Student 1 behaves like an overfitting model: strong on what was seen during training, weak on anything new.

An ML model is said to be “underfitting” when it performs poorly on both the train and the test datasets.

E.g., consider a student who neither memorized any questions nor tried to understand any of the concepts in the book. He is unable to perform well on any of the questions in the exam.

Below is a pictorial representation of Overfitting, Underfitting, and Best/Appropriate fitting.

Image Source: Google
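The three cases can also be seen in numbers. The sketch below is only an illustration under assumed settings: the noisy sine-curve data, the polynomial degrees 1, 3 and 9, and the use of NumPy's `Polynomial.fit` as the "model" are all choices made for this example and do not come from the image above. The lowest degree is meant to play the underfitting student, while the highest degree has enough freedom to memorize the training points.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n_points):
    # Assumed toy data: a sine curve plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, n_points)
    y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, n_points)
    return x, y

x_train, y_train = make_data(12)    # small training set
x_test, y_test = make_data(200)     # unseen data

def mse(model, x, y):
    # Mean squared error of the fitted polynomial on (x, y).
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree}: train MSE = {results[degree][0]:.4f}, "
          f"test MSE = {results[degree][1]:.4f}")
```

The training error can only go down as the degree grows, so a low training error alone says nothing about how well the model generalizes. The gap between the train and test errors is the number to watch.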

Regularisation techniques can help us to prevent overfitting of ML models. Different types of ML Regularisation techniques are:-

(I) L1 and L2 regularisation:-
In both L1 and L2 regularisation, the ML model is penalized for overfitting on train data, i.e. for using very large weights in an attempt to predict every train data point correctly.

In many Machine Learning techniques like Logistic Regression, Support Vector Machine, etc., we add a regularisation term (penalty term) to the “loss function” so that the model cannot drive the loss to zero on train data simply by inflating its weights.

Below is the “Logistic Loss” function which is the loss function in case of “Logistic Regression”:

Image Source:- stackoverflow.com
In the above image, the loss function is without the regularisation term.

The ML model aims at reducing the log loss to as low a value as possible.

If the loss function is without the regularisation term, then the ML model can increase the weight parameter “x” to a very high value (ideally infinity) to make the overall loss close to zero, which is possible whenever the training points can be separated perfectly. This results in overfitting of the ML model: it performs very well on the training set but generalizes poorly, which is what we want to avoid.
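To make this concrete, here is a minimal sketch on an assumed toy dataset: four 1-D points that a positive weight separates perfectly, with labels coded as -1 and +1. The weight that the text calls “x” is named `w` in the code, and the dataset and the function name `logistic_loss` are made up for this illustration.

```python
import math

# Assumed toy data: one feature, perfectly separable by any positive weight.
features = [-2.0, -1.0, 1.0, 2.0]
labels = [-1, -1, 1, 1]

def logistic_loss(w):
    """Mean of log(1 + exp(-y * w * x)) over the toy dataset."""
    total = sum(math.log1p(math.exp(-y * w * x)) for x, y in zip(features, labels))
    return total / len(features)

# The loss keeps shrinking towards zero as the weight grows.
for w in (1.0, 10.0, 100.0):
    print(f"w = {w:6.1f}   loss = {logistic_loss(w):.3e}")
```

Without a penalty there is no finite weight at which the loss stops improving on this data, so an optimiser has no reason to stop growing it.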

L2 regularisation:-
To avoid overfitting we add a regularisation term as shown below:

Image Source:- stackoverflow.com
The 2nd term in the loss function is the “L2” regularisation term. Here, the “squared magnitude of weight parameter”, multiplied by lambda (which is the hyperparameter to be tuned while building the model), is added to the logistic loss function.

L2 regularisation is one of the most widely used and proven regularisation techniques used by ML practitioners that helps us to build robust ML models that are able to generalize well.

If the weight coefficient “x” is made high to reduce the 1st term in the loss function, then the second term will increase, thereby preventing the overall loss function value from becoming zero. This way, the regularisation term penalizes the model for trying to fit the training dataset points too closely.

Features of L2 regularisation:-

L2 regularisation, known as “Ridge regression” when applied to linear regression, often performs better than L1 regularisation, although the better choice depends on the dataset.
The less important features are shrunk but are not made zero.
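The effect of the L2 term can be sketched on a small example. Everything here is an assumption made for illustration: the four-point separable toy dataset, the value `lam = 0.01` for lambda, and the simple grid search used in place of a real optimiser. The weight is named `w` in the code.

```python
import math

# Assumed toy data: one feature, labels coded as -1 / +1, perfectly separable.
features = [-2.0, -1.0, 1.0, 2.0]
labels = [-1, -1, 1, 1]

def logistic_loss(w):
    total = sum(math.log1p(math.exp(-y * w * x)) for x, y in zip(features, labels))
    return total / len(features)

def l2_regularised_loss(w, lam):
    # Logistic loss plus lambda times the squared weight.
    return logistic_loss(w) + lam * w ** 2

lam = 0.01
candidate_weights = [i / 10 for i in range(0, 1001)]   # 0.0, 0.1, ..., 100.0
best_w = min(candidate_weights, key=lambda w: l2_regularised_loss(w, lam))

print("best weight with L2 penalty:", best_w)
print("regularised loss at best weight:", l2_regularised_loss(best_w, lam))
print("regularised loss at w = 100:", l2_regularised_loss(100.0, lam))
```

With the penalty in place the best weight is a small finite value instead of the largest one on the grid, because beyond that point the growth of the squared term outweighs the gain in the logistic loss.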
L1 Regularisation:-
Below is the loss function with L1 regularisation term added in it:

Image Source:- stackoverflow.com
Here, the “absolute value of weight parameter”, multiplied by lambda (which is the hyperparameter), is added to the loss function.

Similar to L2 regularisation, if the weight coefficient “x” is made high to reduce the value of the 1st term in the loss function, then the second term, i.e. the L1 regularisation term, will increase, thereby preventing the overall loss function from becoming zero.

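L1 regularisation penalizes large weights less than L2 regularisation, as it uses absolute values rather than the squared values of weight parameters in the loss function. For weights with a magnitude below 1 the opposite holds: the absolute value is larger than the square, so L1 keeps pushing small weights towards zero.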

Features of L1 regularisation:-

L1 regularisation, known as Lasso Regression when applied to linear regression, shrinks the weights of the less important features to exactly zero, unlike L2 regularisation.
Thus, L1 performs internal feature selection. Because of this, it is preferred in applications where we have some kind of hard cap on the number of features we can use.
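The difference between "shrunk" and "made zero" can be shown with a simplified one-weight problem. This sketch is an assumption-laden illustration, not the logistic-loss setting used earlier: it uses a squared-error term `0.5 * (w - z)**2`, where `z` is the weight the model would pick without any penalty, because that case has known closed-form minimisers. The function names and the example values are hypothetical.

```python
import math

def l1_solution(z, lam):
    """Minimiser of 0.5 * (w - z)**2 + lam * abs(w), i.e. soft-thresholding."""
    magnitude = max(abs(z) - lam, 0.0)
    return math.copysign(magnitude, z) if magnitude > 0 else 0.0

def l2_solution(z, lam):
    """Minimiser of 0.5 * (w - z)**2 + 0.5 * lam * w**2, i.e. proportional shrinkage."""
    return z / (1.0 + lam)

unregularised_weights = [3.0, -1.5, 0.4, -0.2]   # assumed example values
lam = 0.5

for z in unregularised_weights:
    print(f"z = {z:5.2f}   L1 -> {l1_solution(z, lam):6.3f}   "
          f"L2 -> {l2_solution(z, lam):6.3f}")
```

Under these assumptions, L1 subtracts a fixed amount from every weight and sets anything smaller than lambda to exactly zero, while L2 scales every weight by the same factor and never reaches zero.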
Elastic-Net Regularisation:-
Elastic-Net Regularisation is a combination of both L1 and L2 regularisation. It can be represented as shown below:

Source: stats.stackexchange.com
Alpha in the above formula is the same as the term lambda used in the case of L1 and L2 regularisation formula.

Because Elastic-Net applies both penalty terms, the overall penalty depends on how the two terms are weighted. When both terms are added at full strength, the penalty applied to the ML model is more than with L1 or L2 regularisation alone.
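As a rough sketch, the combined penalty can be written as a small function. The parametrisation below is an assumption: it follows the one documented for scikit-learn's `ElasticNet` estimator, where `alpha` is the overall strength and `l1_ratio` mixes the two terms, and the formula in the image above may be arranged differently. The function name `elastic_net_penalty` is made up for this example.

```python
def elastic_net_penalty(weights, alpha, l1_ratio):
    """alpha * (l1_ratio * sum|w| + 0.5 * (1 - l1_ratio) * sum(w^2))."""
    l1_term = sum(abs(w) for w in weights)
    l2_term = sum(w ** 2 for w in weights)
    return alpha * (l1_ratio * l1_term + 0.5 * (1.0 - l1_ratio) * l2_term)

weights = [1.0, -2.0]   # assumed example weights

print(elastic_net_penalty(weights, alpha=1.0, l1_ratio=1.0))   # pure L1 term
print(elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.0))   # pure (halved) L2 term
print(elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.5))   # equal mix of both
```

In this form, setting `l1_ratio` to 1 or 0 recovers the pure L1 or pure L2 penalty, and any value in between mixes the two.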

Features of Elastic Net Regularisation:-

Elastic-net is a compromise between the L1 and L2 regularisation that attempts to shrink and do a sparse selection simultaneously.
SKLearn’s implementations of different ML algorithms have a parameter called “penalty”, where we can specify which of the 3 regularisation techniques mentioned above we want to use while training the model. Below are some images from SKLearn’s SGD Classifier documentation which show the default value of the “penalty” parameter and also the available options.

Image source:- https://scikit-learn.org/stable/modules
Image source:- https://scikit-learn.org/stable/modules
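The “penalty” parameter can be tried out directly. The snippet below is a minimal sketch that assumes scikit-learn is installed. The synthetic dataset from `make_classification` and the value `alpha=0.01` are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Assumed synthetic data: 20 features, of which only 5 are informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

print("default penalty:", SGDClassifier().penalty)

zero_weights = {}
for penalty in ("l2", "l1", "elasticnet"):
    # alpha is the regularisation strength (the lambda of the text above).
    clf = SGDClassifier(penalty=penalty, alpha=0.01, random_state=0)
    clf.fit(X, y)
    zero_weights[penalty] = int((clf.coef_ == 0).sum())
    print(f"penalty = {penalty:10s}  weights that are exactly zero: "
          f"{zero_weights[penalty]}")
```

Comparing the number of zero weights across the three runs is a quick way to see how much feature selection each penalty performs on a given dataset.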
(II) Data Augmentation:-
Although not as widely discussed as other regularisation techniques, Data Augmentation can help us to reduce overfitting.

A dataset with few data points but a high number of features is more prone to overfitting. Data Augmentation refers to adding more relevant data to the training set, often by generating modified copies of the existing data points, so that the total number of data points used for training the model increases and is sufficient for the model to understand the underlying pattern in the data and generalize well.

However, the process of collecting data is costly and time-consuming. Also, data relevant to the problem we are solving is not always easy to obtain.
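When collecting new data is not practical, one common workaround is to create extra training points from the ones we already have. The sketch below assumes image-like data stored as a NumPy array and uses horizontal flips. The helper `augment_with_flips` and the random "images" are hypothetical, and flipping only makes sense for tasks where a mirrored input keeps the same label.

```python
import numpy as np

def augment_with_flips(images, labels):
    """Return the original images plus horizontally flipped copies."""
    flipped = images[:, :, ::-1]            # reverse the width axis
    all_images = np.concatenate([images, flipped])
    all_labels = np.concatenate([labels, labels])   # a flip keeps the label
    return all_images, all_labels

rng = np.random.default_rng(0)
images = rng.random((4, 8, 8))              # four assumed 8x8 grayscale images
labels = np.array([0, 1, 0, 1])

aug_images, aug_labels = augment_with_flips(images, labels)
print("before:", images.shape, "after:", aug_images.shape)
```

The training set doubles in size without any new data collection, at the cost of the extra points being closely related to the originals.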

(III) Dropout:-
Dropout is one of the most widely used regularisation techniques in Deep Learning.

Dropout, as the name suggests, is based on the process of “randomly dropping nodes” in a neural network. We specify a probability value for dropout which indicates the probability of a node getting dropped in each iteration.

Image source: Wikipedia
Suppose we specify the probability of a node getting dropped as 0.5 (i.e., a flip of a coin). In every iteration, some of the nodes from both the input and the hidden layers are dropped, resulting in a simpler neural network that makes decisions based on the available nodes only. As fewer nodes are available in each iteration, the computation time at each iteration can be reduced. This is similar to the ensemble models in ML (like Random Forests and GBDT), which combine multiple weak learners to predict the output.

Image Source: researchgate.net
Thus, using neural networks with different sets of nodes in each iteration prevents the model from relying too heavily on any single node and usually performs better than using a single, fully connected neural network.

Source :- https://machinelearningmastery.com
In the above image, the value of “0.2” in Dropout represents the probability of each node getting dropped. This probability is a hyperparameter that must be tuned to obtain the best results.
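What happens to a layer's outputs under dropout can be sketched in a few lines of NumPy. This is not the code of any particular deep learning library. It is a minimal illustration that assumes the commonly used "inverted dropout" variant, in which the surviving nodes are scaled up by 1 / (1 - rate) during training so that nothing has to change at prediction time. The function name `dropout` and its arguments are chosen for this example.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` and rescale the survivors."""
    if not training or rate == 0.0:
        return activations                  # no nodes are dropped at prediction time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob   # True = node is kept
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones(1000)                      # assumed outputs of a hidden layer

dropped = dropout(hidden, rate=0.2, rng=rng)
print("fraction of nodes dropped:", float(np.mean(dropped == 0.0)))
print("value of surviving nodes:", float(dropped.max()))
```

A fresh mask is drawn on every call, so each training iteration sees a different sub-network, which is the source of the ensemble-like behaviour described above.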

(IV) Early Stopping:-
Early Stopping is another very widely used regularisation technique to avoid overfitting while building ML and Deep Learning models. As the name suggests, we “stop early” during the training phase, before the model starts overfitting on the training dataset.

Here, we use a validation set along with the training set, and we monitor both the training and validation errors before deciding on when the model will stop training further.

Source:- fouryears.eu
In the above image, the model will stop training at the “blue line”, because after the blue line the CV error starts increasing whereas the training error continues to decrease, resulting in overfitting.

Source :- https://machinelearningmastery.com
In the above image, the “monitor” value denotes the metric that will be monitored during the training phase to decide on when the model will stop training. Here, we are monitoring the “validation loss” during the training phase. The value of “patience” indicates how many consecutive iterations (epochs) without improvement in the “validation error” the model will wait before it stops training.
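The role of “patience” can be sketched without any deep learning library. The function below is a simplified illustration, not the implementation of any framework's callback: it takes a list of validation losses (one per epoch, made up for this example) and reports the epoch at which training would stop and the epoch with the best validation loss.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation loss improved: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch    # patience exhausted, stop here
    return len(val_losses) - 1, best_epoch  # never triggered, ran all epochs

# Assumed validation losses: improving at first, then getting worse.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70, 0.80]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print("stopped at epoch", stop_epoch, "with best epoch", best_epoch)
```

A larger patience tolerates short plateaus in the validation loss at the cost of a few extra epochs, so in practice the weights from the best epoch are usually the ones kept.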

— — — — — — — — — — X — — — — — — — — — -X — — — — — — — — — —

The above mentioned techniques are some of the most widely used ones in ML and DL that have helped ML practitioners in building robust ML models that are able to generalize well.

My next blog will be on “Performance metrics in ML and DL”, where we will dive deep into the details of some of the most commonly used performance metrics and discuss the pros and cons of each of them.

Please share your feedback and questions.