Understanding Regularization Techniques in ML and DL

Original article can be found here (source): Deep Learning on Medium

A very simple yet deep insight into obtaining the perfect model by employing techniques that save both computational power and man-hours invested.

The overview:

Now that industries are starting to accept “Artificial Intelligence” as an important part of predicting their company’s success, Machine Learning and Deep Learning skills are making their way into companies’ job profiles. But the actual decision makers in a company (the people calling the shots: CxOs) often have a misguided notion of what these techniques can do and how their company can benefit. People who don’t fully understand ML tend to see it as a technology that can solve any and all industrial problems. The following picture makes the current state of ML quite clear.

This is not satire but a quite accurate understanding of ML. It is the ability of computers to predict events from a set of precursor events, without the rules being hard-coded. I will try to keep this blog as non-mathy as possible, but here I would like to include the fact that ML, in its essence, is the act of estimating the functional relation F between x and y, that is y = F(x), given many example (x, y) pairs.

But many a time, even after a model has been trained to an acceptable training accuracy, it fails miserably when put to work on the test cases.

This happens due to the phenomenon of overfitting, where the function approximates the training data too closely. Instead of learning the general idea of how to solve the problem, the model memorizes the training data by rote. The following picture makes it clear.

The real function is a sinusoid (green) and we are trying to predict it from the data given. Up to the third figure, we see the model learning very well. It is an almost perfect approximation of the function even though it does not pass through every data point. But as training continues, we see the function moulding itself to fit all data points and taking a form quite different from what is desired. This is overfitting: the training loss is driven to zero but the test loss rises.
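To see the same effect in numbers, here is a minimal sketch. It is not the code behind the figures: I am assuming a noisy sine wave as the data and using polynomial degree as a stand-in for "more training", and the names `true_fn`, `mse` and `results` are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # The "real" function we are trying to recover (assumed: one sine period)
    return np.sin(2 * np.pi * x)

# 10 noisy training points and a dense, noise-free test grid
x_train = np.linspace(0.0, 1.0, 10)
y_train = true_fn(x_train) + rng.normal(0.0, 0.2, size=x_train.shape)
x_test = np.linspace(0.0, 1.0, 200)
y_test = true_fn(x_test)

def mse(y_pred, y_true):
    return float(np.mean((y_pred - y_true) ** 2))

results = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit; a higher degree means a more flexible model
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_err = mse(np.polyval(coeffs, x_train), y_train)
    test_err = mse(np.polyval(coeffs, x_test), y_test)
    results[degree] = (train_err, test_err)
    print(f"degree {degree}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```

The degree-9 polynomial has as many coefficients as there are training points, so it can pass through every one of them and push the training error to essentially zero. Compare its test error with the degree-3 fit to see what that memorization costs.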

Understanding Bias-Variance Tradeoff and the need for Regularization:

Mathematically, bias is the difference between the expected value of the model’s prediction and the actual value. We won’t be going into the underlying statistics of bias but I will, responsibly, leave you with a scary-looking equation:
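In symbols, the usual textbook form looks like this. The notation is an assumption on my part: f is the true function, f-hat is the model’s prediction, and the expectation runs over models trained on different training sets.

```latex
\mathrm{Bias}\big[\hat{f}(x)\big] = \mathbb{E}\big[\hat{f}(x)\big] - f(x)
```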

To make things clear, the bias of a simple, linear model is high and that of a complex, multidimensional model, is low. This is because a complex model is better at fitting all the training data.

Variance measures how much a model’s predictions change when it is trained on a different training set; in practice it shows up as a gap between training accuracy and test accuracy. Error due to variance is the amount by which the prediction from one training set differs from the expected prediction over all training sets. In other words, how far apart are the predictions that the model makes when trained on different data? Another equation follows to scare you folks off.
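Again in the usual textbook form, with the same assumed notation (f-hat is the model’s prediction and the expectation runs over different training sets):

```latex
\mathrm{Var}\big[\hat{f}(x)\big] = \mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}\big[\hat{f}(x)\big]\big)^{2}\Big]
```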

A simple model has a low variance whereas a complex one has a high variance.
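If you would rather measure than take my word for it, here is a rough sketch that estimates both quantities by retraining many times. The setup is assumed purely for illustration: a sine wave as the true function, training sets that differ only in their noise, a straight line as the "simple" model and a degree-9 polynomial as the "complex" one. The helper `bias_variance` is a made-up name.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_fn(x):
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0.0, 1.0, 15)
x_test = np.linspace(0.0, 1.0, 50)
n_trials = 300      # number of independent training sets
noise_std = 0.2

def bias_variance(degree):
    """Estimate squared bias and variance of a polynomial model of given degree."""
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        # A fresh training set: same inputs, new noise on the targets
        y_train = true_fn(x_train) + rng.normal(0.0, noise_std, size=x_train.shape)
        coeffs = np.polyfit(x_train, y_train, deg=degree)
        preds[t] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)
    # Squared gap between the average prediction and the truth
    bias_sq = float(np.mean((mean_pred - true_fn(x_test)) ** 2))
    # Spread of the predictions around their own average
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

simple_bias_sq, simple_var = bias_variance(1)
complex_bias_sq, complex_var = bias_variance(9)
print(f"line:     bias^2 = {simple_bias_sq:.4f}, variance = {simple_var:.4f}")
print(f"degree 9: bias^2 = {complex_bias_sq:.4f}, variance = {complex_var:.4f}")
```

The straight line barely changes from one training set to the next but is consistently wrong about the sine wave; the degree-9 polynomial is right on average but jumps around with the noise.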

The following graph can be used to establish the concepts of Bias and Variance clearly. The left end of the graph is the zone with high bias as both the training and the testing error are high. This is the zone of underfitting, or the zone where the model has not learnt enough.

The right end of the graph is an area of high variance, where the training error is low but the testing error is high. This is the zone of overfitting, where we see that even though the model has achieved a high training accuracy, and it seems like the model is near perfect, it performs poorly on test data. This is a sheer waste of computational power and the engineer’s time.

The middle zone, where both bias and variance are low even though neither is at its lowest possible value, is the best zone for a model. Balancing the two to reach this state is known as the Bias-Variance Tradeoff.

There are various methods for achieving a good Bias-Variance Tradeoff. These methods are known as Regularization Techniques.

Some common ones are:

  • L2 Regularization
  • Early Stopping
  • Dataset Augmentation
  • Ensemble methods
  • Dropout
  • Batch Normalization

L2 Regularization:

Keeping things as simple as possible, I would define L2 Regularization as “a trick to not let the model drive the training error to zero”. If only things were that simple…

While training a model, we continuously update the various variables (weights and biases; w & b) that try to approximate our original function. This update takes place based on an “update rule” like Gradient Descent (which we won’t talk about). This update rule depends on the “loss function”, which is a function of these variables. If things are getting complicated, bear with me. Our aim is to minimise this “loss function”. And that’s quite intuitive, isn’t it? In any profitable industrial situation, you strive to minimise the loss. Simple, innit?

So, we minimise the loss function during training. What is special about the L2 technique is that instead of minimising the training loss itself, we minimise a modified version of it: the training loss plus a penalty term.
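Written out, the modified loss usually takes the following form. The symbols are my assumptions: L is the original training loss, L-tilde is the modified one, and alpha is a small positive number you choose that controls how strong the penalty is.

```latex
\tilde{L}(w, b) = L(w, b) + \alpha \sum_{i} w_i^{2}
```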

Here, the last term, the one added to the training loss, is called the L2 penalty (or weight decay) by the math folks. For us laymen, it is the sum of squares of all the weights (w). Here again, responsibly, I leave you with:
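In the usual notation (assumed here, with n weights in total), the sum of squares is the squared L2 norm of the weight vector, which is where the technique gets its name:

```latex
\sum_{i=1}^{n} w_i^{2} = w_1^{2} + w_2^{2} + \dots + w_n^{2} = \lVert w \rVert_2^{2}
```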

This minor adjustment (it is, believe me) prevents the training error from going to zero and keeps it near the “sweet spot”.
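Here is a small sketch of what that adjustment does to plain gradient descent. Everything in it is assumed for illustration: a toy linear model with no bias term, mean squared error as the training loss, and `alpha` as the penalty strength. The only difference between the two runs is the extra `grad_penalty` term.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data: 50 samples, 5 features
X = rng.normal(size=(50, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(0.0, 0.5, size=50)

def train(alpha, lr=0.05, steps=2000):
    """Gradient descent on MSE + alpha * sum(w**2)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        residual = X @ w - y
        grad_loss = 2.0 * X.T @ residual / len(y)   # gradient of the MSE
        grad_penalty = 2.0 * alpha * w              # gradient of alpha * sum(w**2)
        w -= lr * (grad_loss + grad_penalty)
    return w

def training_mse(w):
    return float(np.mean((X @ w - y) ** 2))

w_plain = train(alpha=0.0)   # ordinary training
w_l2 = train(alpha=1.0)      # training with the L2 penalty

print("plain: |w| =", np.linalg.norm(w_plain), " train MSE =", training_mse(w_plain))
print("L2:    |w| =", np.linalg.norm(w_l2), " train MSE =", training_mse(w_l2))
```

The penalized run ends up with smaller weights and a slightly higher training error, which is exactly the point: the model is held back from chasing the training data all the way down.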

Early Stopping:

This is, by far, the simplest regularization technique (well, all of them are, but you wouldn’t believe me, would you?). While going through the training process, we record the values of w & b at which we obtain the lowest validation error, and we stop training when we see the validation error rising again. This is a very useful procedure, but the downside is that when training very deep neural networks or very complex models, repeatedly writing and rewriting these best values costs extra memory and processing time.
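The procedure can be sketched in a few lines. The function and parameter names here (`train_with_early_stopping`, `train_step`, `validate`, `patience`) are hypothetical, and the "model" is a single number w with a made-up validation loss, just so the loop has something to run on. Waiting for `patience` bad epochs before stopping is a common convention that I am assuming, since validation error tends to wobble.

```python
import copy

def train_with_early_stopping(train_step, validate, params, max_epochs=100, patience=3):
    """Keep the parameters with the lowest validation loss seen so far."""
    best_loss = float("inf")
    best_params = copy.deepcopy(params)
    best_epoch = -1
    bad_epochs = 0
    for epoch in range(max_epochs):
        params = train_step(params)
        val_loss = validate(params)
        if val_loss < best_loss:
            # New minimum: record (checkpoint) the current values
            best_loss, best_epoch = val_loss, epoch
            best_params = copy.deepcopy(params)
            bad_epochs = 0
        else:
            # Validation error is rising; stop once it has done so for long enough
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_params, best_epoch, epoch

# Toy setup: every training step nudges w upwards, and the validation loss
# is lowest at w = 1, after which it starts to rise again.
def toy_train_step(params):
    return {"w": params["w"] + 0.1}

def toy_validate(params):
    return (params["w"] - 1.0) ** 2

best_params, best_epoch, stop_epoch = train_with_early_stopping(
    toy_train_step, toy_validate, {"w": 0.0}, patience=3
)
print("best epoch:", best_epoch, "stopped at epoch:", stop_epoch, "best w:", best_params["w"])
```

The `copy.deepcopy` call is the "writing and rewriting" cost mentioned above: cheap for one number, not so cheap for a network with millions of weights.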

Dataset Augmentation:

Training a model to a good prediction state is only possible when we have a lot of data to train it on. In other words, it is quite easy to drive the training error to zero if there is too little data. Let’s take the example of training a neural network on image classification. Say we have 1000 images to train the model on. Wouldn’t it be better if we had, say, 3000 images to train it on? Without procuring extra data, we can easily “augment” the current images and create “new” ones. These are not, in fact, new to us, but to the model they are as new as they come.

So what is augmentation? It is changing your image in various ways such that its label identification property is not lost. Simply put, it is flipping, rotating, zooming, color inverting and all other things your creativity allows that let the image of a kitten remain a kitten. Here’s some cuteness to cancel out your boredom from reading this blog.

So now we have more data to feed our model, which makes it more difficult for it to memorize the entire thing, and therefore the training error isn’t driven to zero. Kinda like your history test, innit?
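A bare-bones version of this with NumPy might look like the sketch below. The images are random pixels standing in for real photos, and `augment` is a made-up helper covering only three of the transformations mentioned (flip, rotate, color invert); real pipelines usually apply such transforms randomly on the fly rather than storing every copy.

```python
import numpy as np

def augment(image):
    """Return label-preserving variants of an H x W x C uint8 image."""
    return [
        np.fliplr(image),        # horizontal flip
        np.rot90(image, k=1),    # rotate by 90 degrees
        255 - image,             # colour inversion
    ]

rng = np.random.default_rng(0)

# Stand-in dataset: 10 random 32x32 RGB "photos", all labelled as kittens
images = [rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8) for _ in range(10)]
labels = ["kitten"] * 10

aug_images, aug_labels = [], []
for img, label in zip(images, labels):
    # Keep the original and add every variant under the same label
    for variant in [img] + augment(img):
        aug_images.append(variant)
        aug_labels.append(label)

print(len(images), "images became", len(aug_images))
```

Each original contributes itself plus three variants, so the dataset grows fourfold without a single new photo being taken.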

Ensemble Methods:

Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance, bias, or improve predictions.

The above paragraph is Google’s definition of Ensemble Methods, and I’ll try to break it down for you. In this technique, we employ multiple model architectures to predict an output, be it classification or regression. Let’s say models A, B and C are given the task of classifying a dog: model A says it’s a cat, but B and C say it’s a dog. So if we believe what the majority says, we arrive at the correct output, but if we were to trust the output of the first model alone, we would have erred. It is similar with regression, or value prediction: we take the weighted average of the predictions given by the 3 models to arrive at our final output. This decreases the chance of an error and improves accuracy.
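Both ways of combining the models fit in a few lines. This is only a sketch of the combining step: the three models’ outputs are hard-coded, and `majority_vote` and `weighted_average` are made-up helper names.

```python
from collections import Counter

def majority_vote(labels):
    """Classification: return the label predicted by the most models."""
    return Counter(labels).most_common(1)[0][0]

def weighted_average(predictions, weights):
    """Regression: combine numeric predictions, trusting some models more."""
    total = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total

# Models A, B and C look at the same picture of a dog
votes = ["cat", "dog", "dog"]
final_label = majority_vote(votes)

# Models A, B and C predict a value; C is trusted twice as much
final_value = weighted_average([10.0, 12.0, 14.0], [1, 1, 2])

print(final_label, final_value)
```

How the weights are chosen is up to you; giving more weight to the models that do better on validation data is one common option.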

The interesting part about this is that we needn’t even spend resources on 3 different model architectures. We could train the same model 3 times on different batches of the data. That would serve the purpose as well. But you get the idea, don’t you?
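One common way to do this is known as bagging, where each copy of the model is trained on a random resample of the data. The sketch below assumes that approach; the straight-line model, the toy data and the names `fit_line`, `members` and `ensemble_predict` are all just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: a noisy straight line y = 3x + 1
x = np.linspace(0.0, 1.0, 40)
y = 3.0 * x + 1.0 + rng.normal(0.0, 0.3, size=x.shape)

def fit_line(xs, ys):
    """The single model architecture we reuse: a least-squares straight line."""
    slope, intercept = np.polyfit(xs, ys, deg=1)
    return slope, intercept

# Train the same model 3 times, each time on a different resample of the data
members = []
for _ in range(3):
    idx = rng.integers(0, len(x), size=len(x))   # sample indices with replacement
    members.append(fit_line(x[idx], y[idx]))

def ensemble_predict(x_new):
    # Average the predictions of all trained copies
    preds = [slope * x_new + intercept for slope, intercept in members]
    return np.mean(preds, axis=0)

print(ensemble_predict(np.array([0.5])))
```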


Dropout:

Dropout is also classified in the category of an ensemble method. But I, for fun, think it to be the reverse of that. In ensemble methods you ‘ask’ the opinion of other models to arrive at a conclusion but here, it is basically silencing other contributors. Let me make it clear.

This is a very simple neural network whose purpose is to be a True/False classifier. Look at the output layer (green). It has 2 blobs, one that gives the probability that the output is True and the other False. The sum of the two values: you guessed it: 1! Aren’t you smart? XD.

The idea here is to make you understand that these ‘blobs’ are called nodes. Each of these nodes has a ton of complex calculations happening inside it. Remember the stuff I was talking about in L2 Regularization? It all happens here. So these nodes are the actual contributors to the output.

Dropout involves turning off certain nodes randomly. This changes the architecture of the model and the way information flows through the nodes. Doing this makes the model a more robust predictor. The model has to predict the same outputs with some of its contributors turned off. That’s like saying you need to get through your quiz without your topper friends being around. You gotta learn. Get it? XD.
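Here is what "turning off nodes" can look like on one layer’s outputs. This is a sketch with NumPy rather than the PyTorch code from the repository, and `dropout` is a made-up function. It assumes the common "inverted dropout" convention, where the surviving nodes are scaled up during training so that nothing needs to change at prediction time.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Randomly silence nodes during training; do nothing at prediction time."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    # True for nodes that stay on, False for nodes that are turned off
    mask = rng.random(activations.shape) < keep_prob
    # Scale the survivors so the expected total signal stays the same
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones(10000)   # pretend outputs of 10000 nodes

train_out = dropout(activations, drop_prob=0.5, rng=rng, training=True)
eval_out = dropout(activations, drop_prob=0.5, rng=rng, training=False)

print("fraction of nodes silenced:", float(np.mean(train_out == 0.0)))
print("mean activation while training:", float(train_out.mean()))
```

A fresh random mask is drawn on every call, so each training step sees a slightly different network, which is where the ensemble flavour comes from.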

The Conclusion:

So that sums up my blog on regularization techniques. I intentionally did not provide you with information on Batch Normalization as that would have required me to give you the entire process of training a neural network and that would have gone against the main idea behind this blog: keeping things simple.

If you are itching to know how to code these in Python using PyTorch, refer to the following repository on GitHub. The batchnorm_dropout.ipynb file will be of interest. I will be uploading TensorFlow files to another repo as well, to have the code in both these frameworks.


I’ve had an amazing time writing this out for you folks and I hope you could take away something from this. If you liked it, leave a clap. If you didn’t, you probably would’ve left the page long back. And if you have any queries, please feel free to comment below. I’ll be looking forward to clearing your doubts.

I love making new friends, so here is my LinkedIn ID. Please connect if you wanna chat or even if you don’t. XD.