Regularization in Deep Learning Models

Source: Deep Learning on Medium

Learn techniques like L2 regularization and Dropout for neural networks while solving an AI business use case


Deep learning models have so much flexibility and capacity that overfitting can be a serious problem if the training dataset is not big enough. The network may do well on the training set, yet fail to generalize to new examples that it has never seen!

Let’s tackle this using an interesting problem.

Problem Statement

You have just been hired as an AI expert by the French Football Corporation. They would like you to recommend positions where France’s goalkeeper should kick the ball so that the French team’s players can then hit it with their heads.

The goalkeeper kicks the ball in the air, and the players of each team fight to hit the ball with their heads.

They give you the following 2D dataset from France’s past 10 games.

Each dot corresponds to a position on the football field where a player headed the ball after the French goalkeeper kicked it from the left side of the field.

  • If the dot is blue, the French player managed to hit the ball with their head
  • If the dot is red, the other team’s player hit the ball with their head

Your goal: Use a deep learning model to find the positions on the field where the goalkeeper should kick the ball.


This dataset is a little noisy, but it looks like a diagonal line separating the upper left half (blue) from the lower right half (red) would work well.

We will first try a non-regularized model. Then we will learn how to regularize it and decide which model to choose to solve the French Football Corporation’s problem.

Non-Regularized Model

We will implement a three-layer neural network:


The individual functions it uses are defined in the utility scripts that accompany the final code provided at the end of this article.

To train the above model:

# Train the baseline model, then evaluate it on the training and test sets.
parameters = model(train_X, train_Y)
print("On the training set:")
predictions_train = predict(train_X, train_Y, parameters)
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

The output shows a train accuracy of 94.7% and a test accuracy of 91.5%. This is the baseline model, against which we will observe the impact of regularization.

Let’s plot the decision boundary of this model.

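import matplotlib.pyplot as plt

# Draw the decision boundary of the baseline model over the training data.
plt.title("Model without regularization")
axes = plt.gca()
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)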

You can see that the non-regularized model is overfitting the training set: it is fitting the noisy points!

Next, we will try to reduce this overfitting with regularization techniques. Let’s look at them.

What is Regularization?

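Consider first a logistic regression setup, where you minimize a cost function J to find the best values of w and b. L2 regularization adds a penalty to that cost that is proportional to the squared L2 norm of w and scaled by a regularization parameter λ.

Moving from logistic regression to a neural network, the penalty is applied to the weight matrices of every layer. In place of the L2 norm of a vector we use its matrix counterpart, the Frobenius norm.

With this modified cost function (objective function), back-propagation must include the gradient of the penalty term when you update the weights. Note that gradient descent lowers the cost step by step but does not guarantee that you reach the global minimum.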
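One standard way to write these cost functions is sketched below. The notation is an assumption, not taken from the article: m is the number of training examples, \mathcal{L} is the per-example loss, J_0 is the unregularized cost, n^{[l]} is the number of units in layer l, and L is the number of layers.

```latex
% Unregularized cost
J_0 = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)

% Logistic regression with L2 regularization
J(w, b) = J_0 + \frac{\lambda}{2m} \lVert w \rVert_2^2,
\qquad \lVert w \rVert_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^{\top} w

% Neural network: sum of squared Frobenius norms over all layers
J = J_0 + \frac{\lambda}{2m} \sum_{l=1}^{L} \lVert W^{[l]} \rVert_F^2,
\qquad \lVert W^{[l]} \rVert_F^2 = \sum_{k=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} \big(W^{[l]}_{kj}\big)^2

% Effect on back-propagation: one extra term per weight matrix
dW^{[l]} = \frac{\partial J_0}{\partial W^{[l]}} + \frac{\lambda}{m} W^{[l]}
```

The biases are usually left out of the penalty, since they are few compared with the weights.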

So, let’s add L2 regularization to our model.

L2 Regularized Model

Let’s now run the model with L2 regularization (λ=0.7). The model() function will call:

  • compute_cost_with_regularization instead of compute_cost
  • backward_propagation_with_regularization instead of backward_propagation
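The two functions live in the article’s utility scripts, which are not reproduced here. As a rough idea of what they add, here is a minimal NumPy sketch. The signature of compute_cost_with_regularization and the helper add_l2_gradient are assumptions made for illustration, as is the convention that weight matrices are stored under keys starting with "W".

```python
import numpy as np

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """Cross-entropy cost plus (lambd / 2m) * sum of squared Frobenius norms.

    Assumed signature: A3 holds the output activations, Y the 0/1 labels,
    both shaped (1, m); parameters maps "W1", "b1", ... to arrays.
    """
    m = Y.shape[1]
    # Unregularized binary cross-entropy, averaged over the m examples.
    cross_entropy = -np.sum(Y * np.log(A3) + (1 - Y) * np.log(1 - A3)) / m
    # L2 penalty: only the weight matrices are penalized, not the biases.
    squared_norms = sum(
        np.sum(np.square(W))
        for name, W in parameters.items()
        if name.startswith("W")
    )
    return cross_entropy + lambd / (2 * m) * squared_norms

def add_l2_gradient(dW, W, lambd, m):
    """Hypothetical helper: add the penalty's gradient to a weight gradient."""
    return dW + (lambd / m) * W

# Tiny made-up example: one weight matrix, two training examples.
parameters = {"W1": np.array([[1.0, -2.0]]), "b1": np.array([[0.5]])}
A3 = np.array([[0.8, 0.4]])
Y = np.array([[1, 0]])
plain_cost = compute_cost_with_regularization(A3, Y, parameters, lambd=0.0)
regularized_cost = compute_cost_with_regularization(A3, Y, parameters, lambd=0.7)
print(plain_cost, regularized_cost)
```

The regularized cost is always at least as large as the plain one, and the gap grows with the size of the weights.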

The test set accuracy increased to 93%, up from 91.5% for the baseline. The model is no longer overfitting the training data. Let’s plot the decision boundary.

L2 regularization makes your decision boundary smoother. If λ is too large, it is also possible to “over-smooth”, resulting in a model with high bias.

L2 regularization relies on the assumption that a model with small weights is simpler than a model with large weights. By penalizing the squared values of the weights in the cost function, you drive all the weights toward smaller values, because large weights become too costly. This leads to a smoother model in which the output changes more slowly as the input changes.
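To see why the weights shrink, it helps to substitute the penalty’s gradient into the gradient descent update. The sketch below assumes a learning rate α, m training examples, and an unregularized cost J_0, none of which are named in the article.

```latex
% Gradient of the regularized cost with respect to one weight matrix
dW^{[l]} = \frac{\partial J_0}{\partial W^{[l]}} + \frac{\lambda}{m} W^{[l]}

% Gradient descent update, regrouped
W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}
         = \Big(1 - \frac{\alpha \lambda}{m}\Big) W^{[l]} - \alpha \frac{\partial J_0}{\partial W^{[l]}}
```

Every step first multiplies the weights by a factor slightly below 1 and then applies the usual update, which is why L2 regularization is also called weight decay.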

Now let’s explore another regularization technique called Dropout.

Dropout Regularization

Dropout randomly shuts down some neurons in each iteration. You go through each layer of the network and set a probability of eliminating each node in that layer. The result is a much smaller network, and you run back-propagation on that smaller network for one example (observation). For the next example you randomly eliminate a different set of nodes and train the model on that diminished network.

To implement this dropout let’s use a technique called “Inverted Dropout”. Let’s illustrate this with an example with layer l = 3.

Inverted dropout divides the surviving activations by keep_prob, so the expected value of the activations, and therefore of the next layer’s input (here Z[4]), stays the same as without dropout. This also makes things easier at test time, because no extra scaling is needed when you evaluate the network.

At test time we don’t apply dropout, as we don’t want our output to be random.
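The layer-3 example can be sketched in a few lines of NumPy. The function names inverted_dropout and dropout_backward are hypothetical, and the activations are made-up stand-ins (all ones) so that the effect of the scaling is easy to see.

```python
import numpy as np

def inverted_dropout(A, keep_prob, rng, training=True):
    """Return (activations after dropout, mask). Hypothetical helper.

    At test time (training=False) the activations pass through unchanged.
    """
    if not training or keep_prob >= 1.0:
        return A, np.ones_like(A, dtype=bool)
    mask = rng.random(A.shape) < keep_prob  # True means "keep this unit"
    A = A * mask                            # shut down the dropped units
    A = A / keep_prob                       # rescale so the expected value is unchanged
    return A, mask

def dropout_backward(dA, mask, keep_prob):
    """Apply the same mask and scaling to the gradient of the activations."""
    return dA * mask / keep_prob

rng = np.random.default_rng(0)
a3 = np.ones((50, 1000))  # stand-in for the layer-3 activations
a3_train, d3 = inverted_dropout(a3, keep_prob=0.8, rng=rng)
a3_test, _ = inverted_dropout(a3, keep_prob=0.8, rng=rng, training=False)

print("fraction of units kept:", d3.mean())
print("mean activation in training:", a3_train.mean())
print("mean activation at test time:", a3_test.mean())
```

About 80% of the units survive, and because the survivors are divided by 0.8, the mean activation stays close to its original value of 1. That is the property that lets you skip any rescaling at test time.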

So, besides the regularizing effect of training a smaller network, the other intuition for dropout is that a unit can’t rely on any one feature, since any of its inputs can be knocked out at random. Randomly knocking out nodes therefore spreads the weights out and shrinks their squared norm.

One of the hyperparameters here is keep_prob, the probability of keeping a node. It can vary by layer. The underlying principle is that a layer with many hidden nodes should get a lower keep_prob, meaning you knock out more nodes in that layer so that the model does not overfit, and vice versa. For a layer that you don’t expect to overfit, keep_prob can be 1, which means you keep every unit and apply no dropout in that layer.

Dropout is very common in computer vision because the input is so large: every pixel is an input feature.

One downside of dropout is that the cost function J is no longer well defined, because in every iteration you randomly knock out some of the nodes. This makes gradient descent difficult to monitor. So it’s better to first turn dropout “off” and check that J decreases monotonically, and then turn dropout “on” to reduce the overfitting.

Let’s now run the model with dropout (keep_prob = 0.86). It means that at every iteration you shut down each neuron in layers 1 and 2 with 14% probability. The function model() will now call:

  • forward_propagation_with_dropout instead of forward_propagation.
  • backward_propagation_with_dropout instead of backward_propagation.

parameters = model(train_X, train_Y, keep_prob=0.86, learning_rate=0.3)
print("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

Dropout works great! The test accuracy has increased again (to 95%)! Your model is not overfitting the training set and does a great job on the test set.

plt.title("Model with dropout")
axes = plt.gca()
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

A common mistake when using dropout is to use it both in training and testing. You should use dropout (randomly eliminate nodes) only in training.

Note: Deep learning frameworks like TensorFlow, Keras, or Caffe come with a dropout layer implementation.


Comparing the three models, dropout regularization performs the best on the test set (95%, versus 93% with L2 regularization and 91.5% for the baseline). Note that regularization hurts training set performance! This is because it limits the ability of the network to overfit to the training set. But since it ultimately gives better test accuracy, it is helping your system and your French Football team.

Source Code

Other Regularization Techniques

  • Data Augmentation: Say you are training a cat classifier. Getting more data might be impossible or too expensive. What you can do instead is flip each image horizontally and add the flipped copies to the training set, which doubles its size.

You can also take random crops of those images.

By synthesizing images like this you help the trained model generalize better.

For Optical Character Recognition (OCR) you can also distort the digits.
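Flips and random crops are simple array operations. The sketch below uses NumPy on random arrays standing in for real photos, and it assumes images are stored as (height, width, channels). The function names and the 32×32 and 28×28 sizes are made up for illustration.

```python
import numpy as np

def horizontal_flip(image):
    """Mirror an image of shape (height, width, channels) left to right."""
    return image[:, ::-1, :]

def random_crop(image, crop_height, crop_width, rng):
    """Cut a random window of the requested size out of the image."""
    height, width = image.shape[:2]
    top = rng.integers(0, height - crop_height + 1)   # upper bound is exclusive
    left = rng.integers(0, width - crop_width + 1)
    return image[top:top + crop_height, left:left + crop_width, :]

rng = np.random.default_rng(0)
images = rng.random((4, 32, 32, 3))  # stand-in for four real 32x32 RGB photos

# Adding one flipped copy per image doubles the training set.
flipped = np.stack([horizontal_flip(img) for img in images])
augmented = np.concatenate([images, flipped], axis=0)

crop = random_crop(images[0], 28, 28, rng)
print(augmented.shape, crop.shape)
```

The labels of the flipped and cropped copies are the same as those of the originals, which is what makes these transformations safe for a cat classifier. That is not true for every task: a mirrored digit or letter may no longer be the same character.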

  • Early Stopping: This is another regularization technique. As you train your model using gradient descent, the cost function J on the training set should decrease monotonically. With early stopping you also plot the dev-set error alongside it. Typically this error decreases at first and then starts to increase after a while. You stop training part-way, at the point where the dev-set error is lowest, because beyond that point further training no longer improves performance on unseen data.

It works because, early in training, the weights w are still close to 0, since they were randomly initialized with small values. As you keep training (i.e. as the number of iterations increases), they grow larger and larger.

Stopping early therefore leaves W at a mid-sized value, which has an effect similar to L2 regularization, so the network will overfit less. The downside is that the stopping point is one more hyperparameter to optimize. You already have so many hyperparameters to tune that adding one more makes things more complicated. There is a concept called “Orthogonalization” in this context, where you focus on one task at a time, i.e. you use one method either to minimize the loss or to reduce overfitting. Early stopping takes a stab at both at once, which makes things overly complicated. So, the better way is to use L2 regularization and train your neural network as long as possible.
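For completeness, the stopping rule itself can be sketched as follows. The function name, the patience parameter, and the dev-set errors are all made up for illustration. A real training loop would compute the dev-set error after each epoch and save the weights whenever it improves.

```python
def early_stopping_iteration(dev_errors, patience=3):
    """Return the index of the iteration whose weights should be kept.

    Training stops once the dev-set error has failed to improve for
    `patience` consecutive iterations (hypothetical rule for illustration).
    """
    best_index = 0
    best_error = float("inf")
    for index, error in enumerate(dev_errors):
        if error < best_error:
            best_index, best_error = index, error
        elif index - best_index >= patience:
            break  # no improvement for `patience` iterations: stop training
    return best_index

# Made-up dev-set errors: they fall, then start rising again.
dev_errors = [0.40, 0.31, 0.25, 0.22, 0.23, 0.24, 0.26, 0.21]
stop_at = early_stopping_iteration(dev_errors, patience=3)
print("keep the weights from iteration", stop_at)  # iteration 3
```

Training stops before the last value (0.21) is ever seen, which shows how the choice of patience interacts with the result. That is one more setting to tune, which is the author’s objection to the technique.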


  1. Deep Learning Specialization by Andrew Ng & team.