Solve the Doughnut Problem

Learn how to initialize the weights of a neural network

Introduction

The idea behind writing this article is twofold.

  • First, to solve an interesting problem from start to end.
  • Second, to learn the theory, maths and intuition behind it while solving the problem. In this article we will additionally focus on how to initialize the weights of a neural network.

So, let’s dive into the problem straightaway.

The final code is provided in the last section of this article; section-wise code snippets are provided throughout.

Problem Statement

Say you have a distribution of observations (with features x1 and x2) that looks like a doughnut. How can you separate the inner ring from the outer ring?

The doughnut problem

Create a classifier to separate the blue dots from the red dots.

This data can be generated as below –

import numpy as np
import sklearn.datasets
import matplotlib.pyplot as plt

def load_dataset():
    np.random.seed(1)
    train_X, train_Y = sklearn.datasets.make_circles(n_samples=300, noise=.05)
    np.random.seed(2)
    test_X, test_Y = sklearn.datasets.make_circles(n_samples=100, noise=.05)
    # Visualize the data
    plt.scatter(train_X[:, 0], train_X[:, 1], c=train_Y, s=40, cmap=plt.cm.Spectral)
    # Reshape so that examples are stacked as columns: X is (2, m), Y is (1, m)
    train_X = train_X.T
    train_Y = train_Y.reshape((1, train_Y.shape[0]))
    test_X = test_X.T
    test_Y = test_Y.reshape((1, test_Y.shape[0]))
    return train_X, train_Y, test_X, test_Y
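
Calling this loader gives the following shapes (a quick sanity check, assuming the imports above; not part of the original snippet):

train_X, train_Y, test_X, test_Y = load_dataset()
print(train_X.shape)  # (2, 300) -> 2 features, 300 training examples
print(train_Y.shape)  # (1, 300) -> one label per example (0 = red, 1 = blue)
print(test_X.shape)   # (2, 100)
print(test_Y.shape)   # (1, 100)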

Step — 1 : Solution Approach & Understanding all the functions needed for the model

Here, we will implement a three-layer Neural Network —

(Linear -> ReLU) -> (Linear -> ReLU) -> (Linear -> Sigmoid)

The shape of input argument X is (2, number of examples)

The shape of the output Y is (1, number of examples). “Y” is the true “label” vector containing 0 for red-dots and 1 for blue-dots.

  • As we are building a three-layer neural network, let's select the layer dimensions, i.e. how many nodes there will be in the input, hidden and output layers. The input layer will have 2 nodes (as there are two features, x1 & x2) –
layers_dims = [X.shape[0], 10, 5, 1]
  • Now, we will initialize the weights based on layers_dims. In our case, we would like to evaluate various initialization techniques and choose the one that performs best. The ones we will try are –
"zeros", "random" & "he"
  • Finally, we will iterate for as many epochs as we want to train the model, so that we converge to good parameters (in our case, the weights & biases). Each iteration needs the following steps, in this order (see the sketch after this list) –

1. Perform forward propagation through the network we have defined, i.e. (Linear -> ReLU) -> (Linear -> ReLU) -> (Linear -> Sigmoid).
2. Compute the loss (using the loss function we will define).
3. Perform backward propagation to obtain the gradients of the loss with respect to the model parameters (weights & biases).
4. Use those gradients to update the parameters (weights & biases), so that each step reduces the loss and we hopefully approach the global minimum.
  • After the model has been trained for that many iterations (epochs), we obtain the learned parameters (weights & biases), which we can use to build a classifier that separates the blue dots from the red dots.
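
Putting those four steps together, a minimal sketch of what the model() training loop could look like (the initialize_parameters_*() names match the sketches in the later sections, and the learning rate and iteration count are illustrative choices, not fixed by the problem):

def model(X, Y, learning_rate=0.01, num_iterations=15000, initialization="he"):
    layers_dims = [X.shape[0], 10, 5, 1]

    # Step 0: pick one of the initialization schemes we will experiment with
    if initialization == "zeros":
        parameters = initialize_parameters_zeros(layers_dims)
    elif initialization == "random":
        parameters = initialize_parameters_random(layers_dims)
    elif initialization == "he":
        parameters = initialize_parameters_he(layers_dims)

    for i in range(num_iterations):
        a3, cache = forward_propagation(X, parameters)                    # 1. forward pass
        cost = compute_loss(a3, Y)                                        # 2. loss
        grads = backward_propagation(X, Y, cache)                         # 3. backward pass
        parameters = update_parameters(parameters, grads, learning_rate)  # 4. gradient-descent update

    return parameters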

Note: As we build this model, we will keep a special eye on the Initialization Part as this is our primary focus for solving this problem.

Let’s define all the functions now, one by one –

Step — 2: The model definition

In this section we will define the following functions —

a) forward_propagation()

b) compute_loss()

c) backward_propagation()

d) update_parameters()

e) model()

Note: I have purposely not included “initialization of parameters” here, as we will handle it separately by experimenting with different techniques. When you run the entire code, that will be the first function called in the above sequence.
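
The full implementations are included in the source code at the end; as a reference, here is a minimal sketch of forward_propagation() and compute_loss() for this three-layer architecture (the relu()/sigmoid() helpers and the W1…b3 parameter names are assumptions of this sketch):

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_propagation(X, parameters):
    # (Linear -> ReLU) -> (Linear -> ReLU) -> (Linear -> Sigmoid)
    W1, b1 = parameters["W1"], parameters["b1"]
    W2, b2 = parameters["W2"], parameters["b2"]
    W3, b3 = parameters["W3"], parameters["b3"]
    z1 = np.dot(W1, X) + b1
    a1 = relu(z1)
    z2 = np.dot(W2, a1) + b2
    a2 = relu(z2)
    z3 = np.dot(W3, a2) + b3
    a3 = sigmoid(z3)
    cache = (z1, a1, W1, b1, z2, a2, W2, b2, z3, a3, W3, b3)
    return a3, cache

def compute_loss(a3, Y):
    # Cross-entropy loss, averaged over the m examples
    m = Y.shape[1]
    logprobs = np.multiply(-np.log(a3), Y) + np.multiply(-np.log(1 - a3), 1 - Y)
    return 1. / m * np.nansum(logprobs)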

Also, I will not go over the theory in this article, as there is a lot of material available on the web and I have also covered it in detail in my earlier blog. You can find it here —

Now that we have built the model definition, the only missing piece is the “initialization” bit, so let’s do that. As mentioned above, we will experiment with three initialization techniques and see which one works best. They are —

a) Zero Initialization

b) Random Initialization

c) He initialization

Step — 3a : Zero Initialization

There are two types of parameters to initialize in a neural network and they are —

  • the weight matrices (W[1],W[2],W[3],…,W[L−1],W[L])
  • the bias vectors (b[1],b[2],b[3],…,b[L−1],b[L])

In this part we will initialize all parameters to zeros and see what happens.
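
A minimal sketch of this zero initialization, given the layers_dims list we defined earlier:

def initialize_parameters_zeros(layers_dims):
    parameters = {}
    L = len(layers_dims)  # number of layers, including the input layer
    for l in range(1, L):
        # Every weight and every bias starts at exactly zero
        parameters["W" + str(l)] = np.zeros((layers_dims[l], layers_dims[l - 1]))
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters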

Now that we have defined the initialization function, we will run the model to get the trained parameters and then run it on the test set to see our predictions. So, before we run the model, let’s define predict().
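
Here is a minimal sketch of predict(); it reuses forward_propagation() from the sketch above and thresholds the sigmoid output at 0.5 (the printed accuracy and the returned array are assumptions about its exact interface):

def predict(X, y, parameters):
    a3, _ = forward_propagation(X, parameters)
    # Probability > 0.5 -> class 1 (blue dot), otherwise class 0 (red dot)
    p = (a3 > 0.5).astype(int)
    print("Accuracy: " + str(np.mean(p[0, :] == y[0, :])))
    return p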

Now let’s run the model —

parameters = model(train_X, train_Y, initialization = "zeros")

This is what the output looks like —

And now predict with these optimal parameters —

print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

And the output goes like this —

The performance is really bad: the cost does not really decrease, and the algorithm performs no better than random guessing. Let’s look at the details of the predictions and the decision boundary:

print ("predictions_train = " + str(predictions_train))
print ("predictions_test = " + str(predictions_test))

So, as we see above the model is predicting 0 for every example.

In general, initializing all the weights to zero results in the network failing to break symmetry.

This means that every neuron in each layer will learn the same thing, and you might as well be training a neural network with n[l]=1 for every layer, and the network is no more powerful than a linear classifier such as logistic regression.

So, the takeaway from this experiment is —

The weights W[l] should be initialized randomly to break symmetry.

It is however okay to initialize the biases b[l] to zeros. Symmetry is still broken so long as W[l] is initialized randomly.

Step — 3b : Random Initialization

To break symmetry, let’s initialize the weights randomly. Following random initialization, each neuron can then proceed to learn a different function of its inputs. In this experiment, we will see what happens if the weights are initialized randomly, but to very large values.

To get large random weights, we multiply the randomly drawn values by 10.
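
A minimal sketch of this large random initialization (the *10 factor is exactly the “very large values” mentioned above; the seed just makes the run reproducible):

def initialize_parameters_random(layers_dims):
    np.random.seed(3)  # for reproducibility
    parameters = {}
    L = len(layers_dims)
    for l in range(1, L):
        # Standard-normal draws scaled up by 10 -> deliberately large weights
        parameters["W" + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * 10
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters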

So, now let’s run the model and predict the test set —

parameters = model(train_X, train_Y, initialization = "random")
print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

It looks like we have broken symmetry, and this gives better results than before. The model is no longer outputting all 0s.

But here are few observations —

  • The cost starts very high. This is because with large random-valued weights, the last activation (sigmoid) outputs results that are very close to 0 or 1 for some examples, and when it gets that example wrong it incurs a very high loss for that example. Indeed, when log(a[3])=log(0), the loss goes to infinity.
  • Poor initialization can lead to vanishing/exploding gradients, which also slows down the optimization algorithm.
  • If you train this network longer you will see better results, but initializing with overly large random numbers slows down the optimization.

So, here are our takeaways from this exercise —

Initializing weights to very large random values does not work so well.

Hopefully, initializing with small random values does better. The important question is: how small?

The next part tries to answer how small the random values we use to initialize our parameters should be.

Step — 3c : He Initialization

It is pronounced as “Hey” (I think so :)). It is named after the first author of He et al., 2015.

As we have seen before, we don’t want the pre-activation Z to blow up, nor do we want it to become too small (visualize what the logistic function looks like at its extremes). Since Z is a weighted sum over the n inputs to a layer, the larger n (the number of nodes feeding into the layer) is, the smaller we want the individual weights w to be.

One reasonable thing to do is to set the variance of the weights w(i) to 1/n. In practice, if we are using the ReLU activation, a variance of 2/n works better than 1/n. Let’s put this up mathematically —
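
In other words, for layer l we draw the weights from a standard normal distribution and scale them by sqrt(2 / n[l-1]), where n[l-1] is the number of nodes in the previous layer. A minimal sketch:

def initialize_parameters_he(layers_dims):
    np.random.seed(3)  # for reproducibility
    parameters = {}
    L = len(layers_dims)
    for l in range(1, L):
        # Scale by sqrt(2 / n_prev) so the variance of the pre-activations stays stable across layers
        parameters["W" + str(l)] = (np.random.randn(layers_dims[l], layers_dims[l - 1])
                                    * np.sqrt(2. / layers_dims[l - 1]))
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters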

Now, let’s run the model —

parameters = model(train_X, train_Y, initialization = "he")
print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

The model with He initialization separates the blue and the red dots very well in a small number of iterations.

Source Code

Sources

  1. Deep Learning Specialization by Andrew Ng & team.