Steps to basic modern NN model from scratch

Source: Deep Learning on Medium

Steps 2 & 3 — ReLU/Init & Forward pass

After defining the matrix multiplication strategy, it's time to define the ReLU function and the forward pass for the neural network. I would request readers to go through Part 1 of the series for the background on the data used below.

The Neural Network is defined as below:

output = MSE(Linear(ReLU(Linear(X))))

Basic Architecture

n,m = x_train.shape   # n: number of training samples, m: number of input features
c = y_train.max()+1   # c: number of classes
n,m,c
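The weight shapes below also use nh, the number of hidden units, which is not shown in the snippets here. A reasonable assumption, in line with the fastai notebooks this series follows, is a small hidden layer:

nh = 50  # assumed number of hidden units; any reasonable hidden size works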

Let us define the weights for the matrix multiplication.

I will create a 2-layer neural network.

  • The first linear layer will do the matrix multiplication of the input with w1.
  • The output of the first linear layer will be the input for the second linear operation, where it will be multiplied with w2.
  • Instead of getting ten predictions for a single input, I will obtain a single output and use MSE to calculate the loss.
  • Let us declare the weights and biases.
w1 = torch.randn(m,nh)  # weights of the first linear layer
b1 = torch.zeros(nh)    # biases of the first linear layer
w2 = torch.randn(nh,1)  # weights of the second linear layer
b2 = torch.zeros(1)     # biases of the second linear layer
  • When we declare the weights using PyTorch's randn, the values obtained are already normalized, i.e. they are drawn from a distribution with a mean of 0 and a standard deviation of 1 (the biases here are simply initialised to zero).
  • We want the weights, biases, and inputs to be normalized so that the linear operation does not produce very large values; large activations quickly become difficult to handle numerically. Therefore, we prefer to normalise the inputs as well.
  • For the same reason, we want our input matrix to have a mean of 0 and a standard deviation of 1, which it currently does not. Let us check.
train_mean,train_std = x_train.mean(),x_train.std()
train_mean,train_std
  • Let us define a function to normalise the input matrix.
def normalize(x, m, s): return (x-m)/s

x_train = normalize(x_train, train_mean, train_std)
x_valid = normalize(x_valid, train_mean, train_std)
  • Note that we normalise both the training dataset and the validation dataset with the training mean and standard deviation, so that both datasets share the same feature definitions and scale. Now, let's recheck the mean and standard deviation (see the snippet just after this list).
Mean approx. to 0 & SD approx. to 1
  • Now, we have normalised weights, biases, and input matrix.
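To confirm, we can recheck the statistics of the normalised training data; the exact numbers will vary, but they should now be close to 0 and 1.

x_train.mean(), x_train.std()  # expect values close to (0., 1.)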

Let us define the linear layer for the Neural Network and perform the operation.

def lin(x, w, b): return x@w + b

t = lin(x_valid, w1, b1)
t.mean(),t.std()

The mean and standard deviation obtained after the linear operation are again non-normalized, so the problem is still there. If it remains like this, stacking more linear operations will lead to larger and larger values, which become challenging to handle. Thus, we want the activations after the linear operation to be normalized as well.
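The size of the blow-up is predictable: each output of the linear layer is a sum of m products of roughly unit-variance terms, so its variance is about m and its standard deviation about sqrt(m). A rough check, assuming the 784 input features of the MNIST data from Part 1:

import math
math.sqrt(784)  # ~28: the order of magnitude of the std after the first linear layer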

Simplified Kaiming Initialization or He Initialization

To handle the non-normalized behaviour of the linear operation, we define the weights with Kaiming initialization. Though Kaiming initialization or He initialisation was designed to handle the ReLU/Leaky ReLU operation, a simplified version of it can be used for plain linear operations as well.

We divide our weights by math.sqrt(m), where m is the number of inputs to the layer (the number of rows of the weight matrix).

After the above tweak, we get a roughly normalized mean and SD.

w1 = torch.randn(m,nh)/math.sqrt(m)
def lin(x, w, b): return x@w + b
t = lin(x_valid, w1, b1)
t.mean(),t.std()

Let us define the ReLU layer for the neural network and perform the operation. As for why we need ReLU as a non-linear activation function, I hope you are aware of the Universal Approximation Theorem.

def relu(x): return x.clamp_min(0.)
t = relu(lin(x_valid, w1, b1))
t.mean(),t.std()
t = relu(lin(x_valid, w1, b1))
t = relu(lin(t, w2, b2))
t.mean(),t.std()

Did you notice something weird?

Notice above that our standard deviation is roughly half of the one obtained after the linear operation. If it gets halved after one layer, then after eight layers it will drop to about 1/2⁸ of the original, which is very small. And if our neural network has 10,000 layers 😵, forget about it.
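A quick way to see where the factor of 2 in Kaiming initialization comes from: ReLU zeroes out the negative half of a zero-mean input, so the mean of the squared activations drops to half of its original value. A minimal sketch:

import torch

z = torch.randn(1000000)
z.pow(2).mean()                # ~1.0: second moment of a standard normal
z.clamp_min(0.).pow(2).mean()  # ~0.5: ReLU keeps only half of it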

From PyTorch docs:
a: the negative slope of the rectifier used after this layer (0 for ReLU by default)

This was introduced in the paper from Kaiming He and others: Delving Deep into Rectifiers, which was also the first paper to claim “super-human performance” on Imagenet (and, most importantly for us, it introduced the PReLU activation and Kaiming initialization).

Thus, following the same strategy, we will multiply our weights by math.sqrt(2/m).

w1 = torch.randn(m,nh)*math.sqrt(2/m)
t = relu(lin(x_valid, w1, b1))
t.mean(),t.std()
It is still better than (0.1276, 0.5803).

Though we have better results, the mean is still not great. As per the fastai docs, we can handle the mean with the tweak below.

def relu(x): return x.clamp_min(0.) - 0.5
w1 = torch.randn(m,nh)*math.sqrt(2./m)
t1 = relu(lin(x_valid, w1, b1))
t1.mean(),t1.std()
Now, it is so much better.

Let us combine all of the above code and strategies and create the forward pass of our neural network. PyTorch has a built-in method for Kaiming initialization, i.e. kaiming_normal_.

def model(xb):
    l1 = lin(xb, w1, b1)
    l2 = relu(l1)
    l3 = lin(l2, w2, b2)
    return l3

%timeit -n 10 _=model(x_valid)
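As mentioned above, PyTorch ships this initialisation as kaiming_normal_. A sketch of how it could replace the manual scaling: since our w1 is stored as (in_features, out_features) rather than PyTorch's usual (out, in) layout for nn.Linear, we pass mode='fan_out' to get the same sqrt(2/m) scaling.

from torch.nn import init

w1 = torch.zeros(m, nh)
init.kaiming_normal_(w1, mode='fan_out')  # fills w1 in place with std sqrt(2/m)
t = relu(lin(x_valid, w1, b1))
t.mean(), t.std()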

The last thing to define for the forward pass is the loss function: MSE.

As we know, we generally use CrossEntropyLoss as the loss function for single-label classification problems. I will address that later. For now, I am using MSE to keep the operations easy to understand.

def mse(output, targ): return (output.squeeze(-1) - targ).pow(2).mean()
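The squeeze(-1) matters here: the model outputs shape (n, 1) while the targets have shape (n,), and without the squeeze, broadcasting would silently produce an (n, n) matrix of differences. A small illustration with made-up tensors:

import torch

out = torch.randn(5, 1)          # model output: one value per sample
targ = torch.randn(5)            # targets
(out - targ).shape               # torch.Size([5, 5]) -- the broadcasting trap
(out.squeeze(-1) - targ).shape   # torch.Size([5])    -- what mse actually wants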

Let us perform the above operations for the training dataset.

preds = model(x_train)
preds.shape, preds

To compute the MSE, we need the targets as floats.

y_train,y_valid = y_train.float(),y_valid.float()

mse(preds, y_train)

After all of the above operations, one question has still not been answered properly, and it is:

Why do we need Kaiming Initialization?

Let us understand it again.

Initialise two tensors as below.

import torch

x = torch.randn(512)
a = torch.randn(512,512)

For neural networks, the primary step is matrix multiplication. If we have a deep neural network with approximately 100 layers, let us see what the mean and standard deviation of the resulting activations will be.

for i in range(100):
    x = x @ a

x.mean(), x.std()
mean=nan, sd=nan

We can easily see that the mean and the standard deviation are no longer numbers. And that makes sense: the computer cannot represent numbers that large, so the values overflow and turn into nan. For a long time, this is what prevented practitioners from training very deep neural networks.

The problem you’ll get with that is activation explosion: very soon, your activations will go to nan. We can even ask the loop to break when that first happens:

for i in range(100):
    x = x @ a
    if x.std() != x.std(): break

i
Just after 28 loops
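A back-of-the-envelope calculation (assuming float32) explains why it happens around iteration 28: each multiplication by a 512x512 standard-normal matrix scales the standard deviation by roughly sqrt(512) ≈ 22.6, and float32 overflows around 3.4e38.

import math
math.log(3.4e38) / math.log(math.sqrt(512))  # ~28 iterations before overflow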

Such problems are what led to the invention of Kaiming initialization. It took the field decades to finally arrive at the idea.
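To see the fix in action, here is the same experiment with the simplified sqrt(1/fan_in) scaling from earlier (no ReLU in this toy loop, so the factor of 2 is not needed); the activations now stay finite instead of exploding to nan.

import math
import torch

x = torch.randn(512)
a = torch.randn(512, 512) * math.sqrt(1./512)  # simplified Kaiming-style scaling
for i in range(100):
    x = x @ a
x.mean(), x.std()  # finite values even after 100 layers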

So, that’s how we define the ReLu and backward pass for the neural network.