Deep Learning from Foundations

Source: Deep Learning on Medium

Forward and Backward pass

def normalize(x, m, s): return (x-m)/s
train_mean,train_std = x_train.mean(),x_train.std()
(tensor(0.1304), tensor(0.3073))

The mean and std are not 0 and 1, but we want the mean to be 0 and the std to be 1. Hence we apply the normalization function:

x_train = normalize(x_train, train_mean, train_std)
# NB: use the training mean and std, not the validation set's, for the validation set
x_valid = normalize(x_valid, train_mean, train_std)
train_mean,train_std = x_train.mean(),x_train.std()
(tensor(3.0614e-05), tensor(1.))
# mean is now close to 0 and std close to 1
n,m = x_train.shape
c = y_train.max()+1 # number of output activations (classes)
(50000, 784, tensor(10))
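As a standalone sanity check (a sketch using synthetic data with roughly MNIST-like statistics, since x_train above is the real dataset), normalizing a tensor by its own statistics drives the mean to ~0 and the std to ~1:

```python
import torch

def normalize(x, m, s): return (x-m)/s

# synthetic stand-in for x_train, with roughly MNIST-like statistics
x = torch.randn(10000, 784)*0.3073 + 0.1304
mean, std = x.mean(), x.std()
x_norm = normalize(x, mean, std)
print(x_norm.mean(), x_norm.std())  # mean ~0, std ~1
```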

Now let's try to create a model with one hidden layer. For simplicity we will use MSE as the loss for the time being.

We create a hidden layer with 50 neurons. For two layers we need two weight matrices and two bias vectors.

# num hidden
nh = 50
# simplified kaiming init / he init
w1 = torch.randn(m,nh)/math.sqrt(m)
b1 = torch.zeros(nh)
w2 = torch.randn(nh,1)/math.sqrt(nh)
b2 = torch.zeros(1)

We have input x_valid (the input to layer 1) with mean 0 and std 1, and we want the input to the second layer to also have mean 0 and std 1. Dividing by math.sqrt(m) achieves exactly this. It is a simplified version of Kaiming initialization.

# This should be ~ (0,1) (mean,std)...
(tensor(-0.0058), tensor(0.9924))
def lin(x, w, b): return x@w + b
t = lin(x_valid, w1, b1)
# ...so should this, because we used kaiming init, which is designed to do this
(tensor(0.0004), tensor(0.9786))

torch.randn(m,nh) gives weights with mean 0 and std 1, while torch.randn(m,nh)/math.sqrt(m) gives weights with std 1/sqrt(m).
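A small standalone check of this (random data standing in for the real input; m and nh as above) shows why dividing by sqrt(m) keeps the layer output at std ~1: each output sums m terms of variance 1/m.

```python
import math
import torch

torch.manual_seed(42)
m, nh = 784, 50
x = torch.randn(2000, m)                 # input with mean ~0, std ~1
w = torch.randn(m, nh)/math.sqrt(m)      # weights with std ~1/sqrt(m)
t = x @ w
# each output sums m terms of variance 1/m, so its variance is ~1
print(w.std(), t.std())
```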

Careful initialization is key for good NN performance

t = lin(x_valid, w1, b1) # ...but this is not how the first layer is actually defined

The first layer actually includes a relu:

def relu(x): return x.clamp_min(0.)
t = relu(lin(x_valid, w1, b1))
#...actually it really should be this!
(tensor(0.3875), tensor(0.5665))

After the relu, the output no longer has mean 0 and std 1. This is the problem Kaiming initialization was designed to solve.

  • If the variance is halved at every layer, then after 8 layers it is only 1/256 of the original, which is very low.
  • We can fix the low post-relu variance by using torch.randn(m,nh)*math.sqrt(2/m), i.e. replacing the 1 with a 2.
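The variance-halving claim can be checked with a quick sketch (random data and square layers for simplicity; this is not the notebook's code):

```python
import math
import torch

torch.manual_seed(0)
m = 784
x = torch.randn(512, m)

def std_after_8_relu_layers(scale):
    h = x
    for _ in range(8):
        w = torch.randn(m, m)*scale
        h = (h @ w).clamp_min(0.)   # linear layer followed by relu
    return h.std().item()

print(std_after_8_relu_layers(math.sqrt(1/m)))  # variance collapses
print(std_after_8_relu_layers(math.sqrt(2/m)))  # std stays healthy
```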
# kaiming init / he init for relu
w1 = torch.randn(m,nh)*math.sqrt(2/m)
(tensor(-7.2458e-05), tensor(0.0507))
t = relu(lin(x_valid, w1, b1))
(tensor(0.5510), tensor(0.8071))

Even though the std has improved, the mean is now around 0.5. One solution is to subtract 0.5 from the result of the relu.

# what if...?
def relu(x): return x.clamp_min(0.) - 0.5
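A standalone sketch of what the -0.5 buys us, using normally distributed fake pre-activations:

```python
import torch

torch.manual_seed(0)

def relu(x):         return x.clamp_min(0.)
def shifted_relu(x): return x.clamp_min(0.) - 0.5

z = torch.randn(100000)          # fake pre-activations, mean ~0, std ~1
print(relu(z).mean())            # ~0.4: shifted well away from zero
print(shifted_relu(z).mean())    # ~-0.1: much closer to zero
```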
We can now introduce init.kaiming_normal_

w1 = torch.randn(m,nh)*math.sqrt(2/m)
# is the same as
from torch.nn import init
w1 = torch.zeros(m,nh)
init.kaiming_normal_(w1, mode='fan_out')
t = relu(lin(x_valid, w1, b1))

Read section 2.2 of the ResNet paper, up to the part on backward propagation.

What is mode='fan_out'?

?? init.kaiming_normal_

mode: either ``'fan_in'`` (default) or ``'fan_out'``. Choosing ``'fan_in'``
preserves the magnitude of the variance of the weights in the forward pass.
Choosing ``'fan_out'`` preserves the magnitudes in the backwards pass.

'fan_out' preserves the variance in the backward pass (for torch's weight layout that means w1 = torch.randn(m,nh)*math.sqrt(2/nh)). So why did we use fan_out and not fan_in?

torch.Size([784, 50])   # our w1
torch.Size([50, 784])   # torch.nn.Linear's weight for the same layer

Our weight matrix is the transpose of torch's layout. For our (784, 50) tensor, mode='fan_out' therefore picks out 784, the true input dimension.
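To see the shape issue concretely (a sketch; for a 2-d tensor torch takes fan_in from size(1) and fan_out from size(0)), mode='fan_out' on our (784, 50) matrix scales by the 784 we want:

```python
import math
import torch
from torch.nn import init

torch.manual_seed(0)
w = torch.zeros(784, 50)
init.kaiming_normal_(w, mode='fan_out')   # fan_out = size(0) = 784 here
print(w.std(), math.sqrt(2/784))          # both ~0.05
```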

F is the standard short form for torch.nn.functional.

nn.Linear initializes its weight by calling .kaiming_uniform_ with a=math.sqrt(5), which is a form of Kaiming initialization. It is unclear where the 5 comes from.
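As a sketch of what a=math.sqrt(5) actually does (my own check, assuming the current nn.Linear default init): kaiming_uniform_ treats a as a leaky-relu slope, giving gain sqrt(2/(1+5)) = sqrt(1/3), so the uniform bound collapses to 1/sqrt(fan_in):

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
lin = nn.Linear(784, 50)
# bound = gain*sqrt(3/fan_in) = sqrt(1/3)*sqrt(3/784) = 1/sqrt(784)
print(lin.weight.abs().max(), 1/math.sqrt(784))
```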

This version of relu works well:

# what if...?
def relu(x): return x.clamp_min(0.) - 0.5
# kaiming init / he init for relu
w1 = torch.randn(m,nh)*math.sqrt(2./m)
t1 = relu(lin(x_valid, w1, b1))
(tensor(0.1513), tensor(0.8884))