Source: Deep Learning on Medium

**Forward and Backward pass**

def normalize(x, m, s): return (x-m)/s

train_mean,train_std = x_train.mean(),x_train.std()

train_mean,train_std

(tensor(0.1304), tensor(0.3073))

The mean and std are not 0 and 1, but we want them to be 0 and 1, so we apply the normalization function.

x_train = normalize(x_train, train_mean, train_std)

# NB: Use training, not validation, mean for validation set

x_valid = normalize(x_valid, train_mean, train_std)

train_mean,train_std = x_train.mean(),x_train.std()

train_mean,train_std

(tensor(3.0614e-05), tensor(1.)) # mean and std are now much closer to 0 and 1

n,m = x_train.shape

c = y_train.max()+1 # number of activations / outputs

n,m,c

(50000, 784, tensor(10))
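The normalization step above can be sketched end-to-end. This is a minimal, self-contained version that uses synthetic data in place of the MNIST tensors `x_train` and `x_valid` (the random shapes and statistics here are assumptions for illustration, not the post's actual data):

```python
import torch

def normalize(x, m, s): return (x - m) / s

torch.manual_seed(0)
# Synthetic stand-ins for the MNIST tensors, with an arbitrary mean/std
x_train = torch.randn(1000, 784) * 0.3 + 0.13
x_valid = torch.randn(200, 784) * 0.3 + 0.13

train_mean, train_std = x_train.mean(), x_train.std()
x_train = normalize(x_train, train_mean, train_std)
# The validation set is normalized with the *training* statistics,
# so both sets end up on the same scale
x_valid = normalize(x_valid, train_mean, train_std)

print(x_train.mean(), x_train.std())  # now ~0 and ~1
```

Normalizing the validation set with its own statistics would silently put it on a slightly different scale from the training data, which is why the training mean and std are reused.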

Now let's try to create a model with one hidden layer. For simplicity we will use MSE loss for the time being.

We create a hidden layer with 50 neurons. For two layers we need two weight matrices and two bias vectors.

# num hidden

nh = 50

# simplified kaiming init / he init

w1 = torch.randn(m,nh)/math.sqrt(m)

b1 = torch.zeros(nh)

w2 = torch.randn(nh,1)/math.sqrt(nh)

b2 = torch.zeros(1)

test_near_zero(w1.mean())

test_near_zero(w1.std()-1/math.sqrt(m))

We have input x_valid (the input to layer 1) with mean 0 and std 1, and **we want the input to the second layer to also have mean 0 and std 1.** Dividing the weights by math.sqrt(m) achieves this. This is a simplified version of Kaiming initialization.

# This should be ~ (0,1) (mean,std)...

x_valid.mean(),x_valid.std()

(tensor(-0.0058), tensor(0.9924))

def lin(x, w, b): return x@w + b

t = lin(x_valid, w1, b1) #...so should this, because we used kaiming init, which is designed to do this

t.mean(),t.std()

(tensor(0.0004), tensor(0.9786))

torch.randn(m,nh) gives a mean of 0 and a std of 1, while torch.randn(m,nh)/math.sqrt(m) gives a std of 1/sqrt(m).
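A quick numerical sketch of this claim: scaling by 1/sqrt(m) shrinks the weight std to 1/sqrt(m), and a matrix product with unit-variance inputs then has std close to 1 again (variable names here mirror the post's; the input `x` is random data assumed for illustration):

```python
import math
import torch

torch.manual_seed(0)
m, nh = 784, 50
w = torch.randn(m, nh)                         # std ~ 1
w_scaled = torch.randn(m, nh) / math.sqrt(m)   # std ~ 1/sqrt(m) ~ 0.036

print(w.std(), w_scaled.std(), 1 / math.sqrt(m))

# With unit-variance inputs, each output is a sum of m terms of
# variance 1/m, so x @ w_scaled has std ~ 1 again
x = torch.randn(100, m)
print((x @ w_scaled).std())
```

This is exactly why the scaling keeps activations from blowing up or vanishing as they pass through a linear layer.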

**Careful initialization is key for good NN performance**

`t = lin(x_valid, w1, b1) # is not how the first layer is defined`

The first layer is actually followed by a ReLU activation.

def relu(x): return x.clamp_min(0.)

t = relu(lin(x_valid, w1, b1)) #...actually it really should be this!

t.mean(),t.std()

(tensor(0.3875), tensor(0.5665))

After the ReLU, the output no longer has mean 0 and std 1: the ReLU zeroes out roughly half the activations, which shifts the mean up and shrinks the std. Kaiming initialization solves this by scaling the weights by sqrt(2/m) instead of 1/sqrt(m), compensating for the halved variance.
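A minimal sketch of the ReLU-aware version of the init: scaling the weights by sqrt(2/m) rather than 1/sqrt(m) brings the post-ReLU std much closer to 1 (the input here is random unit-variance data assumed for illustration, not the actual x_valid):

```python
import math
import torch

def relu(x): return x.clamp_min(0.)
def lin(x, w, b): return x @ w + b

torch.manual_seed(0)
n, m, nh = 10000, 784, 50
x = torch.randn(n, m)                       # unit-variance input
w1 = torch.randn(m, nh) * math.sqrt(2 / m)  # Kaiming init for ReLU layers
b1 = torch.zeros(nh)

t = relu(lin(x, w1, b1))
# std is now ~0.8 rather than ~0.57, much closer to 1
print(t.mean(), t.std())
```

The extra factor of sqrt(2) exactly offsets the variance lost when the ReLU discards the negative half of the pre-activations.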