Original article was published by Reza Bagheri on Deep Learning on Medium
Weight Initialization in Deep Neural Networks
Learn about different weight initialization methods in neural networks
Weights and biases are the adjustable parameters of a neural network, and during the training phase, they are changed using the gradient descent algorithm to minimize the cost function of the network. However, they must be initialized before one can start training the network, and this initialization step has an important effect on the network training. In this article, I will first explain the importance of weight initialization and then discuss the different methods that can be used for this purpose.
Currently Medium supports superscripts only for numbers, and it has no support for subscripts. So to write the names of variables, I use this notation: every character after ^ is a superscript character, and every character after _ (and before ^ if it's present) is a subscript character. For example
is written as w_ij^[l] in this notation.
Feedforward neural networks
Before we discuss the weight initialization methods, we briefly review the equations that govern the feedforward neural networks. For a detailed discussion of these equations, you can refer to reference . Suppose that you have a feedforward neural network as shown in Figure 1. Here
is the network’s input vector. Each x_i is an input feature. The network has L layers and the number of neurons in layer l is n^[l]. The input layer is considered as layer zero. So the number of input features is n^[0]. The output or activation of neuron i in layer l is a_i^[l].
The weights for neuron i in layer l can be represented by the vector
where w_ij^[l] represents the weight for the input j (coming from neuron j in layer l-1) going into neuron i in layer l (Figure 2).
In layer l, each neuron receives the output of all the neurons in the previous layer multiplied by its weights, w_i1, w_i2, . . . , w_in. The weighted inputs are summed together, and a constant value called bias (b_i^[l]) is added to them to produce the net input of the neuron
The net input of neurons in layer l can be represented by the vector
Similarly, the activation of neurons in layer l can be represented by the activation vector
So Eq. 3 can be written as
where the summation has been replaced by the inner product of the weight and activation vectors. The net input is then passed through the activation function g to produce the output or activation of neuron i
We usually assume that the input layer is the layer zero and
So for the first layer, Eq. 7 is written as
We can combine all the weights of a layer into a weight matrix for that layer
So W^[l] is an n^[l] × n^[l-1] matrix, and the (i,j) element of this matrix gives the weight of the connection that goes from the neuron j in layer l-1 to the neuron i in layer l. We can also have a bias vector for each layer
Now we can write Eq. 6 in a vectorized form
And using Eq. 7 we get
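The vectorized forward pass described above can be sketched in NumPy. This is a minimal illustration, not code from the article; the `forward` helper name, the layer sizes, and the use of tanh are my own choices:

```python
import numpy as np

def forward(x, weights, biases, g):
    # Propagate the input through the layers: z = W a + b, a = g(z)
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b   # net input of the layer
        a = g(z)        # activation of the layer
    return a

# Toy network: 3 input features, a hidden layer of 4 neurons,
# and an output layer of 2 neurons, all with tanh activations
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.1, (4, 3)), rng.normal(0, 0.1, (2, 4))]
biases = [np.zeros(4), np.zeros(2)]
yhat = forward(np.array([0.5, -0.2, 0.1]), weights, biases, np.tanh)
```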
We usually use yhat to denote the activation of the output layer
And the vector y to denote the actual label of the input vector (Eq. 1). For binary classification, y only has one element and can be considered a scalar. But for multiclass and multilabel classifications, it is a one-hot or multi-hot encoded vector (refer to  for more details). During the backpropagation, we first calculate the error of neuron i in the last layer. The error is defined as the partial derivative of the loss function with respect to the net input
The error is a measure of the effect of this neuron in changing the loss function of the whole network. Then we use this error term to calculate the error of neurons in the previous layer
In this way, we calculate the error term of each layer using that of the next layer until we reach the first layer. We can calculate the gradient of the loss function with respect to the weights and biases in each layer using the error term of that layer
And using them we can update the values of the weights and biases for the next step of the gradient descent
where J is the cost function of the network. The weights and biases are updated until they converge to their optimum values that minimize the cost function.
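A single update step of Eqs. 19 and 20 can be sketched as follows; the learning rate and the toy gradient values below are arbitrary, chosen only for illustration:

```python
import numpy as np

def sgd_step(W, b, dW, db, lr=0.1):
    # Gradient descent update: move against the gradient of the cost
    return W - lr * dW, b - lr * db

W = np.array([[1.0, 2.0]])
b = np.array([0.5])
dW = np.array([[0.2, -0.4]])   # toy gradients for illustration
db = np.array([0.1])
W_new, b_new = sgd_step(W, b, dW, db)
```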
The importance of random weight initialization
The simplest method that we can use for weight initialization is assigning a constant number to all the weights. For example, we can initialize all the weights with zero. However, it turns out to be a bad idea. Let me explain it in more detail. Assume that we have a neural network (called network A) with L layers and n^[l] neurons in each layer. Now for each layer of the network, we initialize the weight matrix with a constant value ω^[l] and the bias vector with a constant value β^[l]. So in layer l we have
for all values of i and j. Please note that two different layers can have different values of ω^[l] and β^[l]. If we initialize the weights and biases using Eq. 21, then it can be shown that in each step of gradient descent the weights and biases in each layer are the same (the proof is given in the appendix). So we can assume that after training network A on a data set, its weights and biases converge to ω_f^[l] and β_f^[l]. So in each layer, the weights and biases are the same for all the neurons.
Now assume that we have a second network (called network B) with the same number of layers, and it only has one neuron in each layer. To be able to compare the networks A and B, we use the superscript <B> to indicate the quantities that belong to network B. Both networks are shown in Figure 3.
At each layer, both networks have the same activation functions, and they also have the same input features, so
We initialize all the bias values with β^[l] (from Eq. 21). For the first layer, we initialize the weight matrix (Eq. 10) with the same values as network A
Since we only have one neuron and n^[0] input features, the weight matrix is indeed a row vector. Each element of this matrix is the constant value ω_f^[1]. For the next layers, we define the weight matrix as
where n^[l-1] is the number of neurons in layer l-1 in network A. Since we only have one neuron with one input in layers l>1, the weight matrix has only one element, and that element is ω_f^[l] n^[l-1]. Hence for each layer l>1 in network B, we initialize the weight matrix with the weights of network A multiplied by the number of neurons in the previous layer of network A. Now we can easily show (the proof is given in the appendix) that network B is equivalent to network A, which means that for the same input vector, they produce the same output during the gradient descent and after convergence.
So by using this symmetric weight initialization, network A behaves like network B, which has a limited learning capacity; however, the computational cost remains the same. Initializing all weights and biases of the network with the same values is a special case of this method, which leads to the same problem. The worst case is that we initialize all the weights with zero. In that case, according to Eq. 16, the error term for all the layers except the last one will be zero, so the gradients of the loss function will be zero too (Eqs. 19 and 20). As a result, the weights and biases in these layers won’t be updated, and the network cannot be trained at all.
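The symmetry problem is easy to demonstrate numerically. In this sketch (a toy two-layer network with a squared-error loss, which is my own construction, not the article's), every hidden neuron receives exactly the same gradient when the weights are initialized with a constant:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(3)   # input features
y = np.array([1.0])      # target label

# Constant initialization: every weight in a layer is the same
W1 = np.full((4, 3), 0.5); b1 = np.zeros(4)
W2 = np.full((1, 4), 0.5); b2 = np.zeros(1)

# Forward pass (tanh hidden layer, linear output)
z1 = W1 @ x + b1
a1 = np.tanh(z1)
yhat = W2 @ a1 + b2

# Backward pass for the squared-error loss L = (yhat - y)^2 / 2
d2 = yhat - y                      # output-layer error
d1 = (W2.T @ d2) * (1 - a1 ** 2)   # hidden-layer error
dW1 = np.outer(d1, x)              # gradient for the hidden weights

# Every row of dW1 is identical, so after the update the hidden
# neurons still share the same weights: the symmetry is never broken
```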
A neural network can be characterized by two dimensions. It has a depth, which is the number of layers, and a width, which is the number of neurons in each layer (assuming for simplicity that all the layers have the same number of neurons). Using a linear activation function in all the layers shrinks the depth of the network, so it behaves like a network with only one layer (the proof is given in ). Using symmetric weight and bias initialization shrinks the width of the network, so it behaves like a network with only one neuron in each layer (Figure 4).
So to break the symmetry, either the weights or the biases should not be initialized in this way. In practice, we use random initialization for the weights and initialize all the biases with zero or a small number. The weights are chosen for this purpose since their proper random initialization not only breaks the symmetry but also helps with the vanishing and exploding gradient problems, which are discussed in the next section.
Vanishing and exploding gradients
Using the backpropagation equations (Eqs. 15 and 16), we can calculate the error term for any layer in the network. Suppose that we want to calculate it for layer l. We first calculate the error term for the output layer and then move backward and calculate the error term for the previous layers until we reach layer l. It can be shown that the error term for layer l is
You can refer to  for the derivation of this equation. Based on this equation, each element of the error vector (which is the error for one of the neurons in that layer) is proportional to chained multiplications of the weights of the neurons in the next layers. Now if the weights are small numbers, the chained multiplications of these weights can result in an extremely small error term, especially in a deep network with many layers. Hence, some or all of the elements of the error vector will be extremely small.
Based on Eqs. 17 and 18, the gradients of the loss function and the cost function are proportional to the error term, so they will also become very small, which results in a very small step size for the weight and bias updates in gradient descent (Eqs. 19 and 20). The final result is a slow-down of the gradient descent method and of the network’s learning process. This is called the vanishing gradient problem.
It is also possible that the weights are very large numbers. Then the chained multiplication of these weights becomes a very large number, and the gradients and the step size in gradient descent can explode. The result is an unstable network, and the gradient descent steps cannot converge to the optimal values of the weights and biases since the steps are now too big and miss the optimal point. This is called the exploding gradient problem. We can use weight initialization techniques to address these problems.
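Both effects are easy to reproduce with a stack of random linear layers. The network width, depth, and weight scales below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, L = 50, 30   # width and depth of the toy network

def signal_norm(scale):
    # Push a random signal through L layers of weights drawn with the
    # given standard deviation and return its final magnitude
    a = rng.normal(size=n)
    for _ in range(L):
        a = rng.normal(0, scale, (n, n)) @ a
    return np.linalg.norm(a)

vanished = signal_norm(0.01)   # small weights: the signal dies out
exploded = signal_norm(1.0)    # large weights: the signal blows up
```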
Mean and variance rules
Before we discuss the initialization methods, we need to review some of the properties of mean and variance. We denote the mean of a random variable X with E[X] and its variance with Var(X). If X_1, X_2, . . . , X_n are independent random variables with finite means, and if a_1, a_2, . . . , a_n and b are arbitrary constants, then
In addition, if X and Y are two independent random variables, then we have
Variance can be also expressed in terms of the mean. If we have a random variable X, then
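These rules are easy to verify numerically with independent samples; the particular means and variances below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(2.0, 1.0, 1_000_000)    # E[X] = 2,  Var(X) = 1
Y = rng.normal(-1.0, 0.5, 1_000_000)   # E[Y] = -1, Var(Y) = 0.25

# E[3X + 2Y + 1] = 3 E[X] + 2 E[Y] + 1 = 5
mean_est = np.mean(3 * X + 2 * Y + 1)

# Var(3X + 2Y) = 9 Var(X) + 4 Var(Y) = 10 (X, Y independent)
var_est = np.var(3 * X + 2 * Y)

# E[XY] = E[X] E[Y] = -2 for independent X and Y
prod_est = np.mean(X * Y)
```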
Random initialization methods
The initialization methods that will be introduced in the next sections are based on random weight initialization to break the symmetry. However, since the weights are not symmetric anymore, we can safely initialize all the bias values with the same value. So in all these methods, the bias values are initialized with zero. These methods are also based on some assumptions, which are discussed below.
1- We assume that the weights for each layer are independent and identically distributed (IID). They are initialized with a uniform or normal distribution with a mean of 0 and variance of Var(w^[l])
The weights in each layer are independent of the weights in other layers.
2- The feature inputs are also assumed to be independent and identically distributed (IID). The feature inputs are independent of the weights. In addition, they are normalized, so
We also need to make an assumption about the activation function. We have two types of activation functions: the ones that are differentiable at z=0 (like sigmoid) and the ones that are not (like ReLU). If the activation function is differentiable at z=0, then we can use the Maclaurin series to approximate it when z is small. The Maclaurin series of a function is defined as
which can be used to calculate an approximation of f(x) when x is close to zero. The Maclaurin series of tanh is
When z is close to zero we can ignore the larger powers of z and write
The Maclaurin series of the sigmoid is
and for small values of z we can write
So when z is close to zero, sigmoid and tanh can be approximated with a linear function and we say that we are in the linear regime of these functions. Since we assume that the input features are normalized, their values are relatively small in the first iteration, and if we initialize the weights with small numbers, the net input of neurons (z_i^[l]) will be small initially. So we assume that:
3- The activation functions are in the linear regime at the first iteration.
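The linear-regime approximations above can be checked numerically; near z = 0, tanh(z) stays close to z and the sigmoid stays close to 1/2 + z/4:

```python
import numpy as np

z = np.linspace(-0.1, 0.1, 201)          # small net inputs
sigmoid = 1.0 / (1.0 + np.exp(-z))

# Maximum deviation from the first-order Maclaurin approximations
tanh_err = np.max(np.abs(np.tanh(z) - z))
sig_err = np.max(np.abs(sigmoid - (0.5 + z / 4)))
```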
Now based on these assumptions we can make some conclusions:
1- During the first iteration of gradient descent, the weights of neurons in each layer and the activations of the neurons in the previous layer are mutually independent. In addition, in each layer all the activations are independent.
Based on Eqs. 3, 7 and 9 we have
where the biases are assumed to be zero. So a_k^[l-1] can be calculated recursively from the activations of the previous layer until we reach the first layer, and a_i^[l] is a non-linear function of the input features and the weights of layers 1 to l
Since the weights in each layer are independent, and they are also independent of x_j and the weights of other layers, they will also be independent of a function of those weights and x_j (f in Eq. 42). So w_kp^[l] and a_i^[l-1] will be independent for all values of i, p, k, and l.
In addition, since all the weights are independent and the input features are independent too, the functions of them (f(w_kp^[m], x_j)) are also independent. So in layer l-1 all a_i^[l-1] are independent, which means that in each layer all the activations are independent. Of course, this is not true for the output layer if we have the softmax activation function there. Softmax is defined as
The output of each neuron in the softmax activation function is a function of the output of other neurons since they should sum to 1. However, since the values of z are small in the first iteration, we can write
So the output of the softmax function is roughly the same for all neurons and is only a function of the number of neurons in the output layer. Hence, we can assume the activations still don’t depend on each other or the weights of that layer.
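A quick numeric check of this approximation (the particular small net inputs are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

# For small net inputs every output is close to 1/n, where n is the
# number of neurons in the output layer
z = np.array([0.01, -0.02, 0.015, 0.0])
out = softmax(z)
```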
2- During the first iteration, the mean of the net input in each layer is zero.
Using Eqs. 6, 26, and 28 and the fact that weights and activations are mutually independent in the first iteration, we can write
Based on the first assumption the mean of the weights is zero (Eq. 31). So for all values of l we have
Similarly, we can use Eq. 6, 27, and 29 to write
Using Eqs. 31 and 32, the previous equation can be simplified
LeCun weight initialization
This method was first proposed by LeCun et al . As mentioned before, we want to prevent the vanishing or explosion of the gradients during the backpropagation. So we shouldn’t allow the error in Eq. 25 to vanish or explode. The errors in each layer are a function of the errors of the output layer (δ^[L]). On the other hand, the errors of the output layer are a function of the activations of the output layer (Eq. 15). So if the activations vanish or explode during the forward propagation, the same thing happens to the errors. As a result, we should prevent the activations in each layer from exploding or vanishing during the forward propagation. The variance is representative of the spread of data around its mean, so if the mean and variance of the activations in layer l are roughly equal to those of layer l-1, then the activations don’t vanish or explode traveling from layer l-1 to layer l. So for all values of i and j we should have two conditions
For l=1, the activations of the previous layer are the input features (Eq. 8), and their variance is equal to 1 (Eq. 34). So the previous equation can be written as
The LeCun method only works for the activation functions that are differentiable at z=0. It was initially derived for the tanh activation function, but it can also be extended to sigmoid.
Tanh activation functions
For tanh we can use Eqs. 37 and 48 to write
Using Eqs. 37, 46, and 49 we get:
This equation is true for all values of l. So the condition in Eq. 49 is satisfied, and the mean of activations doesn’t change in different layers. By substituting Eq. 53 into Eq. 52 and using the fact that the variance of all activation in a layer is the same (Eq. 51), we get
Now to satisfy the condition of Eq. 51 we should have
This is the LeCun Initialization formula. If we assume that the weights have a normal distribution, then we need to pick the weights from a normal distribution with a mean of zero and a variance of 1/n^[l-1]. We can also use a uniform distribution for the weights. If we have a uniform distribution over the interval [a, b], its mean will be
And its variance will be
So if we pick the weights in each layer from a uniform distribution over the interval
its mean will be zero and its variance will be the same as the variance given in Eq. 55.
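Both variants of LeCun initialization can be sketched as follows. The function name and layer sizes are my own; only the variances come from the formulas above:

```python
import numpy as np

def lecun_init(n_in, n_out, dist="normal", rng=np.random.default_rng()):
    # Zero mean with Var(w) = 1 / n_in (Eq. 55)
    if dist == "normal":
        return rng.normal(0.0, np.sqrt(1.0 / n_in), (n_out, n_in))
    # A uniform distribution on [-limit, limit] has variance limit^2 / 3,
    # so limit = sqrt(3 / n_in) gives the same variance
    limit = np.sqrt(3.0 / n_in)
    return rng.uniform(-limit, limit, (n_out, n_in))

W = lecun_init(1000, 1000, rng=np.random.default_rng(0))
Wu = lecun_init(1000, 1000, dist="uniform", rng=np.random.default_rng(1))
```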
Xavier weight initialization
The LeCun method only takes into account the forward propagation of the input signal. We can extend the previous discussion to backpropagation too. Based on that, Xavier Glorot et al  suggested another method that includes the backpropagation of the signal.
The error of each neuron in the output layer is given in Eq. 15. So δ^[L] is a function of the activations of the output layer (yhat) and the label vector (y). We already showed that in each layer all activations are independent. For binary classification, y only has one element (a scalar in that case). For multiclass and multilabel classifications, it is either a one-hot or multi-hot encoded vector, and obviously, all the elements are independent of each other. Since the errors of the neurons in the output layer are functions of some independent variables, they will be independent of each other. For tanh, from Eq. 37, we get
By substituting this equation into Eq. 16 we have
So δ_i^[l] can be calculated recursively from the error of the next layer until we reach the output layer, and it is a linear function of the errors of the output layer and the weights of layers l+1 to L
We already know that all the weights of layer l (w_ik^[l]) are independent. The errors of the output layer are independent. The weights and errors are not completely independent, since the error depends on the activations of the output layer, which can be written as a function of the weights of the network (Eq. 42). However, each weight w_pk^[l] is only used once, to produce the activation of neuron p in layer l. Since we have many layers and usually many neurons in each layer, the effect of a single weight on the activations and errors of the output layer is negligible, so we can assume that each activation in the output layer is independent of each weight in the network.
As a result, we can also assume that the error in each layer is independent of the weights of that layer. Now using this assumption and Eqs. 26, 28, and 57 we have
Now we can use this equation and Eqs. 27, 29, 31, and 32 to write
Based on this equation, the variance of δ_i^[l] is not a function of i, which means that the variance of all the errors in each layer is the same
So we can simplify Eq. 60 and write
Similar to forward propagation, the mean of the error is the same for all layers (Eq. 59), and we want the variance to remain the same. So from Eq. 62, we get
As you see, in the backpropagation, the variance of the weights in each layer is equal to the reciprocal of the number of neurons in that layer; however, in the forward propagation, it is equal to the reciprocal of the number of neurons in the previous layer. To resolve this conflict, we can pick the weights of each layer from a normal distribution with a zero mean and a variance of
This variance is the harmonic mean of the variances given in Eqs. 55 and 63. This is the Xavier initialization formula. We can also use a uniform distribution for the weights. So if we pick the weights in each layer from a uniform distribution over the interval
its mean will be zero and its variance will be the same as the variance given in Eq. 64.
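The Xavier formula can be sketched the same way. The harmonic mean of 1/n^[l-1] and 1/n^[l] is 2/(n^[l-1] + n^[l]), which is the variance used below; the function name and sizes are illustrative:

```python
import numpy as np

def xavier_init(n_in, n_out, dist="normal", rng=np.random.default_rng()):
    # Harmonic mean of the forward (1/n_in) and backward (1/n_out)
    # variances: Var(w) = 2 / (n_in + n_out)
    var = 2.0 / (n_in + n_out)
    if dist == "normal":
        return rng.normal(0.0, np.sqrt(var), (n_out, n_in))
    limit = np.sqrt(3.0 * var)   # uniform on [-limit, limit]
    return rng.uniform(-limit, limit, (n_out, n_in))

W = xavier_init(800, 1200, rng=np.random.default_rng(0))
```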
Kumar initialization for sigmoid activation functions
Kumar  studied the weight initialization in neural networks. He suggested a general weight initialization strategy for any arbitrary differentiable activation function, and used it to derive the initialization parameters for the sigmoid activation function. Here we are not going to discuss his method. Instead, we extend the Xavier method to use it for a sigmoid activation function. We can use Eqs. 27, 39, and 48 to write
Using Eqs. 26, 39, and 46 we get
By substituting Eq. 66 into Eq. 65 and using the fact that the variance of all activations in a layer is the same (Eq. 50), we get
But we know that the variance of the activations in each layer is 1 (Eq. 51), so we can simplify the previous equation
This is the result that was obtained by Kumar, and he believes that there is no need to set another constraint for the variance of the activations during backpropagation. So you can pick the weights from a normal or uniform distribution with the variance given in Eq. 68. However, we can also study the backpropagation. For the backpropagation, we first need to calculate the mean of the errors. From Eq. 39 we have
The mean of the errors is
And their variance will be
So to keep the variance of different layers the same, we should have
If we assume that the weights have a normal distribution, it has a zero mean and its variance can be taken from a harmonic mean of Eqs. 68 and 72.
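A hedged sketch of this idea, assuming Eq. 68 gives a forward-pass variance of 12.8/n^[l-1] (which follows from the sigmoid linear regime g(z) ≈ 1/2 + z/4 with unit activation variance, and matches Kumar's result) and Eq. 72 gives a backward-pass variance of 16/n^[l], combined through a harmonic mean as in the Xavier method:

```python
import numpy as np

def sigmoid_init(n_in, n_out, rng=np.random.default_rng()):
    # Harmonic mean of the assumed forward variance 12.8 / n_in and
    # the assumed backward variance 16 / n_out
    var = 2.0 / (n_in / 12.8 + n_out / 16.0)
    return rng.normal(0.0, np.sqrt(var), (n_out, n_in))

W = sigmoid_init(1000, 1000, rng=np.random.default_rng(0))
```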
He weight initialization
This method was first proposed by He et al . If we have an activation function which is not differentiable at z=0 (like ReLU), then we cannot use the Maclaurin series to approximate it. As a result, the Xavier method cannot be used anymore, and we should use a different approach.
ReLU activation function
ReLU is a widely-used non-linear activation function defined as
It is not differentiable at z=0, and we usually assume that its derivative is 0 or 1 at this point to be able to do the backpropagation. However, we cannot use the Maclaurin series to approximate it when z is close to zero. From Eq. 49 we have
for all values of l. To use the He initialization method we start with Eq. 48, and we use Eqs. 30, 51, and 74 to simplify it
Based on Eq. 75, the variance of z_i^[l] is the same for all the neurons in layer l. So we can write
Recall that we assumed that the weights have a uniform or normal distribution with a mean of zero. So they should have a symmetric distribution around zero. We know that
So z_i^[l] can be considered as a linear combination of the weights. We also know that its mean is zero (Eq. 46), so it should also have a symmetric distribution around zero. Hence, its probability density is an even function. Now we can write
since the integrand is an even function. Based on the definition of ReLU activation (Eq. 73), we have
So using Eqs. 77 and 78 we have
Now we can use Eq. 30 and write
So using Eqs. 74 and 76 we get
By replacing Eq. 81 into Eq. 75 we have
To prevent the exploding or vanishing of the activations in each layer during the forward propagation, we should make sure that the net inputs don’t explode or vanish, so the variance of the net inputs in layer l should be roughly equal to that of layer l-1. So from the previous equation, we conclude that
As mentioned before, though ReLU is not differentiable at z=0, we assume that its derivative is zero or one at this point (here we assume it is one). So the derivative of ReLU is
Since half of the values of g’(z) are 1 and the other half are zero, its mean will be
and the distance of each value of g’(z) from its mean will be 0.5. So we can write
g’(z_i^[l]) is a function of z_i^[l], and δ_k^[l+1] has a very weak dependence on z_i^[l], so we can assume that g’(z_i^[l]) and δ_k^[l+1] are independent. In addition, g’(z_i^[l]) is independent of the weights in layer l+1. Hence we can write
Based on Eqs. 16 and 29 we can write
As we showed for the Xavier method, the effect of a single weight on the activations and errors of the output layer is negligible, so we can assume that each activation in the output layer is independent of each weight in the network. So
and the last term on the right-hand side of Eq. 88 becomes zero. By plugging the mean and variance of g’(z) into Eq. 88 we get
Now we can use Eqs. 29, 31, 32, and 87 to simplify it
The right-hand side of this equation does not depend on i, so the variance of all the errors in layer l will be the same, and this is also true for all the other layers. So, we can write
Similar to the Xavier method, the mean of the error is the same for all layers, and we want its variance to remain the same. So from Eq. 91, we get
This variance can be expressed as the harmonic mean of the variances given in Eqs. 83 and 92
So we can pick the weights from a normal distribution with a mean of zero and a variance of Eq. 93.
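A sketch of He initialization, assuming Eq. 83 gives the forward-pass condition Var(w^[l]) = 2/n^[l-1], with an empirical check that it keeps the net-input variance stable across a deep ReLU stack; the depth and width here are arbitrary:

```python
import numpy as np

def he_init(n_in, n_out, rng=np.random.default_rng()):
    # He initialization for ReLU layers: Var(w) = 2 / n_in
    return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_out, n_in))

# Push normalized inputs through a deep stack of ReLU layers; with
# Var(w) = 2 / n_in the variance of the net input neither vanishes
# nor explodes
rng = np.random.default_rng(0)
n, depth = 500, 20
a = rng.normal(size=n)   # normalized input features
for _ in range(depth):
    z = he_init(n, n, rng) @ a
    a = np.maximum(z, 0.0)   # ReLU
```

With the LeCun variance 1/n instead, the same stack would shrink the signal by roughly a factor of 2 per layer, which is why the extra factor of 2 matters for ReLU.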