INITIALIZATION METHODS


Why not simply initialize all weights to 0?

Consider a simple neural network that uses the logistic activation function.

Let us initialize all the weights to 0 and do a forward pass; both hidden units then have the same pre-activation (a₁₁ = a₁₂) and the same output (h₁₁ = h₁₂).

Now let us look at the gradients Δw₁₁ and Δw₂₁.

Substituting h₁₁ = h₁₂ and a₁₁ = a₁₂ into these gradient expressions gives Δw₁₁ = Δw₂₁.

We can see that if we initialize the weights to equal values, then the equality carries through to the gradients associated with these weights, thereby keeping the weights equal throughout the training.

This is often discussed as the symmetry breaking problem: if you start with equal initial weights, the symmetry is never broken and the weights remain equal throughout training.

Hence, weights connected to the same neuron should never be initialized to the same value.
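To make the symmetry argument concrete, here is a minimal NumPy sketch (my own toy example, not from the original article) of a 2-2-1 logistic network where every weight starts at the same constant. The two hidden units compute identical activations and receive identical gradients, so a gradient step keeps them identical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 2-2-1 network with logistic activations; every weight starts at the
# same constant (0.5), exactly the "equal initialization" case above.
rng = np.random.default_rng(0)
x = rng.random((4, 2))                  # 4 samples, 2 input features
y = rng.integers(0, 2, size=(4, 1))     # binary targets

W1 = np.full((2, 2), 0.5); b1 = np.zeros((1, 2))
W2 = np.full((2, 1), 0.5); b2 = np.zeros((1, 1))

# Forward pass
a1 = x @ W1 + b1        # pre-activations a11, a12 of the hidden layer
h1 = sigmoid(a1)        # activations h11, h12
a2 = h1 @ W2 + b2
h2 = sigmoid(a2)

# Backward pass (binary cross-entropy with a sigmoid output)
d_a2 = h2 - y
d_W2 = h1.T @ d_a2
d_h1 = d_a2 @ W2.T
d_a1 = d_h1 * h1 * (1.0 - h1)
d_W1 = x.T @ d_a1

print(np.allclose(h1[:, 0], h1[:, 1]))      # True: h11 == h12 for every sample
print(np.allclose(d_W1[:, 0], d_W1[:, 1]))  # True: both hidden units receive identical gradients
```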

Some conclusions we can make are as follows:

→ Never initialize all weights to 0

→ Never initialize all weights to the same value

RANDOM INITIALIZATION

As we have seen, initializing the weights to zero or to equal values is not good. Here we will discuss initializing the weights with small random values.

Assigning random values to the weights is better than assigning zeros or equal values. But there are two things to keep in mind: what happens if the weights are initialized to very high or very low values, and what a reasonable scale for the initial weight values is.

Initializing the weights randomly from a standard normal distribution while working with a (deep) network can lead to two issues: vanishing gradients or exploding gradients.

a) Vanishing gradients: In deep networks, abs(dW) gets smaller and smaller as we go backwards through the layers during backpropagation, particularly with saturating activation functions. The earlier layers are the slowest to train in such a case.

The weight update is minor and results in slower convergence. This makes the optimization of the loss function slow. In the worst case, this may completely stop the neural network from training further.

More specifically, in the case of sigmoid(z) and tanh(z), if your weights are large, the pre-activations fall in the saturated regions of these functions, where the derivative is nearly zero. The gradient then becomes vanishingly small, effectively preventing the weights from changing their value, and abs(dW) keeps shrinking with every iteration. With ReLU(z), vanishing gradients are generally not a problem, as the gradient is 0 for negative (and zero) inputs and 1 for positive inputs.
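As a quick numeric check of this claim, the small sketch below (my own illustration) evaluates the sigmoid derivative at increasingly large pre-activations; it falls from its maximum of 0.25 towards zero, which is what starves the earlier layers of gradient when the weights are large.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Large weights produce large pre-activations z, and the sigmoid gradient
# collapses towards zero as |z| grows.
z = np.array([0.0, 2.0, 5.0, 10.0])
print(dsigmoid(z))   # ~[0.25, 0.105, 0.0066, 0.000045]
```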

b) Exploding gradients: This is the exact opposite of vanishing gradients. Suppose the weights are large and non-negative while the activations stay small and bounded (as can be the case for sigmoid(z)). When these large weights are multiplied along the layers, they cause a large change in the cost, so the gradients are also going to be large. This means that the update W = W - α * dW takes huge steps, and the downward movement overshoots.

This may result in oscillating around the minimum or repeatedly overshooting the optimum, and the model will never learn!

Another impact of exploding gradients is that the huge gradient values may cause numeric overflow, resulting in incorrect computations or the introduction of NaNs. This may also lead to the loss itself taking the value NaN.
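The toy forward pass below (my own sketch, with made-up layer sizes) shows the scale problem directly: with weights drawn from a standard normal distribution, the pre-activations of a wide tanh layer are huge and the activations saturate near ±1 (vanishing gradients), while with a tiny scale the activations shrink towards zero layer by layer. The same multiplicative effect in the backward pass is what produces vanishing or exploding gradients.

```python
import numpy as np

rng = np.random.default_rng(42)
n_units, n_layers = 512, 10
x = rng.standard_normal((1000, n_units))

for scale in (1.0, 0.01):
    h = x
    for _ in range(n_layers):
        W = rng.standard_normal((n_units, n_units)) * scale  # N(0, scale^2) weights
        h = np.tanh(h @ W)
    # scale 1.0 : |h| is ~1 everywhere -> saturated tanh, near-zero gradients
    # scale 0.01: h shrinks towards 0  -> activations (and gradients) vanish
    print(f"scale={scale:5.2f}  mean |activation| after {n_layers} layers: {np.abs(h).mean():.2e}")
```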

We will first see the wrong ways to initialize the weights, and then the right ways to initialize them.

Why shouldn’t you initialize all the weights to large values?

Consider again a neural network that uses the logistic activation function.

Here, the input values are normalized (0–1), but the weights are initialized to large values; tanh and logistic neurons saturate for large pre-activations.

This results in the activation functions attaining saturation, where their gradients are close to zero.

Even if we do not normalize the inputs, large input values can likewise push the neurons into saturation.

Thus, a few noteworthy points are:

→ Always normalize the inputs so that they lie between 0 and 1. If not, very large or very small input values can also contribute to saturation (a short normalization sketch follows this list)

→ Never initialize weights to large values

→ Never initialize all weights to 0

→ Never initialize all weights to the same value
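As a sketch of the first point, here is simple min-max scaling that brings each input feature into the 0–1 range; the feature values below are made up for illustration.

```python
import numpy as np

def min_max_normalize(X, eps=1e-12):
    """Scale each column of X to the [0, 1] range."""
    X_min = X.min(axis=0)
    X_max = X.max(axis=0)
    return (X - X_min) / (X_max - X_min + eps)  # eps guards against constant columns

# Made-up raw features with very different scales (e.g. age in years, income in dollars)
X_raw = np.array([[25.0,  40_000.0],
                  [37.0,  85_000.0],
                  [61.0, 120_000.0]])
X = min_max_normalize(X_raw)
print(X.min(axis=0), X.max(axis=0))   # each feature now lies in [0, 1]
```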

XAVIER & HE INITIALIZATION

Before discussing further initialization techniques, let's first define fan-in and fan-out:

fan-in is the number of incoming network connections to a layer

fan-out is the number of outgoing network connections from that layer

XAVIER INITIALIZATION:

Consider the pre-activation of the second layer, a₂ = Σᵢ (w₂ᵢ · h₁ᵢ) + b₂.

If you look at the pre-activation for the second layer, a₂, it is a weighted sum of inputs from the previous layer (the post-activation output of the first layer) and the bias. If the number of inputs to the second layer is very large, there is a possibility that the aggregation a₂ would blow up. So it makes sense that these weights should be inversely proportional to the number of input neurons in the previous layer.

If the weights are inversely proportional to the number of input neurons, and the number of input neurons is very large (which is common in a deep neural network), all these weights take on small values because of the inverse relationship. Hence the net pre-activation aggregation stays small instead of blowing up. This method of initialization is known as Xavier initialization.

More specifically, there are two common variants of Xavier/Glorot initialization: uniform and normal.

The uniform variant sets a layer's weights to values drawn from a random uniform distribution bounded between -√6/√(nᵢ + nᵢ₊₁) and +√6/√(nᵢ + nᵢ₊₁),

where nᵢ is the number of incoming network connections, or “fan-in,” to the layer, and nᵢ₊₁ is the number of outgoing network connections from that layer, also known as the “fan-out.”

Use Xavier initialization in the case of tanh and logistic activation
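A small NumPy sketch of both Xavier/Glorot variants under the formulas above; the helper names are my own, and frameworks such as PyTorch and TensorFlow ship equivalent initializers. The normal variant uses the standard Glorot choice of variance 2/(fan-in + fan-out).

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    # W ~ U[-limit, +limit] with limit = sqrt(6) / sqrt(fan_in + fan_out)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def xavier_normal(fan_in, fan_out):
    # W ~ N(0, 2 / (fan_in + fan_out))
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W1 = xavier_uniform(784, 256)   # e.g. a 784 -> 256 dense layer
W2 = xavier_normal(256, 10)
print(W1.std(), W2.std())       # both close to sqrt(2 / (fan_in + fan_out))
```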

HE (He-et-al) INITIALIZATION

Pronounced as "Hey" initialization. It was introduced in 2015 by He et al. and is similar to Xavier initialization.

In He-normal initialization, the weights are drawn from a normal distribution with zero mean and a variance of 2/m, where m is the fan-in of the layer, i.e. the usual 1/m variance factor multiplied by two.

It is used for ReLU and Leaky ReLU

Here the standard deviation is √(2/m), i.e. we divide by the square root of m/2, because of the rough intuition that with ReLU around half of the neurons output zero (are inactive) at any given time.
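Continuing the same style of sketch, a He-normal initializer with variance 2/fan-in (again my own helper; deep learning frameworks provide this as He/Kaiming initialization).

```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(fan_in, fan_out):
    # W ~ N(0, 2 / fan_in), i.e. std = sqrt(2 / m) = 1 / sqrt(m / 2) with m = fan_in
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = he_normal(512, 256)              # a 512 -> 256 dense layer feeding ReLU units
print(W.std(), np.sqrt(2.0 / 512))   # the empirical std matches sqrt(2 / fan_in)
```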

Best Practices

  1. Use ReLU / Leaky ReLU as the activation function, as it is relatively robust to the vanishing/exploding gradient issue (especially for networks that are not too deep). Leaky ReLUs never have a 0 gradient, so their units never die and training continues.
  2. Xavier initialization is mostly used with the tanh and logistic activation functions.
  3. He initialization is mostly used with ReLU or its variants such as Leaky ReLU.
  4. For deep networks, we can use a heuristic to initialize the weights depending on the non-linear activation function; the heuristics for each type of initialization were explained above.
  5. Gradient clipping: This is another way of dealing with the exploding gradient problem. We set a threshold value, and if a chosen norm of the gradient exceeds this threshold, we rescale the gradient. For example, normalize the gradient when its L2 norm exceeds the threshold: dW = dW * threshold / l2_norm(dW) if l2_norm(dW) > threshold. A minimal sketch follows this list.
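A minimal sketch of the gradient clipping rule in point 5; the gradient dW here is a random placeholder array, whereas in practice it would come from backpropagation.

```python
import numpy as np

def clip_by_l2_norm(dW, threshold):
    """Rescale the gradient so its L2 norm never exceeds `threshold`."""
    norm = np.linalg.norm(dW)
    if norm > threshold:
        dW = dW * (threshold / norm)
    return dW

rng = np.random.default_rng(0)
dW = rng.standard_normal((256, 128)) * 10.0   # a deliberately "exploding" gradient
dW_clipped = clip_by_l2_norm(dW, threshold=5.0)
print(np.linalg.norm(dW), np.linalg.norm(dW_clipped))   # e.g. ~1810.0 vs 5.0
```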