Source: Deep Learning on Medium
I’d like to invite you to join me on an exploration through different approaches to initializing layer weights in neural networks. Step-by-step, through various short experiments and thought exercises, we’ll discover why adequate weight initialization is so important in training deep neural nets. Along the way we’ll cover various approaches that researchers have proposed over the years, and finally drill down on what works best for the contemporary network architectures that you’re most likely to be working with.
The examples to follow come from my own re-implementation of a set of notebooks that Jeremy Howard covered in the latest version of fast.ai’s Deep Learning Part II course, currently being held this spring, 2019, at USF’s Data Institute.
Why Initialize Weights
The aim of weight initialization is to prevent layer activation outputs from exploding or vanishing during the course of a forward pass through a deep neural network. If either occurs, loss gradients will either be too large or too small to flow backwards beneficially, and the network will take longer to converge, if it is even able to do so at all.
Matrix multiplication is the essential math operation of a neural network. In deep neural nets with several layers, one forward pass simply entails performing consecutive matrix multiplications at each layer, between that layer’s inputs and weight matrix. The product of this multiplication at one layer becomes the inputs of the subsequent layer, and so on and so forth.
For a quick-and-dirty example that illustrates this, let’s pretend that we have a vector x that contains some network inputs. It’s standard practice when training neural networks to ensure that our inputs’ values are scaled such that they fall inside such a normal distribution with a mean of 0 and a standard deviation of 1.
Let’s also pretend that we have a simple 100-layer network with no activations , and that each layer has a matrix a that contains the layer’s weights. In order to complete a single forward pass we’ll have to perform a matrix multiplication between layer inputs and weights at each of the hundred layers, which will make for a grand total of 100 consecutive matrix multiplications.
It turns out that initializing the values of layer weights from the same standard normal distribution to which we scaled our inputs is never a good idea. To see why, we can simulate a forward pass through our hypothetical network.
Whoa! Somewhere during those 100 multiplications, the layer outputs got so big that even the computer wasn’t able to recognize their standard deviation and mean as numbers. We can actually see exactly how long that took to happen.
The activation outputs exploded within 29 of our network’s layers. We clearly initialized our weights to be too large.
Unfortunately, we also have to worry about preventing layer outputs from vanishing. To see what happens when we initialize network weights to be too small — we’ll scale our weight values such that, while they still fall inside a normal distribution with a mean of 0, they have a standard deviation of 0.01.
During the course of the above hypothetical forward pass, the activation outputs completely vanished.
To sum it up, if weights are initialized too large, the network won’t learn well. The same happens when weights are initialized too small.
How can we find the sweet spot?
Remember that as mentioned above, the math required to complete a forward pass through a neural network entails nothing more than a succession of matrix multiplications. If we have an output y that is the product of a matrix multiplication between our input vector x and weight matrix a, each element i in y is defined as
where i is a given row-index of weight matrix a, k is both a given column-index in weight matrix a and element-index in input vector x, and n is the range or total number of elements in x. This can also be defined in Python as:
y[i] = sum([c*d for c,d in zip(a[i], x)])
We can demonstrate that at a given layer, the matrix product of our inputs x and weight matrix a that we initialized from a standard normal distribution will, on average, have a standard deviation very close to the square root of the number of input connections, which in our example is √512.
This property isn’t surprising if we view it in terms of how matrix multiplication is defined: in order to calculate y we sum 512 products of the element-wise multiplication of one element of the inputs x by one column of the weights a. In our example where both x and a are initialized using standard normal distributions, each of these 512 products would have a mean of 0 and standard deviation of 1.
It then follows that the sum of these 512 products would have a mean of 0, variance of 512, and therefore a standard deviation of √512.
And this is why in our example above we saw our layer outputs exploding after 29 consecutive matrix multiplications. In the case of our bare-bones 100-layer network architecture, what we’d like is for each layer’s outputs to have a standard deviation of about 1. This conceivably would allow us to repeat matrix multiplications across as many network layers as we want, without activations exploding or vanishing.
If we first scale the weight matrix a by dividing all its randomly chosen values by √512, the element-wise multiplication that fills in one element of the outputs y would now, on average, have a variance of only 1/√512.
This means that the standard deviation of the matrix y, which contains each of the 512 values that are generated by way of the matrix multiplication between inputs x and weights a, would be 1. Let’s confirm this experimentally.
Now let’s re-run our quick-and-dirty 100-layer network. As before, we first choose layer weights at random from standard normal distribution inside [-1,1], but this time we scale those weights by 1/√n, where n is the number of network input connections at a layer, which is 512 in our example.
Success! Our layer outputs neither exploded nor vanished, even after 100 of our hypothetical layers.
While at first glance it may seem like at this point we can call it a day, real-world neural networks aren’t quite as simple as our first example may seem to indicate. For the sake of simplicity, activation functions were omitted. However, we’d never do this in real life. It’s thanks to the placement of these non-linear activation functions at the tail end of network layers, that deep neural nets are able create close approximations of intricate functions that describe real-world phenomena, which can then be used to generate astoundingly impressive predictions, such as the classification of handwriting samples.
Up until a few years ago, most commonly used activation functions were symmetric about a given value, and had ranges that asymptotically approached values that were plus/minus a certain distance from this midpoint. The hyperbolic tangent and softsign functions exemplify this class of activations.
We’ll add a hyperbolic tangent activation function after each layer our hypothetical 100-layer network, and then see what happens when we use our home-grown weight initialization scheme where layer weights are scaled by 1/√n.
The standard deviation of activation outputs of the 100th layer is down to about 0.06. This is definitely on the small side, but at least activations haven’t totally vanished!
As intuitive as the journey to discovering our home-grown weight init strategy may now seem in retrospect, you may be surprised to hear that as recently as 2010, this was not the conventional approach for initializing weight layers.
When Xavier Glorot and Yoshua Bengio published their landmark paper titled Understanding the difficulty of training deep feedforward neural networks, the “commonly used heuristic” to which they compared their experiments was that of initializing weights from a uniform distribution in [-1,1] and then scaling by 1/√n.
It turns out this “standard” approach doesn’t actually work that well.
Re-running our 100-layer tanh network with “standard” weight initialization caused activation gradients to become infinitesimally small — they’re just about as good as vanished.
This poor performance is actually what spurred Glorot and Bengio to propose their own weight initialization strategy, which they referred to as “normalized initialization” in their paper, and which is now popularly termed “Xavier initialization.”
Xavier initialization sets a layer’s weights to values chosen from a random uniform distribution that’s bounded between
where nᵢ is the number of incoming network connections, or “fan-in,” to the layer, and nᵢ₊₁ is the number of outgoing network connections from that layer, also known as the “fan-out.”
Glorot and Bengio believed that Xavier weight initialization would maintain the variance of activations and back-propagated gradients all the way up or down the layers of a network. In their experiments they observed that Xavier initialization enabled a 5-layer network to maintain near identical variances of its weight gradients across layers.
Conversely, it turned out that using “standard” initialization brought about a much bigger gap in variance between weight gradients at the network’s lower layers, which were higher, and those at its top-most layers, which were approaching zero.
To drive the point home, Glorot and Bengio demonstrated that networks initialized with Xavier achieved substantially quicker convergence and higher accuracy on the CIFAR-10 image classification task.
Let’s re-run our 100-layer tanh network once more, this time using Xavier initialization:
In our experimental network, Xavier initialization performs pretty identical to the home-grown method that we derived earlier, where we sampled values from a random normal distribution and scaled by the square root of number of incoming network connections, n.
Conceptually, it makes sense that when using activation functions that are symmetric about zero and have outputs inside [-1,1], such as softsign and tanh, we’d want the activation outputs of each layer to have a mean of 0 and a standard deviation around 1, on average. This is precisely what our home-grown method and Xavier both enable.
But what if we’re using ReLU activation functions? Would it still make sense to want to scale random initial weight values in the same way?
To see what would happen, let’s use a ReLU activation instead of tanh in one of our hypothetical network’s layers and observe the expected standard deviation of its outputs.
It turns out that when using a ReLU activation, a single layer will, on average have standard deviation that’s very close to the square root of the number of input connections, divided by the square root of two, or √512/√2 in our example.
Scaling the values of the weight matrix a by this number will cause each individual ReLU layer to have a standard deviation of 1 on average.
As we showed before, keeping the standard deviation of layers’ activations around 1 will allow us to stack several more layers in a deep neural network without gradients exploding or vanishing.
This exploration into how to best initialize weights in networks with ReLU-like activations is what motivated Kaiming He et. al. to propose their own initialization scheme that’s tailored for deep neural nets that use these kinds of asymmetric, non-linear activations.
In their 2015 paper, He et. al. demonstrated that deep networks (e.g. a 22-layer CNN) would converge much earlier if the following input weight initialization strategy is employed:
- Create a tensor with the dimensions appropriate for a weight matrix at a given layer, and populate it with numbers randomly chosen from a standard normal distribution.
- Multiply each randomly chosen number by √2/√n where n is the number of incoming connections coming into a given layer from the previous layer’s output (also known as the “fan-in”).
- Bias tensors are initialized to zero.
We can follow these directions to implement our own version of Kaiming initialization and verify that it can indeed prevent activation outputs from exploding or vanishing if ReLU is used at all layers of our hypothetical 100-layer network.
As a final comparison, here’s what would happen if we were to use Xavier initialization, instead.
Ouch! When using Xavier to initialize weights, activation outputs have almost completely vanished by the 100th layer!
Incidentally, when they trained even deeper networks that used ReLUs, He et. al. found that a 30-layer CNN using Xavier initialization stalled completely and didn’t learn at all. However, when the same network was initialized according to the three-step procedure outlined above, it enjoyed substantially greater convergence.
The moral of the story for us is that any network we train from scratch, especially for computer vision applications, will almost certainly contain ReLU activation functions and be several layers deep. In such cases, Kaiming should be our go-to weight init strategy.
Yes, You Too Can Be a Researcher
Even more importantly, I’m not ashamed to admit that I felt intimidated when I saw the Xavier and Kaiming formulas for the first time. What with their respective square roots of six and two, part of me couldn’t help but feel like they must have been the result of some sort of oracular wisdom I couldn’t hope to fathom on my own. And let’s face it, sometimes the math in deep learning papers can look a lot like hieroglyphics, except with no Rosetta Stone to aid in translation.
But I think the journey we took here showed us that this knee-jerk response of feeling of intimidated, while wholly understandable, is by no means unavoidable. Although the Kaiming and (especially) the Xavier papers do contain their fair share of math, we saw firsthand how experiments, empirical observation, and some straightforward common sense were enough to help derive the core set of principals underpinning what is currently the most widely-used weight initialization scheme.
Alternately put: when in doubt, be courageous, try things out, and see what happens!