Why BatchNorm works


Normalisation techniques are some of the most useful tools we have while analysing any form of data. A simple operation of adjusting the mean and variance of a distribution lies behind the remarkable success of various normalising techniques in deep neural networks, one of them being the famous batch normalisation.

Everyone’s heard of it

Not many know about its characteristics

We’ll analyse the second-order properties of the error surface for models where the error is quadratic in the weights, then reparametrize the surface to get an analysis in terms of the eigenspectrum of the covariance matrix of the input, and then use the results to infer why batchnorm might improve the rate of convergence and why it makes the network invariant to initialization.

This discussion is based on the work of LeCun, Kanter and Solla (1991) on second-order properties of error surfaces, and on a second paper analysing batch normalisation, referred to below as paper2.

For learning algorithms based on gradient descent that search the space of weights {W} for optimal values W* at which the error function E(W) is minimal, the learning behaviour is controlled by the second-order properties of the E(W) surface.
For this analysis, I’ll focus on a single layer of a neural network, with X (Nx1) as the input vector to this layer, W (Nx1) as the weights of this layer and Y (1x1) as the output of this layer. For convenience, consider the mean squared error:

E(W) = (1/2p)SUMMATION( || Yu - W.T*Xu ||² , 1<=u<=p )

where p is the size of the dataset and Xu is the uth input vector.

E(W) = (1/2p)SUMMATION( (Yu - W.T*Xu)² , 1<=u<=p )

Now, the error surface is clearly quadratic in the weights, so we can rewrite it as:

E(W) = (1/2)(W.T*R*W — 2Q.T*W + c)

such that

The (i,j)th entry of R = (1/p)SUMMATION(xui*xuj, 1<=u<=p), where xui is the ith component of the uth input vector (Nx1); therefore R is the NxN input covariance matrix.

Q is an N-dimensional vector whose ith component is (1/p)SUMMATION(Yu*xui, 1<=u<=p)

and c is just a constant, c = (1/p)SUMMATION(Yu², 1<=u<=p)

Now, the gradient of E(W) w.r.t. W is J(E) = RW-Q while Hessian matrix of second derivatives is clearly H = R

Any W* that minimises E(W) must have zero gradient, so the solution space is contained in the set of solutions of the linear equation

RW = Q (from J(E)=0)

Now, RW = Q is a common type of problem in linear algebra (Ax = b); many problems in engineering applications can be solved by formulating them in this form and then analysing the eigenspectrum of A.
As the solution space W* is a subset of the solutions of RW = Q, if all the columns of R are linearly independent, R has full rank and this subspace collapses to a single point (a unique solution).
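To make this concrete, here is a minimal NumPy sketch (synthetic data, all variable names hypothetical) that builds R, Q and c from random inputs, checks that the quadratic form above matches the mean squared error, and recovers the unique W* by solving RW = Q:

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 500, 10                            # p samples, N-dimensional inputs
X = rng.normal(size=(p, N))               # rows are the input vectors Xu
Y = X @ rng.normal(size=N) + 0.1 * rng.normal(size=p)   # scalar targets Yu

R = X.T @ X / p                           # (1/p) * sum of Xu * Xu.T
Q = X.T @ Y / p                           # (1/p) * sum of Yu * Xu
c = np.mean(Y**2)                         # (1/p) * sum of Yu^2

W = rng.normal(size=N)                    # an arbitrary weight vector
mse  = 0.5 * np.mean((Y - X @ W)**2)      # E(W) written as a mean squared error
quad = 0.5 * (W @ R @ W - 2 * Q @ W + c)  # E(W) written as a quadratic form
assert np.isclose(mse, quad)

W_star = np.linalg.solve(R, Q)            # unique solution of R W = Q (R has full rank here)
assert np.allclose(R @ W_star - Q, 0)     # the gradient J(E) = R W - Q vanishes at W*
```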

Now all this is good, but why do this ???

The eigenvectors of R define the principal axes of E(W) when the loss surface is quadratic in the weights (a proof of this can be found in “Active Control of Noise and Vibration” (Colin Hansen, Scott Snyder), section 6.5.2),
and as we can compute the second derivative of a surface in the direction of any unit vector u by

H` = u.T*H*u,

choosing u to be an eigenvector of H, we get the second derivative in the direction of each principal axis of E(W), which, as we know from the diagonalization of a matrix, is the corresponding eigenvalue of H (= R).

We already have our gradient expressed in terms of R and our Hessian equal to R, and as we can define the principal axes of E(W) in terms of R, we will apply two transformations to the space containing our error surface:

  • Centring on the solution point, by a basic translation V` = W - W*
  • Rotation: we rotate V` onto the principal axes of E(W), so that we can analyse the Hessian along those axes.

Now, multiplying a vector by a matrix results in a spatial transformation (rotation, reflection, scaling, or squishing the space onto a lower-dimensional subspace). To achieve the rotation we use the eigenvector matrix of R, whose columns are the mutually orthogonal, normalised eigenvectors of R; calling this matrix U (the usual convention), it is orthonormal and its columns have the same dimension as W. Rotating V` onto the principal axes then gives V = U.T*V`

The complete transformation is

V = U.T*(W - W*)

Based on this our equation of E(W) gets reparameterised as :

E(V) = (1/2)V.T*D*V + E_o

where D is the diagonal matrix containing the eigenvalues of R on its diagonal, computed as D = U.T*R*U (also called the diagonalisation of a matrix),
and E_o is just E(W*).

Remember that V is still a vector, but each component of V is the coordinate along the corresponding principal axis of the error surface.
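A quick numerical check of this reparametrisation (same kind of synthetic setup as the sketch above; note that np.linalg.eigh returns the eigenvectors of R as the columns of U, matching the convention used here):

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 500, 10
X = rng.normal(size=(p, N))
Y = X @ rng.normal(size=N) + 0.1 * rng.normal(size=p)
R, Q, c = X.T @ X / p, X.T @ Y / p, np.mean(Y**2)

E = lambda W: 0.5 * np.mean((Y - X @ W)**2)
W_star = np.linalg.solve(R, Q)

eigvals, U = np.linalg.eigh(R)            # U[:, j] is the eigenvector for eigvals[j]
D = np.diag(eigvals)
assert np.allclose(U.T @ R @ U, D)        # diagonalisation of R

W = rng.normal(size=N)
V = U.T @ (W - W_star)                    # centre on W*, then rotate onto the principal axes
assert np.isclose(E(W), 0.5 * V @ D @ V + E(W_star))   # E(V) = (1/2) V.T D V + E_o

j = 3                                     # second derivative along the j-th principal axis
u = U[:, j]                               # equals the j-th eigenvalue: u.T H u = λ_j
assert np.isclose(u @ R @ u, eigvals[j])
```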

Then,

PartialDeriv(E,vj) = λ_j*vj

λ_j is jth element on the diagonal of D, basically eigenvalue corresponding to the eigenvector stored in jth column of U

SecondPartialDeriv(E, vj, vk) = λ_j*kronecker_delta(j,k)

Clearly the eigenvalues of the input covariance matrix give the second derivatives of the error surface w.r.t. its principal axes.

Putting these two results into matrix form gives the Hessian as the diagonal eigenvalue matrix D (H = D), and the gradient J(E(V)) = DV

As the basic weight update rule is:
V(k+1) = V(k) - eta*J(E(V))
V(k+1) = V(k) - eta*D*V(k)

V(k) is value of V at kth timestep

V being an Nx1 vector, this results in N decoupled equations (D is diagonal, so each component of V evolves independently). As V = U.T*(W - W*), V must decay to zero at the optimal solution, so each component evolves along its principal direction as

vj(k) = (1-eta*λ_j)^k * vj(0)

vj(0) is jth component of V at initialization
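A small sketch verifying this closed-form decay on a random quadratic (synthetic R and W*, names hypothetical): plain gradient descent is run on E(W) and each principal component is compared against (1 - eta*λ_j)^k * vj(0).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
A = rng.normal(size=(N, N))
R = A @ A.T / N                          # a full-rank "input covariance" matrix
W_star = rng.normal(size=N)              # pick the optimum, then define Q = R W*
Q = R @ W_star

eigvals, U = np.linalg.eigh(R)
eta = 1.0 / eigvals.max()                # a safe learning rate, below 2/λ_max

W = rng.normal(size=N)
v0 = U.T @ (W - W_star)                  # initial coordinates along the principal axes
for k in range(1, 51):
    W = W - eta * (R @ W - Q)            # gradient step with J(E) = R W - Q
    v_pred = (1 - eta * eigvals)**k * v0 # closed-form per-mode decay
    assert np.allclose(U.T @ (W - W_star), v_pred)
```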

Now for this vj to converge to zero over multiple timesteps(k>0),

|1-eta*λ_j| < 1
0 < eta < 2/λ_j

As this is just a formulation of exponential decay, vj decays to zero in the characteristic time

tj = 1/(eta*λ_j)

  • For eta in the range (1/λ_j, 2/λ_j):
    1 - eta*λ_j is negative, hence vj converges with oscillatory behaviour (the step size is large)
  • For eta in the range (0, 1/λ_j):
    the step size is small, and convergence requires a high value of k (many timesteps)
  • If eta = 1/λ_j, when such a choice is possible, convergence is reached in a single iteration.
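These three regimes are easy to see for a single mode; a toy snippet (the curvature lam and the step sizes are chosen arbitrarily):

```python
import numpy as np

lam, v0, steps = 2.0, 1.0, 8             # curvature λ_j, initial component vj(0)

def trajectory(eta):
    # vj(k) = (1 - eta*λ_j)^k * vj(0)
    return [(1 - eta * lam)**k * v0 for k in range(steps)]

print(trajectory(0.75 / lam))   # small step: slow, monotone decay towards 0
print(trajectory(1.00 / lam))   # eta = 1/λ_j: exactly 0 after one step
print(trajectory(1.60 / lam))   # large step: the sign flips each iteration, but it still decays
```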

Now, if all the eigenvalues are equal, λ_j = λ for all 1<=j<=N, then since H = D, convergence is reached in a single step with

learning rate = 1/λ

but this highly symmetric case rarely occurs. Geometrically, it corresponds to the cross-sections of the error surface E(W) being hyper-spheres in the N-dimensional space {W} (with all eigenvalues equal, the surface is stretched or squished equally along every principal direction, hence a hyper-sphere), but this is very rare. In general the cross-sections of E(W) are elliptical, with different eigenvalues along different principal directions.

Based on the above equations, eta must be chosen in the range (0, 1/λ_max), λ_max being the largest eigenvalue of the input covariance matrix, and the slowest time constant is

t_max = 1/(eta*λ_min).

The optimal step size eta=1/λ_max gives,

t_max = (λ_max/λ_min)

to decay along the principal direction of smallest non-zero curvature (basically the eigenvector with the smallest non-zero eigenvalue, i.e. λ_min).
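A rough numerical sketch of what this ratio means in practice: with eta = 1/λ_max, the number of steps needed for the slowest mode to shrink by a fixed factor grows roughly like the condition number λ_max/λ_min.

```python
import numpy as np

def steps_to_decay(lam_max, lam_min, tol=1e-3):
    eta = 1.0 / lam_max                      # optimal step size from above
    rate = 1 - eta * lam_min                 # per-step factor of the slowest mode
    return int(np.ceil(np.log(tol) / np.log(rate)))

for cond in (2, 10, 100):                    # condition number λ_max/λ_min
    print(cond, steps_to_decay(lam_max=float(cond), lam_min=1.0))
# roughly 10, 66 and 688 steps: the count scales like cond * ln(1/tol)
```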

This clearly shows that, for an error surface *quadratic* in the weights, the learning dynamics are controlled by the eigenvalue *distribution* of the Hessian.
Now that we know this, why stop here? Let’s examine the eigenspectrum.

As H = R,

Rij = (1/p)SUMMATION(xui*xuj, 1<=u<=p)

It is assumed that the input components {xi}, 1<=i<=N, are independent, and that each component is drawn from a distribution with mean m and variance v (all components are taken to have the same mean and variance for convenience). With p as the dataset size and N as the input dimension, let α = p/N (a ratio quantifying the size of the training set). The eigenspectrum of R then has a continuous part supported on the interval (λ-, λ+) (the complete derivation of the eigenspectrum can be found in the work of LeCun, Kanter and Solla, 1991),

where

λ- = [(1 - sqrt(α))²/α]v

λ+ = [(1 + sqrt(α))²/α]v

For λ ∈ (λ-, λ+) the spectral density is continuous, and in the limit of p and α tending to infinity, λ- = λ+ = v.
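A quick empirical check of these edges (zero-mean i.i.d. Gaussian inputs are assumed here, so there is no isolated outlier eigenvalue; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p, N, v = 2000, 500, 1.0                  # alpha = p/N = 4
X = rng.normal(0.0, np.sqrt(v), size=(p, N))
R = X.T @ X / p
eig = np.linalg.eigvalsh(R)

alpha = p / N
lam_minus = (1 - np.sqrt(alpha))**2 / alpha * v
lam_plus  = (1 + np.sqrt(alpha))**2 / alpha * v
print(eig.min(), lam_minus)               # both close to 0.25 for these sizes
print(eig.max(), lam_plus)                # both close to 2.25
```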

  • As should be clear by now, if p < N, the number of independent columns of R will be at most p, so R won’t have full rank: (N-p) of the N eigenvalues will be zero, which shows up as a delta function at λ = 0 with weight (1-α) (basically, the spectral density at eigenvalue zero is 1-α).
  • A large isolated eigenvalue of order N, call it λ_N, is present when the inputs are biased, i.e. m ≠ 0. This can be clearly understood by considering
    the structure of R in the p –> inf limit:
    1. All off-diagonal elements are equal to m² (the product of the means of the two components involved)
    2. All diagonal elements are equal to v + m²

So the eigenvector U_N = (1,…,1) corresponds to the eigenvalue

λ_N = Nm²+v

(R*U_N = (Nm²+v)*U_N)

and all the other N-1 eigenvalues are equal to v (consistent with the fact that trace(R) = sum of eigenvalues = N(m² + v)). Moreover, the continuous part of the spectrum collapses onto a delta function at λ- = λ+ = v as p -> inf, and only one value (λ_N) is greater than λ+. The largest part of λ_N is eliminated if m = 0, i.e. centred inputs with no bias, which reduces λ_N significantly; and since λ_N was the largest eigenvalue (λ_max), reducing it reduces t_max significantly, based on the previous equations, clearly showing that using centred inputs improves the learning time.
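A small sketch of this effect (i.i.d. Gaussian inputs with a non-zero mean m; the sizes are arbitrary): the biased inputs produce one outlier eigenvalue near N*m² + v, and centring the inputs removes it.

```python
import numpy as np

rng = np.random.default_rng(0)
p, N, m, v = 20000, 100, 0.5, 1.0
X = rng.normal(m, np.sqrt(v), size=(p, N))        # biased inputs, mean m != 0

eig = np.sort(np.linalg.eigvalsh(X.T @ X / p))
print(eig[-1], N * m**2 + v)                      # outlier close to N*m^2 + v = 26
print(eig[-2])                                    # the rest of the spectrum stays near v

Xc = X - X.mean(axis=0)                           # centred inputs (zero mean)
eig_c = np.linalg.eigvalsh(Xc.T @ Xc / p)
print(eig_c.max())                                # no large outlier any more
```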

So it’s clear that biased inputs produce large eigenvalues and are responsible for slow convergence. To deal with this, the inputs can be centred,
or, based on our previous equations, individual learning rates can be used, inversely proportional to N (the number of inputs to the neuron under observation).

All this proves the following clearly:

  • For the loss surface, the gradient depends on the input covariance matrix and the weights, but the Hessian is constant (constant for a given input distribution, in a given direction, and only because the surface is quadratic in W); along a given principal axis (here the ith column of U) it depends only on the input covariance matrix
  • Biased inputs slow down convergence
  • Relatively large eigenvalues of the input covariance matrix (large compared to the other eigenvalues, not in absolute terms) also slow down convergence
  • The eigenvalue spectrum of the Hessian of the error surface is heavily controlled by the variance of the input component distribution

In terms of BatchNorm:

  • We know that “relatively” large eigenvalues (basically outliers) hurt convergence, and batch norm suppresses these outliers (based on the plots of the eigenspectrum in paper2), pointing to one of the reasons why BN works

Figure 6 from paper2; the y-axis shows λ_max/λ_min, where λ_max and λ_min are the largest and smallest eigenvalues of the Hessian respectively

Figure 7 from paper2, showing the eigenspectrum of the Hessian of the loss surface for both the batchnorm and non-batchnorm ResNet

  • The first part of this blog gave an analytical reason for why “relatively” large eigenvalues hurt; paper2 shows this in terms of gradient energy. Through their analysis (graphs and Appendix E), they show that without batchnorm, the direction of the computed “stochastic” gradient at timestep t is not aligned with the direction towards the optimum: it mostly lies in the subspace of the dominant eigenvectors and is mostly orthogonal to the direction towards the optimum. With batchnorm, the gradient is almost orthogonal to the dominant eigenvectors and is better aligned with the direction towards the optimum. When the gradient is aligned with the dominant eigenvectors, only those components get trained, and the others (with low eigenvalues) take time to train, because the gradient points mostly in the dominant directions and its projection onto the directions with low eigenvalues is small; hence slow convergence. As we calculated above, t = 1/(eta*λ), so for a given eta a low λ takes more time; this λ is the second derivative in the direction of the given eigenvector, and a small value means a flatter region. And these directions with small λ contribute almost 50% of the L1 energy of the Hessian eigenvalues

L1 Energy = SUMMATION(λ_i*vi², 1<=i<=N), where vi is the component of (w - w*) along the ith eigenvector

Basically, these small eigenvalues “matter” a lot to the loss; the path to the optimum includes these directions.

Figure 11 from paper2; the y-axis is the ratio of the norm of the projection of the loss gradient onto the most dominant eigenvectors to the norm of the loss gradient. Clearly the loss gradient lies dominantly in the subspace of the most dominant eigenvectors in non-batchnorm models
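As a rough illustration of the kind of quantity plotted in Figure 11 (a toy quadratic with a hand-built Hessian, not the paper’s setup), one can measure the fraction of the gradient norm that falls in the subspace of the top-k eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 50, 5
U, _ = np.linalg.qr(rng.normal(size=(N, N)))          # random orthonormal basis
eigvals = np.concatenate([rng.uniform(50, 100, k),    # a few dominant "outlier" curvatures
                          rng.uniform(0.1, 1.0, N - k)])
H = U @ np.diag(eigvals) @ U.T                        # Hessian with outlier eigenvalues
w, w_star = rng.normal(size=N), np.zeros(N)

g = H @ (w - w_star)                                  # gradient of the quadratic loss
top = U[:, :k]                                        # the k dominant eigenvectors
ratio = np.linalg.norm(top.T @ g) / np.linalg.norm(g)
print(ratio)    # close to 1: the gradient lives mostly in the dominant subspace
```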

  • The previous point was about the L1 energy of the Hessian eigenvalues (the λ_i above); now consider the energy of the gradients (gradient energy is basically depicted by the eigenspectrum of the covariance of the gradient vector). While the small eigenvalues make up roughly 50% of the L1 energy of the Hessian eigenvalues, almost all of the gradient energy is concentrated in the subspace spanned by the outliers. So the directions with small eigenvalues compose a small part of the gradient energy, but, as per the second point, they are important, which makes the entire convergence a slow process.

Figure 12 from paper2; the y-axis is the inner product of the loss gradient vector with the loss vector itself. Clearly, in the non-batchnorm network the ‘stochastic’ loss gradient is orthogonal to the actual loss direction

  • As stated earlier, convergence depends completely on these eigenvalues; the absence of relative outliers allows the use of a finely-tuned learning rate based on λ_max, hence making the network invariant to initialisation.
  • Batch norm is said to make the training of all weights happen at equal rates; this occurs because the eigenvalues have no outliers, and, as mentioned in the points above, the same learning rate (= 1/λ_max) for all parameters gives the same rate of convergence.
  • Batch norm can also be seen as reparametrizing the error surface to make it smoother in every direction.