Gradient Centralization


How can you achieve more efficient training and better regularization in your DNN model by adding just one line of code?

A recently published paper proposes a simple answer: Gradient Centralization, a new optimization technique which can be easily integrated into the gradient-based optimization algorithm you already use. All you need to do is centralize your gradient vectors so that they have zero mean!

Today I will explore the theoretical details of the proposed method. The practical approach will be discussed in my next article.

The math behind the one line of code

Notation

Let’s introduce the notation:

Weights
Loss function and gradients
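
Concretely, following the paper's notation, these are the objects we will work with:

```latex
% Notation (following the paper, arXiv:2004.01461):
%   W        -- the weight matrix of one layer, with M rows and N columns
%   w_i      -- the i-th column of W, i.e. the weight vector of a single output unit
%   L        -- the loss function
%   grad_W L, grad_{w_i} L -- gradients of the loss w.r.t. W and w_i
W \in \mathbb{R}^{M \times N}, \qquad
w_i \in \mathbb{R}^{M}, \qquad
\mathcal{L}, \qquad
\nabla_{W}\mathcal{L} \in \mathbb{R}^{M \times N}, \qquad
\nabla_{w_i}\mathcal{L} \in \mathbb{R}^{M}
```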

The Gradient Centralization operation

The GC operation is defined as follows: from every column of the gradient matrix we subtract the mean value of that column. For example, if the respective column means of a gradient matrix are 4, 3, 5 and 0, then GC subtracts 4 from every entry of the first column, 3 from the second, 5 from the third and 0 from the fourth, so that after the transformation every column has zero mean.
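
A small illustrative pair of matrices (the numbers are my own, chosen so that the column means come out as exactly 4, 3, 5 and 0) makes this concrete:

```latex
% illustrative example (not the matrix from the original figure):
% the column means of the left matrix are 4, 3, 5 and 0
\begin{pmatrix}
2 & 1 & 5 & -1 \\
4 & 3 & 5 &  0 \\
6 & 5 & 5 &  1
\end{pmatrix}
\;\longrightarrow\;
\begin{pmatrix}
-2 & -2 & 0 & -1 \\
 0 &  0 & 0 &  0 \\
 2 &  2 & 0 &  1
\end{pmatrix}
```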

In other words, we transform each gradient of the loss function with respect to a weight vector so that its mean is equal to zero. We can denote this using a fancy notation:
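
In the notation of the paper, the GC operator Φ_GC acts on the gradient of each weight vector w_i as follows:

```latex
% subtract from each gradient vector its own mean (one scalar per column of W)
\Phi_{GC}(\nabla_{w_i}\mathcal{L}) \;=\; \nabla_{w_i}\mathcal{L} \;-\; \mu_{\nabla_{w_i}\mathcal{L}},
\qquad
\mu_{\nabla_{w_i}\mathcal{L}} \;=\; \frac{1}{M}\sum_{j=1}^{M}\nabla_{w_{i,j}}\mathcal{L}
```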

Vector formulation of the GC operator

If you want this to look even more advanced, you can introduce a cool P operator and use the matrix formulation:
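
With e denoting the normalized all-ones vector, the paper writes the same operation compactly as:

```latex
% P projects onto the hyperplane orthogonal to e = (1/sqrt(M)) (1, ..., 1)^T
\Phi_{GC}(\nabla_{W}\mathcal{L}) \;=\; P\,\nabla_{W}\mathcal{L},
\qquad
P \;=\; I - e\,e^{\mathsf T},
\qquad
e \;=\; \frac{1}{\sqrt{M}}\,\mathbf{1}_{M}
```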

Matrix formulation of the GC operator

By direct matrix multiplication you can easily check that the above equation holds:
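
Writing e e^T as (1/M) 1 1^T, the product expands to:

```latex
% 1^T grad_W L is the row vector of column sums, so the subtracted term is
% a matrix whose every row equals the vector of column means
P\,\nabla_{W}\mathcal{L}
\;=\; \Bigl(I - \tfrac{1}{M}\,\mathbf{1}\,\mathbf{1}^{\mathsf T}\Bigr)\nabla_{W}\mathcal{L}
\;=\; \nabla_{W}\mathcal{L} \;-\; \tfrac{1}{M}\,\mathbf{1}\,\bigl(\mathbf{1}^{\mathsf T}\nabla_{W}\mathcal{L}\bigr)
```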

so indeed from each column we subtract the mean value of this column.
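
If you prefer a numerical sanity check, a few lines of NumPy (my own sketch, not code from the paper) confirm that the projection and the column-centering formulations agree:

```python
import numpy as np

M, N = 5, 3                         # M rows (weight dimension), N weight vectors (columns)
G = np.random.randn(M, N)           # stand-in for the gradient matrix grad_W L

e = np.ones((M, 1)) / np.sqrt(M)    # unit vector along (1, 1, ..., 1)
P = np.eye(M) - e @ e.T             # the GC projection operator P = I - e e^T

gc_by_projection = P @ G                              # matrix formulation
gc_by_centering = G - G.mean(axis=0, keepdims=True)   # subtract each column's mean

assert np.allclose(gc_by_projection, gc_by_centering)
```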

Geometrical interpretation

Now, the goal of this notation is not to make simple things more obscure, but to expose an elegant geometrical interpretation of the whole procedure. If you know a little linear algebra, you can perhaps already see that the original gradients are being projected onto a certain hyperplane in the weight space by the projection operator P.

Projection operators

To make sure that we are on the same page, let’s review some basic facts about projection operators using a simple example: consider the following matrix P and what happens when it is applied to a vector v in 3-dimensional space:
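
A matrix that does exactly what is described below, keeping the x and y components and zeroing out z, is:

```latex
% projection onto the XY plane: keep x and y, drop z
P = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix},
\qquad
P\,v \;=\; P \begin{pmatrix} x \\ y \\ z \end{pmatrix}
      \;=\; \begin{pmatrix} x \\ y \\ 0 \end{pmatrix}
```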

Example of a projection operator in a 3-dimensional space

As you can see, v loses its z component; in other words, it is projected onto the XY plane:

Projection visualization

What happens if you apply the projection operator again? Nothing: the vector already lies in the XY plane, so it cannot get projected any further. This is one of the defining properties of a projection operator:
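
In symbols:

```latex
% condition (1): idempotence -- projecting twice is the same as projecting once
P^{2} \;=\; P \qquad (1)
```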

If the projection is orthogonal, another condition must be satisfied:
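
That extra condition is symmetry:

```latex
% condition (2): an orthogonal projection operator is symmetric
P^{\mathsf T} \;=\; P \qquad (2)
```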

If a vector e is perpendicular to the projection plane, it is also perpendicular to the projection of any vector v onto that plane:
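
Written out, with e the normal vector of the plane:

```latex
% P v lies in the plane, so it is orthogonal to the plane's normal vector e
e^{\mathsf T}(P\,v) \;=\; 0 \quad \text{for every vector } v
```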

For example, for our earlier projection onto the XY plane, we can see that any vertical vector e satisfies the above equation:
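
With e = (0, 0, c)^T and P v = (x, y, 0)^T from the example above:

```latex
e^{\mathsf T}(P\,v) \;=\; 0 \cdot x + 0 \cdot y + c \cdot 0 \;=\; 0
```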

GC as a projection

Now, if you check the above-mentioned conditions (1) and (2) for the previously defined GC operator P, you will conclude that it is an orthogonal projection.
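
Both checks take a single line each, using the fact that e^T e = 1:

```latex
% condition (2) holds because I and e e^T are both symmetric;
% condition (1) follows by expanding the square and using e^T e = 1:
P^{2} \;=\; (I - e\,e^{\mathsf T})(I - e\,e^{\mathsf T})
      \;=\; I - 2\,e\,e^{\mathsf T} + e\,(e^{\mathsf T}e)\,e^{\mathsf T}
      \;=\; I - e\,e^{\mathsf T}
      \;=\; P
```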

What is more, you can check that:
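
This follows from e^T P = e^T - (e^T e) e^T = 0:

```latex
% the centralized gradient has no component along e
e^{\mathsf T}\,\Phi_{GC}(\nabla_{w}\mathcal{L}) \;=\; e^{\mathsf T} P\,\nabla_{w}\mathcal{L} \;=\; 0
```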

which means that e is a normal vector of the hyperplane on which gradients are being projected.

If we take the old weights, represented by a vector w in the weight space, and correct them by the projected gradient of the loss function (rather than the gradient itself), we obtain the following new weight vector w’:
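
With α denoting the learning rate, the GC update of a single weight vector reads:

```latex
% gradient descent step taken with the centralized (projected) gradient
w' \;=\; w - \alpha\,P\,\nabla_{w}\mathcal{L}
```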

This leads to:
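
The change in the weights is itself a projected vector:

```latex
w' - w \;=\; -\,\alpha\,P\,\nabla_{w}\mathcal{L}
```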

Multiplying both sides of the above equation by the transposed normal vector e leads to:
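
Because e^T P = 0, the right-hand side vanishes:

```latex
e^{\mathsf T}(w' - w) \;=\; -\,\alpha\,e^{\mathsf T} P\,\nabla_{w}\mathcal{L} \;=\; 0
\quad\Longrightarrow\quad
e^{\mathsf T} w' \;=\; e^{\mathsf T} w
```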

which means that for the weights computed in successive training iterations (0, 1, 2, …, t) the following is true:
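
The component of the weights along e never changes:

```latex
e^{\mathsf T} w^{0} \;=\; e^{\mathsf T} w^{1} \;=\; \dots \;=\; e^{\mathsf T} w^{t},
\qquad\text{i.e.}\qquad
e^{\mathsf T}\bigl(w^{t} - w^{0}\bigr) \;=\; 0
```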

The optimization problem can therefore be written as:
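
In the constrained form given in the paper:

```latex
% minimize the loss on the hyperplane e^T (w - w^0) = 0
\min_{w}\; \mathcal{L}(w)
\qquad\text{s.t.}\qquad
e^{\mathsf T}\bigl(w - w^{0}\bigr) \;=\; 0
```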

which, in the words of the authors of the article, means that

GC can be viewed as a projected gradient descent method with a constrained loss function.

source: https://arxiv.org/pdf/2004.01461.pdf

How does this lead to regularization?

This constraint on the weight vectors regularizes the solution space of w, leading to better generalization capacity of the trained model.

Apart from regularizing the weight space, the proposed approach also regularizes the output feature space. Given an input vector x, the output activation at step t is computed as:
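
For a single unit with weight vector w^t, and looking only at the linear part of the response (which is all the argument below needs):

```latex
% linear response of a unit with weight vector w^t to the input x
(w^{t})^{\mathsf T} x
```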

Let’s consider an input vector x’ which differs from x only by a constant intensity change γ:
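
With 1 denoting the all-ones vector:

```latex
% every component of x is shifted by the same amount gamma
x' \;=\; x + \gamma\,\mathbf{1},
\qquad
\mathbf{1} \;=\; (1, 1, \dots, 1)^{\mathsf T}
```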

It can be shown (for details check the Appendix of the original article) that:
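
For plain SGD with GC the result follows directly from the constraint derived above, since 1 is just e scaled by √M:

```latex
% the difference depends only on the initial weights w^0
(w^{t})^{\mathsf T} x' - (w^{t})^{\mathsf T} x
\;=\; \gamma\,\mathbf{1}^{\mathsf T} w^{t}
\;=\; \gamma\,\mathbf{1}^{\mathsf T} w^{0}
```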

Therefore, by choosing small initial weights we can ensure that output activations are not sensitive to intensity changes of input features.

The article also mentions that the proposed technique accelerates the training process by smoothing the optimization landscape and suppressing gradient explosion, but I will not discuss these effects here.

Next time I will show how to actually use the proposed technique.