Why do we need Normalization in Deep Learning? From Batch Normalization to Group Normalization.

Recently, Dr. Kaiming He proposed a new normalization method, Group Normalization, which has sparked widespread discussion in the Deep Learning research community and also gives me a chance to revisit why we need Normalization in Deep Learning.

This blog will focus on what normalization does in a neural network and what Batch Normalization and Group Normalization are.

Why do we need Normalization?

Problem 1: Data Distribution

Suppose our task is to predict a user's probability of opening a loan account based on their income and age.

Figure 1: The range of users' income and age

From Figure 1, we can observe that the range of income is much larger than the range of age. If we do not apply any preprocessing to these data, the income feature may dominate the model far more than the age feature.
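A simple fix is to standardize each feature so that both live on a comparable scale. Here is a toy sketch (the income and age values below are made up for illustration):

```python
import numpy as np

# Hypothetical data: income in dollars, age in years (very different ranges).
income = np.array([32_000.0, 58_000.0, 120_000.0, 45_000.0])
age = np.array([23.0, 35.0, 52.0, 41.0])

def standardize(x):
    """Rescale a feature to zero mean and unit variance."""
    return (x - x.mean()) / x.std()

features = np.stack([standardize(income), standardize(age)], axis=1)
print(features)  # both columns now have a comparable scale
```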

Problem 2: Internal Covariate Shift

Figure 2: Internal Covariate Shift (from deeplearning.ai)

For a cat classifier, we may face the problem that the data distribution differs from batch to batch. Even if there is a function that can separate the data points, it is hard for the model to fit it correctly when each mini-batch follows a different distribution. We call this phenomenon "Internal Covariate Shift".

Problem 3: Vanishing/Exploding Gradients

One of the problems in Deep Learning is how to keep gradient values stable.

For a very deep neural network, backpropagation multiplies many per-layer gradients together. If these factors are consistently too small or too large, the network will face a vanishing or exploding gradients problem. Normalization keeps each layer's values in a stable range, which reduces the vanishing/exploding gradient problem.
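A toy illustration of why this happens: multiplying many per-layer factors quickly drives the product toward zero or infinity (the 0.5 and 1.5 factors below are made up for the example):

```python
# Product of 50 per-layer gradient factors.
depth = 50
print(0.5 ** depth)  # ~8.9e-16 -> vanishing gradient
print(1.5 ** depth)  # ~6.4e+08 -> exploding gradient
```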

Normalization and Standardization

To solve the above problems, we usually use normalization and standardization techniques during data preprocessing. Today's topics, Batch Normalization and Group Normalization, apply the same ideas inside neural networks.
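Concretely, min-max normalization rescales a feature x into the range [0, 1], while standardization (the z-score) rescales it to zero mean and unit variance:

x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad x_{\text{std}} = \frac{x - \mu}{\sigma},

where \mu and \sigma are the feature's mean and standard deviation.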

Batch Normalization

Batch Normalization is an algorithm which applies standardization to each mini-batch. We calculate the mean and the variance of each mini-batch and standardize the data. Then we learn two parameters (gamma and beta) to scale and shift each data point.

Figure 3: Algorithm for Batch Normalization
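As a rough sketch (my own NumPy version, not the exact pseudocode of the figure), the training-time forward pass for a mini-batch of shape (N, D) could look like this:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch Normalization forward pass for a mini-batch x of shape (N, D)."""
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize each feature
    # gamma/beta are learned scale and shift; also return the batch statistics
    # so we can maintain running averages for test time.
    return gamma * x_hat + beta, mu, var
```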

Batch Normalization provides a really strong way to reduce the Internal Covariate Shift problem and also stabilizes the values in the network, which speeds up training.

Batch Normalization at Test Time

At test time, we do not have a mini-batch, so we cannot compute the batch mean and variance. The alternative is to keep a weighted (running) mean and variance during training and use them at test time.
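A common way to do this is an exponentially weighted average of the per-batch statistics, sketched below; the momentum value of 0.9 is a typical choice, not something fixed by the algorithm:

```python
import numpy as np

def update_running_stats(running_mu, running_var, mu, var, momentum=0.9):
    """Keep an exponentially weighted average of the batch mean/variance."""
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return running_mu, running_var

def batch_norm_test(x, gamma, beta, running_mu, running_var, eps=1e-5):
    """Test-time Batch Norm: normalize with the stored statistics."""
    x_hat = (x - running_mu) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```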

Pros and Cons

The good thing about Batch Normalization is that it really does solve the above problems and speeds up training dramatically. However, its performance depends on the batch size developers use. This is a serious problem for tasks that cannot use a large batch size, such as Object Detection and Video Classification.
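A toy way to see why small batches hurt (my own illustration, not from the paper): the smaller the batch, the noisier the batch mean that Batch Norm normalizes with.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(size=100_000)  # pretend these are one channel's activations

for batch_size in (2, 8, 32, 128):
    batch_means = [rng.choice(activations, batch_size).mean() for _ in range(1000)]
    # The spread of the estimate shrinks roughly as 1/sqrt(batch_size).
    print(batch_size, np.std(batch_means))
```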

Group Normalization

The idea of Group Normalization is to process features with group-wise normalization over channels. Classical features such as SIFT, HOG and GIST are group-wise representations by design, so it is not necessary to think of deep neural network features as unstructured vectors.

Let us review how we compute the mean and standard deviation of the data in a neural network.
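In the notation of the Group Normalization paper, each feature value x_i is normalized as

\hat{x}_i = \frac{1}{\sigma_i}(x_i - \mu_i), \qquad \mu_i = \frac{1}{m}\sum_{k \in S_i} x_k, \qquad \sigma_i = \sqrt{\frac{1}{m}\sum_{k \in S_i}(x_k - \mu_i)^2 + \epsilon},

where \epsilon is a small constant for numerical stability.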

Si is the set of pixels over which the mean and std are computed, and m is the size of this set.

In Batch Norm, the set Si is defined as:
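S_i = \{ k \mid k_C = i_C \},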

where i_C and k_C denote the sub-indices of i and k along the channel axis. In other words, values that share the same channel index are normalized together, using statistics computed over the batch and spatial axes.

In Group Norm, the set Si is defined as:
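S_i = \{ k \mid k_N = i_N,\ \lfloor k_C / (C/G) \rfloor = \lfloor i_C / (C/G) \rfloor \},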

Here G is the number of groups, which is a pre-defined hyper-parameter (the paper uses G = 32 by default). C/G is the number of channels per group, so the floor condition means that i and k belong to the same group of channels. i_N denotes the sub-index of i along the batch axis, so the statistics are computed separately for each sample.

Here is a visualization of these methods.

Group Normalization's computation is independent of the batch axis, so it is also applicable to sequential or generative models. It helps deep learning models work better with small mini-batch sizes.
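As a sketch (my own NumPy version for an NCHW feature map, not the paper's reference code), Group Normalization can be written as:

```python
import numpy as np

def group_norm(x, gamma, beta, G=32, eps=1e-5):
    """Group Normalization for a feature map x of shape (N, C, H, W)."""
    N, C, H, W = x.shape
    assert C % G == 0, "C must be divisible by the number of groups G"

    # Split the C channels into G groups and normalize within each group,
    # independently for every sample in the batch (no batch-axis statistics).
    x = x.reshape(N, G, C // G, H, W)
    mean = x.mean(axis=(2, 3, 4), keepdims=True)
    var = x.var(axis=(2, 3, 4), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)

    # Per-channel learned scale (gamma) and shift (beta), as in Batch Norm.
    x = x.reshape(N, C, H, W)
    return x * gamma.reshape(1, C, 1, 1) + beta.reshape(1, C, 1, 1)
```

Because the mean and variance are computed per sample and per group of channels, the result does not change with the batch size, which is exactly why Group Norm remains stable with small mini-batches.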

Source: Deep Learning on Medium