Source: Deep Learning on Medium

**Let’s go through the terms one-by-one and understand why BN matters.**

**1. Normalization**

Numerical features are normalized before being fed into machine learning or deep learning models so that the data becomes scale-independent.

x_new = (x_old − x_min) / (x_max − x_min)

In the case of neural networks, normalization generally speeds up the optimization process and thus accelerates training.
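The min-max formula above can be sketched in a few lines of NumPy. This is a minimal illustration (the function name and sample data are my own, not from the article):

```python
import numpy as np

def min_max_normalize(x):
    """Min-max normalization: rescale each feature column to the [0, 1] range."""
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    return (x - x_min) / (x_max - x_min)

# Features on very different scales (e.g. age in years, income in dollars)
data = np.array([[25.0, 40_000.0],
                 [35.0, 85_000.0],
                 [45.0, 120_000.0]])

scaled = min_max_normalize(data)
print(scaled.min(axis=0))  # each column's minimum maps to 0
print(scaled.max(axis=0))  # each column's maximum maps to 1
```

After scaling, both features live on the same [0, 1] range regardless of their original units.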

**2**. **Internal Covariate shift**

A quick refresher about neural networks: while training, data points are fed into the network, and the error between the expected value and the network's output is calculated. This error is then minimized using optimization techniques like SGD, AdaDelta, AdaGrad, Adam, etc.

In the context of neural networks, optimization is performed on batches of data fed into the network. If the training and test data come from different sources, they will have different distributions. Similarly, individual batches of inputs can have different distributions, i.e. a mean and variance that vary slightly from the sample (entire dataset) mean and variance.
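The point about batch statistics drifting from the dataset statistics is easy to see numerically. A small sketch (the distribution parameters and batch size here are arbitrary choices of mine, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Treat this as the "entire dataset": 10,000 samples from N(5, 2^2)
dataset = rng.normal(loc=5.0, scale=2.0, size=10_000)

print(f"dataset: mean={dataset.mean():.3f}, var={dataset.var():.3f}")

# Mini-batches drawn from that same dataset still show varying statistics
for i in range(3):
    batch = rng.choice(dataset, size=32, replace=False)
    print(f"batch {i}: mean={batch.mean():.3f}, var={batch.var():.3f}")
```

Each 32-sample batch has a slightly different mean and variance, even though all samples come from one distribution. This is exactly the batch-to-batch variation the network has to adapt to.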

The weights of the various layers of a neural network are updated during training. This means that the distribution of each layer's activations changes over the course of training, and these activations are then fed into the next layer. Now, when the input distribution of the next batch (or set of inputs) is different, the neural network is forced to adapt to the changing input batches. So, training becomes much more complex when the parameter values change continuously with new input batches. This slows down the training process by requiring lower learning rates and more careful initialization. We define *Internal Covariate Shift* as the change in the distribution of network activations due to the change in network parameters during training. Refer to the original paper for more details.

Batch Normalization is the solution to internal covariate shift, and it makes optimization less sensitive to parameter initialization.
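A minimal NumPy sketch of the batch-normalization forward pass at training time may make this concrete. Here `gamma` and `beta` stand in for the learnable scale and shift parameters from the paper; in a real framework they would be trained by backpropagation, and a running mean/variance would be kept for inference:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then apply a learnable
    scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=4.0, size=(64, 8))  # batch of 64, 8 features
out = batch_norm_forward(x, gamma=np.ones(8), beta=np.zeros(8))

print(out.mean(axis=0).round(6))  # ~0 for every feature
print(out.std(axis=0).round(3))   # ~1 for every feature
```

With `gamma = 1` and `beta = 0` the output of every feature is approximately zero-mean and unit-variance, regardless of the scale of the incoming batch.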

**3. Addressing the Internal Covariate Shift**

It is already known that training converges faster if the network inputs are whitened, i.e. linearly transformed to have zero mean and unit variance, and decorrelated. This whitening could be done at every training step or at a fixed interval. Whitening the input batches (either at every training step or at some interval) would be a step towards addressing internal covariate shift. But here is the problem that would occur.

*Excerpts from the original **paper**:*

We could consider whitening activations at every training step or at some interval, either by modifying the network directly or by changing the parameters of the optimization algorithm to depend on the network activation values (Wiesler et al., 2014; Raiko et al., 2012; Povey et al., 2014; Desjardins & Kavukcuoglu). However, if these modifications are interspersed with the optimization steps, then the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step. For example, consider a layer with the input u that adds the learned bias b, and normalizes the result by subtracting the mean of the activation computed over the training data: x̂ = x − E[x], where x = u + b, X = {x1…xN} is the set of values of x over the training set, and E[x] = (1/N) Σ xi over the entire training set. If a gradient descent step ignores the dependence of E[x] on b, then it will update b ← b + ∆b, where ∆b ∝ −∂ℓ/∂x̂. Then u + (b + ∆b) − E[u + (b + ∆b)] = u + b − E[u + b]. Thus, the combination of the update to b and the subsequent change in normalization led to no change in the output of the layer nor, consequently, the loss. As training continues, b will grow indefinitely while the loss remains fixed. This problem can get worse if the normalization not only centers but also scales the activations. We have observed this empirically in initial experiments, where the model blows up when the normalization parameters are computed outside the gradient descent step.
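The cancellation described in the excerpt can be checked numerically: if the mean is subtracted outside the gradient step, any update ∆b to the bias vanishes from the layer output. A small NumPy sketch (the input values and the ∆b = 10 step are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.normal(size=100)  # layer input over the "training set"

def centered_output(b):
    """x = u + b, then subtract E[x] computed over the data."""
    x = u + b
    return x - x.mean()

out_before = centered_output(b=0.5)
out_after = centered_output(b=0.5 + 10.0)  # bias grew by Δb = 10

# Mean-subtraction cancels any change to b: the outputs are identical,
# so the loss never changes while b grows without bound.
print(np.allclose(out_before, out_after))
```

This is why the normalization must be made part of the computation graph, so the gradient accounts for the dependence of E[x] on b.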

To address this problem, we must ensure that, for any parameter values, the network produces activations with the desired distribution, as this allows the gradient of the loss function to account for the normalization. It can also be seen that full whitening of each layer's inputs would make training considerably more expensive.

So, we make two simplifications to ease the optimization process. 1. Instead of whitening the features of the layer inputs and outputs jointly, we normalize each feature independently, to have a mean of 0 and a variance of 1. This normalization is done separately for each of the d dimensions of the input.
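This per-dimension simplification can be sketched directly: each of the d feature dimensions is normalized with its own mean and variance, with no decorrelation across dimensions (so no covariance matrix inversion, unlike full whitening). The function name and data here are illustrative assumptions of mine:

```python
import numpy as np

def normalize_per_dimension(x, eps=1e-5):
    """Normalize each of the d feature dimensions independently.
    No decorrelation across dimensions is performed (unlike whitening)."""
    mean = x.mean(axis=0)  # shape (d,): one mean per dimension
    var = x.var(axis=0)    # shape (d,): one variance per dimension
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(3)
# 256 samples, d = 3 dimensions with very different means and scales
x = rng.normal(loc=[1.0, -2.0, 7.0], scale=[0.5, 3.0, 10.0], size=(256, 3))

x_hat = normalize_per_dimension(x)
print(x_hat.mean(axis=0).round(6))  # ~[0, 0, 0]
print(x_hat.std(axis=0).round(3))   # ~[1, 1, 1]
```

Each dimension ends up with zero mean and unit variance, which is far cheaper than jointly whitening all d dimensions.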