Understanding the Math behind Batch-Normalization algorithm, part-1

Original article can be found here (source): Deep Learning on Medium

Notice the equation for the mean activation in the paper; we will arrive at (or derive) this exact formula in this post.

Before I tell you about the Batch-Norm Math, I'd like to derive a formula for data standardization (or normalization), which we typically do as a pre-processing step in most Deep-Learning tasks, and then try to CONNECT THIS STANDARDIZATION FORMULA WITH THE FORMULA OF BATCH-NORMALIZATION.

Consider an image of size (h, w, c) = (400, 600, 3), corresponding to its height, width and number of channels. Here 'c' indexes the Red, Green and Blue channels, and say we want to normalize this image.
Now, whenever we want to normalize an RGB image, we do it channel-wise, right? To do that, we need a Mean and a Standard-Deviation for each of the channels, in this case for the Red, Green and Blue channels.

If we want to calculate the Mean of the Red channel, we simply sum over all the pixels in the R channel and divide by the number of pixels. And if we want its Std-Deviation, we take the average squared difference between each pixel of the channel and the channel's Mean, followed by a square root. In Math notation, we can write it in the following way.

Formulae for calculating the Mean and Variance of the Red channel of an image.
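
The figure itself is not reproduced here. Going by the description above, the two formulae (presumably the ones referred to later as 1.A and 1.B) take roughly the following form. The notation is my own assumption: x_{i,j,R} is the Red-channel pixel at row i and column j, and h and w are the image height and width.

```latex
\begin{align}
\mu_R &= \frac{1}{h\,w} \sum_{i=1}^{h} \sum_{j=1}^{w} x_{i,j,R} \tag{1.A} \\
\sigma_R^2 &= \frac{1}{h\,w} \sum_{i=1}^{h} \sum_{j=1}^{w} \left( x_{i,j,R} - \mu_R \right)^2 \tag{1.B}
\end{align}
```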

In the above image, we have the formulae for calculating the Mean and Variance of the Red channel. However, we need the Standard-Deviation and not the Variance, so we just take the square root of the Variance to end up with the Standard-Deviation of the Red channel. Note that I have shown the formula for the Red channel only. No worries, we can apply the same formula to find the Means and Std-devs of the Blue and Green channels as well.

Once we have the Mean and Std-dev of each channel, we simply take the image, subtract from each channel its corresponding Mean and divide by its corresponding Std-dev; we end up with a normalized image.
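
As a rough illustration of the steps so far, here is a minimal NumPy sketch. The function names `channel_stats` and `normalize_image` are made up for this example, the image is random data standing in for a real photo, and the (h, w, c) layout is assumed.

```python
import numpy as np

def channel_stats(image):
    """Per-channel mean and std of an image laid out as (h, w, c)."""
    # Average over height and width only, so one value per channel remains.
    mean = image.mean(axis=(0, 1))                 # shape (c,)
    var = ((image - mean) ** 2).mean(axis=(0, 1))  # average squared difference
    std = np.sqrt(var)                             # square root of the variance
    return mean, std

def normalize_image(image):
    """Subtract each channel's mean and divide by its std."""
    mean, std = channel_stats(image)
    return (image - mean) / std  # broadcasting applies the stats channel-wise

# Random stand-in for a 400x600 RGB image (illustrative data only).
rng = np.random.default_rng(0)
image = rng.uniform(0, 255, size=(400, 600, 3))
normalized = normalize_image(image)
```

After this step every channel of `normalized` should have a mean close to 0 and a standard deviation close to 1.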

Essentially, I want to state that whenever we normalize image data, we always do it channel-wise. For example, say I pass this image into a VGG-16 network and extract a feature map of size (50,50,128) -> (h,w,c), and (for some purpose) I want to normalize this feature map: I would apply the same formula.
Needless to say, I would compute the Mean and Std-Deviation of each of the 128 channels, subtract from each channel its corresponding Mean and divide by its Std-deviation, resulting in a normalized feature map.

The above normalization formula is applied to just one image. Typically, we have a big set of RGB images (say N training images) and we want to normalize all of them.
One way to do this is to normalize each RGB image by its own Mean and Std-dev. But that would not make much sense, since every image would then be shifted and scaled differently.
So, typically, what we do is calculate the Means and Std-devs of each image and take the average of them (and call these the Global Means and Std-devs). In this case, our Global Means and Std-devs carry information from all the images. The next step is to normalize each of our images with this single set of Means and Std-devs.
We can easily wrap this step into one nice formula (2.A and 2.B), which is shown below.
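
Taking the description above literally, the averaging step could be sketched like this. `global_channel_stats` is a hypothetical helper name, the batch is random data, and all images are assumed to share the same size so they can be stacked into one (N, h, w, c) array.

```python
import numpy as np

def global_channel_stats(images):
    """Average the per-image channel means and stds over N images.

    images: array of shape (N, h, w, c).
    """
    per_image_mean = images.mean(axis=(1, 2))  # shape (N, c)
    per_image_std = images.std(axis=(1, 2))    # shape (N, c)
    # One global value per channel, obtained by averaging over the N images.
    return per_image_mean.mean(axis=0), per_image_std.mean(axis=0)

# Random stand-in for N = 8 small RGB images (illustrative data only).
rng = np.random.default_rng(0)
images = rng.uniform(0, 255, size=(8, 40, 60, 3))

global_mean, global_std = global_channel_stats(images)
# Every image is normalized with the same single set of statistics.
normalized = (images - global_mean) / global_std
```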

Please note equations 2.A and 2.B in this image and equations 1.A and 1.B from the previous image. We will club these equations together into one giant formula.

Let's place equation 1.A inside 2.A.

Here, R stands for the Red channel, and this is the Global Mean for the Red channel. I hope you now know how to calculate the Global Mean for the Blue and Green channels.
And this is the Global Standard-Deviation for the Red channel. You can use the same formula to calculate it for the Blue and Green channels.

Now comes the beautiful part: I will refactor the formula for the channel-wise Mean shown above and then jump to Batch-Normalization. (The same kind of refactoring can be done for the channel-wise Variance as well, as long as the squared differences are taken around the Global Mean. It's damn easy!)

The last equation looks exactly like the equation for the mean activation of any channel 'c'.
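
The figure is not reproduced here, so the following is only a sketch of the refactoring it describes. The notation is my own: x_{n,i,j,c} is the value at row i, column j and channel c of image n, and all N images are assumed to share the same height h and width w.

```latex
\begin{align}
\mu_c &= \frac{1}{N} \sum_{n=1}^{N} \left( \frac{1}{h\,w} \sum_{i=1}^{h} \sum_{j=1}^{w} x_{n,i,j,c} \right) \\
      &= \frac{1}{N\,h\,w} \sum_{n=1}^{N} \sum_{i=1}^{h} \sum_{j=1}^{w} x_{n,i,j,c}
\end{align}
```

Because 1/(h w) is the same constant for every image, it can be pulled out of the outer sum, which leaves a single average over all N·h·w values of channel c.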

Notice that the equation shown at the end of the picture is exactly the same as the one in the research paper.
Just to make clear that the two equations match, let's summarize where I started and how I ended up here.
1. We wanted to normalize a single RGB image (num_channels = 3). In order to do that, we needed the channel-wise Means and Std-Deviations, and we came up with a formula for them.
2. Later, we wanted to normalize not a single image but a batch of images (say N images). So we came up with a method for finding Global channel-wise Means and Std-devs, just by averaging the individual channel statistics over the batch of N images, and found a formula for that as well.
3. Finally, we saw that this formula matches the one in the Batch-Norm paper.
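
The three steps above can be checked numerically. This is a small sketch on synthetic data; the shapes and variable names are my own, and the (N, h, w, c) layout is assumed. It also shows a caveat for the Variance: averaging per-image Variances that are each taken around the image's own Mean is not quite the same as the Variance around the Global Mean.

```python
import numpy as np

rng = np.random.default_rng(0)
N, h, w, c = 16, 10, 10, 8
# Synthetic batch with a different brightness offset per image (illustrative data only).
batch = rng.normal(size=(N, h, w, c)) + rng.normal(size=(N, 1, 1, 1))

# Step 2 of the summary: average the per-image channel means over the N images.
mean_of_means = batch.mean(axis=(1, 2)).mean(axis=0)
# Refactored form: one average over all N*h*w values of each channel.
pooled_mean = batch.mean(axis=(0, 1, 2))
means_match = np.allclose(mean_of_means, pooled_mean)

# Variance around the Global Mean: averaging per image first gives the same result.
var_around_global = ((batch - pooled_mean) ** 2).mean(axis=(1, 2)).mean(axis=0)
pooled_var = batch.var(axis=(0, 1, 2))
vars_match = np.allclose(var_around_global, pooled_var)

# Per-image variances (each around its own mean) miss the spread of the per-image means.
mean_of_vars = batch.var(axis=(1, 2)).mean(axis=0)
spread_of_means = batch.mean(axis=(1, 2)).var(axis=0)
gap_explained = np.allclose(pooled_var, mean_of_vars + spread_of_means)

print(means_match, vars_match, gap_explained)
```

All three checks should print True: the Mean refactoring is exact, and the Variance refactoring is exact once the squared differences are taken around the Global Mean.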

Before concluding, let me remind you again: we wanted to normalize our data in the first place, and that's why we came up with Means and Variances (we also came up with a formula for the Std-Dev, but we won't be using it, as you will see in a while).
So, let's normalize our data with our Means and Variances and see what the formula looks like.
I have derived the equation for the normalized image and also written down the formula for Batch-Normalization.

Look! Both the input Normalization and the Batch-Normalization formulae look very similar.

From the above image, we notice that both equations look similar, except that the Batch-Norm one has a γc, a βc and an Epsilon parameter.
If we set these parameters to 1, 0 and 0 respectively, we arrive at the exact equation which I have derived.
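
To make this concrete, here is a minimal sketch of the training-time Batch-Norm computation as described in this post. The function name `batch_norm` and the (N, h, w, c) layout are my assumptions, and this is not any framework's actual implementation. With Gamma = 1, Beta = 0 and Epsilon = 0 it reduces to the plain normalization derived above.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Channel-wise Batch-Norm over a mini-batch (training-time statistics only).

    x: array of shape (N, h, w, c); gamma and beta: arrays of shape (c,).
    """
    mean = x.mean(axis=(0, 1, 2))  # one mean per channel
    var = x.var(axis=(0, 1, 2))    # one variance per channel
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Random stand-in for a mini-batch of feature maps (illustrative data only).
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 5, 5, 6))
c = x.shape[-1]

# Plain input-style normalization: subtract the mean, divide by the std.
plain = (x - x.mean(axis=(0, 1, 2))) / x.std(axis=(0, 1, 2))
# Batch-Norm with gamma = 1, beta = 0, eps = 0.
bn = batch_norm(x, gamma=np.ones(c), beta=np.zeros(c), eps=0.0)
same = np.allclose(bn, plain)
```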

Let's summarize again.
We came up with a formula for Input Data-Normalization and, by fixing some of the parameters in the Batch-Normalization equation, we ended up with the same equation. So what are the similarities and differences between Input Data-Normalization and Batch-Normalization?
There is one similarity and three differences.

Similarity 1. The channel-wise Mean and channel-wise Variance are calculated in exactly the same way as the channel-wise Means and Variances of the R, G and B channels. However, in BN, the number of channels is often large, while the width and height are comparatively small.
Difference 1. We normalize the Input data with statistics that are computed only once. In BN, however, since the mini-batches are sampled randomly, the Means and Std-devs are recalculated for every mini-batch.
Difference 2. There are three extra parameters: Gamma (a learnable scale, which plays the role of a Std-dev), Beta (a learnable shift, which plays the role of a Mean) and Epsilon (which is typically kept constant at a small value such as 0.00001, because if our Std-dev turns out to be zero, dividing by it would be a huge headache).
Difference 3, and the most important difference: Consider that we trained a classification model by normalizing the Input data and using Batch-Normalization layers between some of the Convolution layers.
Now, when it comes to inference, we need to normalize our inputs, since we did this during training. So we will use the Global Means and Std-devs obtained during the training phase.
Alright, what about the Means and Variances for our Batch-Norm layers?
We did not have a global Mean and Variance for our Batch-Norm layers; in fact, they changed with every mini-batch. So which mini-batch's Mean and Variance should we use?
This leads to one more concept, called the Moving Average and the Moving Variance. We will use this Moving Average and Moving Variance for our Batch-Norm layers.
Simply put, we accumulate the mini-batch Means and Variances over one whole epoch and divide by the number of mini-batches.
There is another way, called the exponentially weighted averaging method.
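
As a rough preview of the two options just mentioned, here is a small sketch. The function names, the momentum value of 0.9 and the zero initialization of the running value are all my own assumptions and not taken from the paper; the same functions apply to Variances as well as Means.

```python
import numpy as np

def cumulative_average(batch_stats):
    """Sum the per-mini-batch statistics and divide by the number of mini-batches."""
    return np.sum(batch_stats, axis=0) / len(batch_stats)

def exponential_moving_average(batch_stats, momentum=0.9):
    """Exponentially weighted average: recent mini-batches count more than old ones."""
    running = np.zeros_like(batch_stats[0])  # assumed starting value
    for stat in batch_stats:
        running = momentum * running + (1.0 - momentum) * stat
    return running

# Channel-wise Means of two hypothetical mini-batches with 2 channels each.
batch_means = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]

moving_mean_cumulative = cumulative_average(batch_means)
moving_mean_ema = exponential_moving_average(batch_means, momentum=0.9)
```

With only two mini-batches the exponentially weighted value is still dominated by its zero starting point; over many mini-batches that influence fades.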

These concepts will be explained in the next part of the blog!

CONCLUSION: Batch-Normalization is just like our Input Data Normalization at its core. It is the small nitty-gritty details that make it a whole new concept.

Resource I have used: https://arxiv.org/abs/1806.02375