This article covers the content discussed in Batch Normalization and Dropout module of the Deep Learning course and all the images are taken from the same module.
There are two related terms: normalizing the data and standardizing the data. Normalizing typically refers to min-max scaling. In a real-world dataset, different features/attributes can have very different ranges; for example, in the case below, the feature “dual sim” takes on a binary value while the feature “Price” runs into five digits, and so on for the other features.
If the features are in different ranges, training becomes difficult; to avoid that, we either normalize or standardize the data.
One way of doing this is min-max scaling: for every feature, we compute its minimum and maximum values, subtract the minimum from each value of that feature, and then divide by the range (maximum - minimum) of that feature. Every value then lies between 0 and 1.
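As a minimal sketch (using NumPy; the small feature matrix below is a made-up example with a binary “dual sim” column and a five-digit “Price” column, echoing the features mentioned above), min-max scaling per feature might look like:

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features
# (a binary "dual sim" column and a five-digit "Price" column).
X = np.array([[1.0, 45000.0],
              [0.0, 12000.0],
              [1.0, 30000.0]])

# Min-max scaling: shift by the per-feature minimum, divide by the range.
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_scaled = (X - X_min) / (X_max - X_min)

print(X_scaled)  # every value now lies in [0, 1]
```

Note that the scaling is done per column (per feature), which is why we pass `axis=0` to `min` and `max`.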
The other way to do this is to make the data have zero mean and unit variance. The terms normalization and standardization are often used interchangeably because we are rescaling the data to have the mean and variance of a standard normal distribution. Here, we compute the mean and the variance of each feature and then standardize the values of that feature as per the below formula:
After applying the above formula, the values of every feature are centered around 0, with most of them lying roughly in the range of -1 to 1. This is known as normalizing, or standardizing, the data.
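As a sketch of the z-score formula referenced above, x̂ = (x − μ)/σ, applied per feature with NumPy (the feature matrix is again a made-up example):

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features.
X = np.array([[1.0, 45000.0],
              [0.0, 12000.0],
              [1.0, 30000.0],
              [0.0, 20000.0]])

# Z-score standardization: subtract the per-feature mean, divide by the
# per-feature standard deviation (square root of the variance).
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / sigma

print(X_std.mean(axis=0))  # ~0 for every feature
print(X_std.std(axis=0))   # ~1 for every feature
```

After this transformation every feature has mean 0 and variance 1, regardless of its original range.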
Let’s see why standardizing the data helps us:
Say we have the 2D data below, with two features ‘x1’ and ‘x2’. One feature has a wide spread, ranging from -200 to 200, whereas the other has a small spread, ranging from 0 to 2.5. The spread here is the variance of the data.
So, the above is the case before the data is standardized; if the features are in different ranges, then during training the weights have to scale accordingly to compensate for the difference.
Let’s see what the data looks like after the standardization:
Now the data lies in the range of -2 to 2 for both of the features, and it is standardized (mean is 0 and variance is 1).
If we don’t standardize the data, then the weights for some of the features must be large to match the data. If the weights are large, the updates to them are large, and we might overshoot the minima during gradient descent. Having overshot, we need to update the weights in the other direction; since the weights are large, this update is again large, so we might overshoot the minima once more and keep oscillating like this.
So, in general, large weights mean large updates and a lot of oscillation. If the features are in different ranges, then some weights have to be large and some have to be small to account for the range difference, and for the larger weights we get this problem of overshooting the minima and oscillating.
And the way to avoid that is to standardize the data, so that all the features, and hence the learned weights, end up in the same range.
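The overshooting problem and its fix can be sketched numerically. The snippet below (a made-up example: features with the spreads mentioned above, a hypothetical linear target, and plain gradient descent on mean squared error) shows that a learning rate which is fine for the small-spread feature makes the iterates blow up on the raw data, while the very same learning rate is stable after standardization:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two features with very different spreads, as in the text:
# x1 ranges over [-200, 200], x2 over [0, 2.5].
x1 = rng.uniform(-200.0, 200.0, size=100)
x2 = rng.uniform(0.0, 2.5, size=100)
X = np.column_stack([x1, x2])
y = 0.01 * x1 + 1.0 * x2          # hypothetical target, for illustration only

def run_gd(X, y, lr, steps=100):
    """Plain gradient descent on mean squared error."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# A learning rate that suits the small-spread feature x2...
lr = 0.1

# ...makes the updates along the wide-spread feature x1 overshoot, so the
# iterates oscillate with growing amplitude and overflow to inf/nan.
with np.errstate(over="ignore", invalid="ignore"):
    w_raw = run_gd(X, y, lr)

# After standardizing the features, the very same learning rate converges.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
w_std = run_gd(Xs, y, lr)

print(np.isfinite(w_raw).all())   # False: the oscillation grows without bound
print(np.isfinite(w_std).all())   # True: stable, converging updates
```

The intuition is that the curvature of the loss along the wide-spread feature is thousands of times larger, so any step size small enough for it would make progress along the other feature painfully slow; standardizing equalizes the curvatures.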
The other reason to normalize the data is that, without it, the updates become heavily biased towards the larger weight:
Let’s say ‘w2’ is much larger than the weight ‘w1’. If we plot ‘δw1’ (delta w1, the update to w1) against ‘δw2’, then since ‘w2’ >> ‘w1’, it follows that:
δw2 >> δw1
i.e., we get a smaller update in the direction of ‘w1’ and a larger update in the direction of ‘w2’, and the resulting gradient vector looks like the one below:
We can see that the resulting vector is biased towards ‘w2’; it is almost equivalent to moving only in the direction of ‘w2’ (if we moved purely along ‘w2’, the vector would be the green one in the image below), and the pink vector lies very close to the green one. In other words, our updates are mostly in the direction of ‘w2’, i.e., biased towards the direction of the larger weight. These large and small weights arise because the original data is in different ranges: if it is not normalized, the weights scale up (or down) accordingly.
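How close the pink vector is to the green one can be quantified. In this small sketch (the update magnitudes are made-up numbers illustrating δw2 >> δw1), we compute the angle between the actual update vector and the pure-w2 direction:

```python
import numpy as np

# Hypothetical update magnitudes: the update along w2 dwarfs the one along w1.
delta_w1, delta_w2 = 0.01, 5.0
update = np.array([delta_w1, delta_w2])   # the "pink" resulting vector
w2_axis = np.array([0.0, 1.0])            # the "green" pure-w2 direction

# Angle between the actual update and the pure-w2 direction.
cos_angle = update @ w2_axis / np.linalg.norm(update)
angle_deg = np.degrees(np.arccos(cos_angle))

print(angle_deg)  # a fraction of a degree: effectively moving along w2 only
```

An angle this small means the step barely moves ‘w1’ at all, which is exactly the bias the figure illustrates.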
This is why it is not good to have differently scaled weights, especially because the model computes a linear combination (and of course we apply a sigmoid on top of that). In the summation of ‘wᵢxᵢ’, if some of the inputs are large we need to balance them with small weights, and if some of the inputs are small we need large weights. So, to avoid these problems, we standardize our data.
So far we have discussed how to standardize the input data. The next point in the story: we have some weights (initialized randomly at the start of training) connecting the inputs to the neurons in the intermediate layers, and as we pass an input through the network we compute some values at each layer (after applying non-linearities at the appropriate points).
Consider a particular intermediate layer, say ‘h2’ (let’s call it ‘h’ for now). Say we pass in the first data instance ‘x1’ (which has 3 values, one per feature in the network) and get this layer’s output ‘h1’; similarly, for the second data instance ‘x2’ we call the output ‘h2’, and so on up to the m’th data point. If the output of this intermediate layer has ‘d’ dimensions, then we can represent the ‘h’ matrix like the following:
Now the second intermediate layer (‘h2’, which we are calling ‘h’ here) is the input for the next intermediate layer, and the same argument holds: if we want to learn the weights of the next layer effectively, we must standardize the data at this ‘h2’ layer as well.
So, we standardize the data per column (i.e., per feature/dimension) at this intermediate layer, and at all of the other intermediate layers, using the same formula we used for the inputs earlier in this article. And this is exactly what batch normalization does: just as we standardize the inputs, we standardize the activation values at all of the intermediate layers.
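A minimal sketch of this idea (the m × d activation matrix here is random made-up data standing in for a layer’s outputs over a mini-batch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations at an intermediate layer for a mini-batch:
# m data points, each producing a d-dimensional output h.
m, d = 32, 4
H = rng.normal(loc=5.0, scale=3.0, size=(m, d))

# Batch normalization sketch: standardize each column (dimension) of H
# across the batch, exactly as we standardized the input features.
eps = 1e-5                      # small constant for numerical stability
mu = H.mean(axis=0)
var = H.var(axis=0)
H_hat = (H - mu) / np.sqrt(var + eps)

print(np.allclose(H_hat.mean(axis=0), 0.0, atol=1e-7))  # True
print(np.allclose(H_hat.std(axis=0), 1.0, atol=1e-3))   # True
```

Note this is only the standardization step: a full batch-norm layer additionally learns a per-dimension scale γ and shift β (and keeps running statistics for inference), which are omitted here.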