Source: Deep Learning on Medium
Regularization Techniques — a list down
Here I will discuss, from a theoretical standpoint, various regularization techniques used in the learning process.
Regularization techniques are used to reduce overfitting in the learning process. Overfitting occurs when our model learns the training data so closely that it can only handle scenarios similar to that data; it behaves very badly on anything outside the scope of the training input. An overfitted model causes a lot of problems in production. Let's learn the techniques.
One way to reduce overfitting is to reduce the number of parameters; that is the conventional approach. Instead of reducing the number of parameters, we can keep all of them but penalize the unimportant ones that do not affect the output. We penalize those parameters for being non-zero, for their very existence; that is the unconventional approach.
L2 regularization and weight decay are two different ways to represent the same thing.
As per L2 regularization, what matters is the loss: we add a penalty to it, so the larger the penalized parameters are, the larger the loss, and backpropagation pushes those parameters down to decrease the loss. In this procedure, unimportant weights shrink towards zero and become ineffective. This is how L2 regularization works.
Thus, we add the sum of the squares of the weights, multiplied by a number wd, to the loss. This number is also known as the weight decay coefficient.
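A minimal sketch of this in NumPy (the function name `l2_regularized_loss` and the hyperparameter name `wd` are illustrative, not from any library):

```python
import numpy as np

# L2 regularization: add wd * sum(w^2) to the base loss.
def l2_regularized_loss(base_loss, weights, wd=0.01):
    penalty = wd * np.sum(weights ** 2)
    return base_loss + penalty

w = np.array([1.0, -2.0, 0.5])
# 4.0 + 0.1 * (1 + 4 + 0.25) = 4.525
print(l2_regularized_loss(4.0, w, wd=0.1))
```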
Weight decay is the second way of expressing this. As per this formulation, instead of adding a penalty term to the loss, we subtract something directly from the gradients: the derivative of that penalty term, which is proportional to the weight itself. We are doing the same thing either way; if the loss is higher, backpropagation still occurs and the weights improve, and subtracting directly from the gradients enhances the weights just the same.
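The gradient-side version can be sketched like this (illustrative names; with plain SGD this update is equivalent to adding the squared-weight penalty to the loss):

```python
import numpy as np

# Weight decay: subtract wd * w inside the gradient step itself,
# rather than adding a penalty term to the loss.
def sgd_step_with_weight_decay(w, grad, lr=0.1, wd=0.01):
    return w - lr * (grad + wd * w)

w = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])
w_new = sgd_step_with_weight_decay(w, g, lr=0.1, wd=0.1)
```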
The third type of regularization technique is Dropout. This is an exciting regularization technique. As per this technique, we remove a random subset of activations. Activations are the outputs obtained by multiplying the weights and inputs, so they carry information about the input from which they were derived. If we remove a random part of the activations at each layer, no specific activation can memorize the input on its own. Therefore, there won't be any overfitting to the training input. The probability of removing any activation is decided by the machine learning practitioner.
Sometimes, rather than only removing activations, we also remove inputs. Removing inputs is certainly not the regular behavior, but depending upon your model, eliminating some inputs also helps.
- The first layer in the image represents the inputs. The subsequent layers represent the activations of those layers.
- You may see that we have removed some inputs in the first layer, as shown in the second diagram. We have also removed some of the activations in the subsequent layers. That’s how we define dropout.
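A sketch of training-time dropout on a toy activation matrix (the function name is illustrative):

```python
import numpy as np

# Training-time dropout: zero each activation independently with probability p.
def dropout(activations, p=0.5, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(activations.shape) >= p  # keep with probability 1 - p
    return activations * mask

acts = np.ones((2, 4))
dropped = dropout(acts, p=0.5)  # roughly half the entries are zeroed
```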
Now, we do not define dropout for continuous variables, because dropping a continuous variable literally deletes its value altogether. But we do define dropout for embeddings. This is because embeddings are just lookups, obtained by multiplying a one-hot encoded matrix with the input weights; therefore, we may remove some of their activations.
In pretty much every fast.ai learner, there's a parameter called ps, which is the dropout probability p for each layer. An embedding dropout is just a dropout applied to the embedding layer.
There is an interesting subtlety of dropout concerning training time and test time (also called inference time). Training time is when we're doing the weight updates, i.e., backpropagation. At training time, dropout works the way we just saw. At test time, we turn off dropout; we're not going to do dropout anymore because we want to be as accurate as possible. We're not training, so we can't cause the model to overfit while doing inference. So we remove dropout. But what that means is that if previously
p was 0.5, then half the activations were being removed. When they're all present at test time, our overall activation level is twice what it was during training. Therefore, the dropout research paper suggests scaling your weights at test time by p. This behavior is handled internally in almost all the libraries.
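The train/test asymmetry can be sketched as follows. Note that modern libraries typically use "inverted" dropout, scaling the kept activations by 1/(1-p) during training so that inference needs no rescaling at all; the end effect matches the paper's test-time scaling.

```python
import numpy as np

# Inverted dropout: rescale at training time so the expected
# activation level matches test time, where dropout is a no-op.
def dropout(x, p=0.5, training=True, rng=None):
    if not training:
        return x  # at inference, dropout does nothing
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)  # rescale the kept activations

x = np.ones(1000)
train_out = dropout(x, p=0.5, training=True)   # mean stays near 1.0
test_out = dropout(x, p=0.5, training=False)   # identical to x
```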
Batch Normalization is another important regularization technique; it helps a lot in preventing overfitting and keeping training stable.
Before understanding how to do batch normalization, let us understand how it helps with the loss. Without batch normalization, the loss surface has a lot of bumps, because activations shift away from the required result and we need to train harder to obtain the desired results. When we apply batch normalization, the loss surface becomes smooth, since it adjusts the activations towards the desired output. Batch normalization is applied per mini-batch of activations. Because of the smoother loss surface, we may apply a higher learning rate and also give momentum to our training. It was fantastic research.
Now, let us see how to perform batch normalization.
This is the algorithm, and it’s straightforward.
- The first thing is to find the mean of those activations: the sum divided by the count.
- The second thing is to find the variance of those activations: the mean of the squared differences from the mean.
- Then we normalize: the values minus the mean, divided by the standard deviation.
- Finally, we multiply the normalized activations by the gamma defined for that layer and add the beta defined for that layer.
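The four steps above can be sketched in NumPy (eps is a small constant for numerical stability; the names are illustrative):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                     # 1. mini-batch mean
    var = x.var(axis=0)                       # 2. mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # 3. normalize
    return gamma * x_hat + beta               # 4. scale by gamma, shift by beta

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
out = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
# each column of `out` now has mean ~0 and std ~1
```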
Let us understand it more simply. Consider a situation where we want to predict a movie rating in the range (0, 5). For a particular mini-batch, the output activations lie between (-1, 1), which is undoubtedly very far from the desired range. Let us define the output as a function of the inputs and weights.
f is our neural net function. Then our loss, let's say mean squared error, is just our actuals minus our predictions, squared.
L is the loss. Now, if our output activations are not in the desired range, one way to solve this is by retraining the whole network: adjusting the weights, tuning the learning rate, and a whole lot more, which is complicated and tiresome. So instead, we do the following:
We add two more parameter vectors. Now it's elementary. To increase the scale, the number g (gamma) has a direct gradient to increase the scale. To change the mean, the number b (beta) has a direct gradient to change the mean. There are no interactions or complexities. Thus, in gist we could say:
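The movie-rating example with gamma and beta (all numbers are made up for illustration):

```python
import numpy as np

# Raw activations span (-1, 1); the learnable g and b rescale them
# into the desired (0, 5) rating range without retraining the network.
y_raw = np.array([-1.0, 0.0, 1.0])  # f(x, w): network output before scaling
g, b = 2.5, 2.5                     # gamma (scale) and beta (shift)
y = y_raw * g + b
print(y)  # [0.  2.5 5. ]
```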
- Batch normalization normalizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation.
- Batch normalization adds two trainable parameters to each layer, so the normalized output is multiplied by gamma and shifted by beta.
The next kind of regularization technique is data augmentation. Data augmentation is applied to image datasets, where we apply certain types of transforms to every image: zooming, cropping, increasing the brightness, flipping, and so on. There are a lot of transformations defined; please look through the fastai documentation.
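A toy sketch of two such transforms on a tiny "image" array (real pipelines would use library transforms such as fastai's; these helper names are illustrative):

```python
import numpy as np

def flip_horizontal(img):
    # Reverse the column order: a left-right flip.
    return img[:, ::-1]

def adjust_brightness(img, factor):
    # Scale pixel values, clipping to the valid [0, 1] range.
    return np.clip(img * factor, 0.0, 1.0)

img = np.array([[0.1, 0.9],
                [0.4, 0.6]])
flipped = flip_horizontal(img)
brighter = adjust_brightness(img, 1.5)
```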