Source: Deep Learning on Medium
Hyperparameter Tuning in Neural Networks
Batch Normalization and other methods to systematically tune your hyperparameters
How to tackle hyperparameters?
If you have just gotten started with Neural Networks, or have been working with them for a while, you have surely come across situations where you had many hyperparameters to optimize. It can be tedious, and at times confusing, to figure out how to go about it. In this article we will go over a few approaches you can take to optimize those hyperparameters.
Before we settle on an approach, let's list a few important hyperparameters we often come across in Deep Learning: the learning rate α, the momentum term β, the Adam parameters β1, β2 and ε, the number of layers, the number of hidden units per layer, the learning rate decay, and the mini-batch size.
Of these, we typically leave the momentum term β set to 0.9, and the Adam parameters β1, β2 and ε set to their defaults of 0.9, 0.999 and 10⁻⁸ respectively.
Now, let’s explore how we can systematically tune the hyperparameters.
- The first systematic approach is to select random values (but within a meaningful range) for a combination of two or more hyperparameters, then train the model and check its accuracy. The question then is what "meaningful range" means, and in short it means sampling on the right scale: a learning rate, for example, is better sampled uniformly on a log scale than on a linear one.
- Say that in doing so you have figured out a range of values where your model works well. Then zoom in to that smaller region and sample more densely within it.
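The two steps above, random sampling on the right scale and then zooming in, can be sketched as follows. This is a minimal illustration using NumPy; the function name and the specific ranges are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_learning_rates(low_exp, high_exp, n):
    """Sample learning rates uniformly on a log scale,
    between 10**low_exp and 10**high_exp."""
    exponents = rng.uniform(low_exp, high_exp, size=n)
    return 10.0 ** exponents

# Coarse search: sample broadly, e.g. between 1e-4 and 1e0.
coarse = sample_learning_rates(-4, 0, 20)

# Suppose the best results fell around 1e-3 to 1e-2:
# zoom in to that smaller region and sample more densely.
fine = sample_learning_rates(-3, -2, 20)
```

Sampling the exponent uniformly (rather than the value itself) ensures that each decade, say 0.0001–0.001 and 0.01–0.1, receives an equal share of the samples.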
These approaches are heuristics rather than recipes, but it is wise to train your models on different tasks, such as NLP or Computer Vision, with the above framework in mind. Eventually you get to a point where you know what works well in which scenario.
However, there is another widely used method that makes the hyperparameter search a lot easier and more robust, and helps train very deep networks efficiently.
When training a model, normalizing the input features can speed up learning. We compute the mean of the training set and subtract it from every example, then divide each feature by its standard deviation, so that every feature has zero mean and unit variance. This works very well for a simple model like logistic regression, where you have just one layer and are only optimizing that layer's weights and biases.
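As a concrete sketch, input normalization might look like the snippet below. The important detail is that the statistics are computed on the training set only and then reused for the test set; the helper name is made up for this example.

```python
import numpy as np

def normalize_features(X_train, X_test):
    """Normalize features using mean/std computed on the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8  # small constant avoids division by zero
    return (X_train - mu) / sigma, (X_test - mu) / sigma

rng = np.random.default_rng(1)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
X_test = rng.normal(loc=5.0, scale=3.0, size=(20, 4))

X_train_n, X_test_n = normalize_features(X_train, X_test)
```

After this step every training feature has roughly zero mean and unit variance, which keeps the cost surface better conditioned for gradient descent.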
Now say you have a deeper model, where along with the input features you also have activations in every layer. Applying the same idea, if we can normalize each and every layer, the overall training of the model will be much faster.
So the question is: for any hidden layer, can we normalize the activations a[l] of the current layer l so that we can train the next layer's weights w[l+1] and biases b[l+1] faster?
This is what Batch Normalization tries to achieve. In practice we batch normalize the pre-activations z[l] (rather than a[l]) so that the following weights and biases are trained faster.
When we normalize z this way, it is converted to zero mean and unit variance. But we don't want every hidden unit to be forced into that same distribution, so we introduce two learnable parameters, γ (gamma) and β (beta), which let each unit take on a different mean and variance: ẑ = (z − μ)/√(σ² + ε), then z̃ = γ·ẑ + β. Using this technique we can batch normalize any layer in a neural network to make the model train faster.
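The normalize-then-scale-and-shift computation can be written out directly. This is a forward-pass sketch only (no gradients), with illustrative shapes: rows are hidden units, columns are examples in the mini-batch.

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    """Normalize pre-activations z (units x batch) to zero mean and unit
    variance per unit, then scale and shift with learnable gamma and beta."""
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    z_hat = (z - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * z_hat + beta             # learnable mean/variance

rng = np.random.default_rng(2)
z = rng.normal(size=(3, 32))        # 3 hidden units, mini-batch of 32
gamma = np.full((3, 1), 2.0)        # learnable scale
beta = np.full((3, 1), 0.5)         # learnable shift
z_tilde = batch_norm_forward(z, gamma, beta)
```

Note that if the network learned γ = √(σ² + ε) and β = μ, the layer would recover the identity; γ and β give the network the freedom to choose whatever distribution helps learning.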
Now, in practice —
- Batch Norm is applied per mini-batch, not directly on all "m" training examples at once: the mean and variance are computed over the current mini-batch.
- At test time you may process one example at a time, so you will not have a mini-batch, and computing a mean and variance over a single observation doesn't make sense. We therefore need a different estimate of the mean and variance at test time: during training we keep an exponentially weighted average of the mini-batch means and variances for each layer, and we use those running estimates when making predictions.
- It is recommended to use a neural network framework such as TensorFlow or PyTorch to implement Batch Norm, as these libraries have optimized implementations and their abstractions make them very user friendly.
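Putting the practical points together, a toy Batch Norm layer with running statistics might look like this. It is a sketch, not a full trainable layer (no backward pass), and the class and attribute names are invented for illustration.

```python
import numpy as np

class BatchNorm:
    """Toy Batch Norm layer that keeps exponentially weighted averages of the
    mini-batch mean and variance, used in place of batch statistics at test time."""

    def __init__(self, n_units, momentum=0.9, eps=1e-8):
        self.gamma = np.ones((n_units, 1))       # learnable scale
        self.beta = np.zeros((n_units, 1))       # learnable shift
        self.running_mu = np.zeros((n_units, 1))
        self.running_var = np.ones((n_units, 1))
        self.momentum = momentum
        self.eps = eps

    def forward(self, z, training):
        if training:
            mu = z.mean(axis=1, keepdims=True)
            var = z.var(axis=1, keepdims=True)
            # Exponentially weighted averages across mini-batches.
            self.running_mu = self.momentum * self.running_mu + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mu, var = self.running_mu, self.running_var
        z_hat = (z - mu) / np.sqrt(var + self.eps)
        return self.gamma * z_hat + self.beta

rng = np.random.default_rng(3)
bn = BatchNorm(n_units=4)
for _ in range(100):                 # "training": accumulate running statistics
    bn.forward(rng.normal(size=(4, 64)), training=True)

# At test time, a single example works because we reuse the running estimates.
single = bn.forward(rng.normal(size=(4, 1)), training=False)
```

This mirrors what the frameworks do internally (e.g. the distinction between training and evaluation modes), which is one more reason to rely on their built-in implementations in real projects.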
Reference: Deep Learning Specialization by Andrew Ng & team.