Source: Deep Learning on Medium
Deep learning has found great success in many areas, ranging from natural language processing, computer vision, and speech recognition to many applications with structured data (e.g. ads, web/application search, computer security, and logistics such as figuring out where to send drivers to pick up and drop off things). Sometimes a researcher with a lot of experience in NLP might try to do something in computer vision or another area, or vice versa. But intuitions from one domain or application area often do not transfer to others, and the best choices may depend on the amount of data, the number of input features, and whether training is on GPUs or CPUs.
When training a neural network we have to make a lot of decisions as shown in the picture below.
When we are starting on a new application, it’s almost impossible to correctly guess the right values for all of these and for other hyperparameters. So in practice, applied machine learning is a highly iterative process: we start with an idea, code it up, run an experiment, and get back a result that tells us how well this particular network or configuration works. Based on the outcome, we then refine the idea and change our choices to try to find a better and better neural network.
Setting up our data sets well in terms of train, development (dev) and test sets can make us much more efficient at this iterative process. The goal of the dev set is to test different algorithms on it and see which one works better. The goal of the test set is, given our final classifier, to give a pretty confident estimate of how well it’s doing.
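To make the two roles concrete, here is a minimal sketch, assuming toy one-dimensional data and three hypothetical threshold classifiers (none of these names come from a real library). The candidates are compared on the dev set only, and the test set is used once, for the final estimate.

```python
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[float, int]          # (feature, label): a toy stand-in for real data
Classifier = Callable[[float], int]  # maps a feature to a predicted label


def error_rate(model: Classifier, data: Sequence[Example]) -> float:
    """Fraction of examples the model labels incorrectly."""
    mistakes = sum(1 for x, y in data if model(x) != y)
    return mistakes / len(data)


def select_on_dev(candidates: Dict[str, Classifier], dev: Sequence[Example]) -> str:
    """Return the name of the candidate with the lowest dev-set error."""
    return min(candidates, key=lambda name: error_rate(candidates[name], dev))


# Hypothetical candidates: threshold classifiers with different cut-offs.
candidates = {
    "threshold_0.3": lambda x: int(x > 0.3),
    "threshold_0.5": lambda x: int(x > 0.5),
    "threshold_0.7": lambda x: int(x > 0.7),
}

dev_set = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
test_set = [(0.2, 0), (0.45, 0), (0.55, 1), (0.8, 1)]

best = select_on_dev(candidates, dev_set)            # compare candidates on dev only
test_error = error_rate(candidates[best], test_set)  # touch the test set once, at the end
print(best, test_error)
```

Keeping the test set out of the selection step is what lets its error serve as an estimate of how the chosen classifier does on unseen data.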
Several years ago, a 70/30 split for train/dev sets or a 60/20/20 split for train/dev/test sets was considered best practice in machine learning. But in the modern big data era, where we might have a million examples, the trend is for the dev and test sets to be a much smaller percentage of the total. With a million examples, 1% is still 10,000 examples, which is often enough to compare algorithms. So 98% train, 1% dev, 1% test is a pretty common split.
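A 98/1/1 split can be sketched with the standard library alone. The function name `split_indices` and its parameters are illustrative assumptions, not part of any particular framework.

```python
import random


def split_indices(n, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle range(n) and cut it into train/dev/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_dev = round(n * dev_frac)
    n_test = round(n * test_frac)
    dev = indices[:n_dev]
    test = indices[n_dev:n_dev + n_test]
    train = indices[n_dev + n_test:]      # everything left over is training data
    return train, dev, test


train, dev, test = split_indices(1_000_000)
print(len(train), len(dev), len(test))  # 980000 10000 10000
```

Splitting indices rather than the data itself keeps the sketch independent of how the examples are stored.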
One other trend we’re seeing in the era of modern deep learning is that more and more people train on mismatched train and test distributions.
Let’s say we’re building an app that lets users upload a lot of pictures, and the goal is to find pictures of cats. In this case our training set might come from cat pictures downloaded off the Internet, while the dev and test sets comprise cat pictures from users of our app. The rule of thumb to follow in this case is to make sure that the dev and test sets come from the same distribution. Finally, it might be okay to not have a test set if we only use the dev set to choose between models and do not need an unbiased estimate of how the final model performs.
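One way to follow this rule of thumb is to shuffle the app pictures as a single pool and cut the dev and test sets from it. The sketch below assumes hypothetical file names in place of real images, and `build_splits` is an illustrative helper, not an existing API.

```python
import random


def build_splits(web_examples, app_examples, seed=0):
    """Train on web data; draw dev and test from one shuffled pool of app data."""
    app = list(app_examples)
    random.Random(seed).shuffle(app)  # shuffle so dev and test share one distribution
    half = len(app) // 2
    return {
        "train": list(web_examples),
        "dev": app[:half],
        "test": app[half:],
    }


# Hypothetical file names standing in for real images.
web = [f"web_cat_{i}.jpg" for i in range(8)]
app = [f"app_upload_{i}.jpg" for i in range(4)]

splits = build_splits(web, app)
print({name: len(items) for name, items in splits.items()})
# {'train': 8, 'dev': 2, 'test': 2}
```

The training set may differ from what users upload, but the dev and test sets here both reflect the pictures the app will actually see.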
Bias and Variance
By looking at the algorithm’s error on the training set (which indicates bias) and at how much worse the error is on the dev set (which indicates variance), we can decide which things to try to improve the algorithm.
High bias?
- Bigger network, i.e. more hidden layers or more hidden units.
- Train it longer.
- Try some more advanced optimization algorithms or a better NN architecture.
High variance (aka overfitting)?
- Add more data.
- Search for a better NN architecture.
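The two checklists can be read as a simple decision procedure. The sketch below is one possible reading, not a definitive rule: it assumes bias is judged by how far training error sits above a target error, variance by the gap between dev and training error, and the tolerances of 0.05 are made-up illustrative values.

```python
def diagnose(train_error, dev_error, target_error=0.0,
             bias_tol=0.05, variance_tol=0.05):
    """Flag high bias / high variance and collect matching suggestions.

    target_error, bias_tol and variance_tol are illustrative assumptions.
    """
    high_bias = (train_error - target_error) > bias_tol    # poor fit to training data
    high_variance = (dev_error - train_error) > variance_tol  # poor generalization to dev
    suggestions = []
    if high_bias:
        suggestions += ["bigger network", "train longer",
                        "better optimizer or architecture"]
    if high_variance:
        suggestions += ["more data", "search for a better architecture"]
    return {"high_bias": high_bias,
            "high_variance": high_variance,
            "suggestions": suggestions}


print(diagnose(train_error=0.01, dev_error=0.11))  # high variance only
print(diagnose(train_error=0.15, dev_error=0.16))  # high bias only
print(diagnose(train_error=0.15, dev_error=0.30))  # both
```

In practice the target error depends on the task, so the thresholds should be set per application rather than taken from this sketch.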
In the modern deep learning/big data era, as long as we regularize appropriately, we can keep training a bigger network to reduce bias without much effect on variance, and we can keep adding more data to reduce variance without much effect on bias. With both a bigger network and more data available, we can drive both bias and variance down.