Source: Deep Learning on Medium
Formal Introduction to Generalization
Let’s break down the many temporal stages of the algorithm:
In the training stage, we continuously output training errors and use them to improve the parameters, or weights, via the Optimizer. The Loss Function is what measures the training error; in fact, it measures all the other errors as well.
In short, the Loss Function gives the distance between the actual output y* and the algorithm-generated output y. So you can imagine that during the training stage, this error gets lower and lower.
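A minimal sketch of such a loss function, using mean squared error as the "distance" (the function name `mse_loss` is just an illustrative choice, not a reference to any particular library):

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: the average squared distance between y* and y."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))

# As training progresses, predictions move toward the targets
# and this number shrinks.
print(mse_loss([1.0, 2.0, 3.0], [1.5, 2.5, 2.5]))  # 0.25
```

When the predictions match the targets exactly, the loss hits zero; the Optimizer's whole job is to push it in that direction.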
Then the next stage is the test stage, where we make sure that our algorithm really worked. Given a giant data set of thousands of training examples, you can break it down into 80% training and 20% testing, or even 50% training and 50% testing; it's all contextual. What matters is that we always test. Why? To get a generalization error, or test error.
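An 80/20 split can be sketched in a few lines (the helper name `train_test_split` here is written from scratch for illustration, though libraries offer similar utilities):

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the indices, then carve off a test portion."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(100).reshape(100, 1)
y = np.arange(100)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(len(X_train), len(X_test))  # 80 20
```

The fraction is a knob, not a law: with huge data sets you can afford a smaller test slice, while tiny data sets may push you toward 50/50 or cross-validation.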
This generalization error is an indicator of how this trained algorithm is likely to behave out in the wild.
Let’s develop some intuition about these different types of errors. We know that during training, we get the training error down to an acceptably low level. Then we test with the held-out data to get the generalization error. Almost always, the test error is bigger than the training error.
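You can see this gap in a toy experiment. Below, a degree-9 polynomial is fit to 10 noisy points drawn from a simple line (the setup is invented for illustration); because the model memorizes the training noise, its error on fresh points from the same source comes out larger:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 2 * x + 1                      # the "real" data generating function

# 10 noisy training points; a degree-9 polynomial can interpolate them exactly.
x_train = np.linspace(-1, 1, 10)
y_train = f(x_train) + rng.normal(0, 0.1, 10)

x_test = np.linspace(-0.95, 0.95, 50)        # held-out points from the same range
y_test = f(x_test) + rng.normal(0, 0.1, 50)

coefs = np.polyfit(x_train, y_train, 9)      # overfits: memorizes the noise
mse = lambda y, yhat: float(np.mean((y - yhat) ** 2))
train_err = mse(y_train, np.polyval(coefs, x_train))
test_err = mse(y_test, np.polyval(coefs, x_test))
print(train_err < test_err)  # True: the generalization error is the larger one
```

The training error here is essentially zero, yet the test error is not; that gap is exactly what the test stage exists to reveal.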
Although we use the training error to optimize the weights/parameters, the value that we really care about is the Generalization Error. How come we attempt to affect a value by indirectly changing things in another stage?
You might have intuited already that because we split our original data into two parts, we can assume improvements made on one side will carry over to the other.
The field of Statistical Learning Theory provides some answers and helps us formalize this understanding.
We basically assume that the data generating process, or what we’re calling the real data generating function, produces some sort of distribution of examples. Instead of splitting the data along one clean boundary, we randomly sample, so that the training set and the test set each cover the whole distribution.
Although we know that the 2 data sets come from the same data source, here we make the IID assumption about the 2 data sets: training and test. IID stands for independent and identically distributed. If this assumption were ever to break, don’t expect to see close generalization and training errors. Sometimes, in the midst of all the coding and too many unknowns, you can forget to check these fundamental assumptions.
With this assumption, you create one data generating process and use it for both.
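Here is a small sketch (with made-up sorted data) of how a naive split breaks the identically-distributed part of the assumption, while shuffling before splitting repairs it:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.sort(rng.normal(0, 1, 1000))  # sorted: low values first, high values last

# Naive front/back split: the first half holds all the small values.
bad_train, bad_test = data[:500], data[500:]

# Shuffled split: both halves are draws from the same distribution.
shuffled = rng.permutation(data)
good_train, good_test = shuffled[:500], shuffled[500:]

print(abs(bad_train.mean() - bad_test.mean()))    # large gap between the halves
print(abs(good_train.mean() - good_test.mean()))  # small gap
```

With the naive split, the training and test sets are describing two different populations, so there is no reason to expect the training error to predict the test error.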
So how do we finally make sure to not overfit and not under-fit? How do training errors and test errors lead to the formal understanding of Generalization?
- Training Stage gives the Training Error
- Testing Stage gives the Testing Error
- Under-fitting happens when the training error is too big
- Over-fitting happens when the difference between the training error and the testing error is too big
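The checklist above can be sketched as a toy diagnostic function (the thresholds here are illustrative placeholders, not standard values):

```python
def diagnose(train_err, test_err, err_threshold=0.1, gap_threshold=0.1):
    """Toy heuristic: classify a fit from its training and test errors."""
    if train_err > err_threshold:
        return "under-fitting"      # the training error itself is too big
    if test_err - train_err > gap_threshold:
        return "over-fitting"       # the generalization gap is too big
    return "ok"

print(diagnose(0.50, 0.55))  # under-fitting
print(diagnose(0.02, 0.40))  # over-fitting
print(diagnose(0.02, 0.05))  # ok
```

Note the ordering: under-fitting is ruled out first, which mirrors the advice in the next paragraph of getting the training error small before worrying about the gap.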
You can probably see that you must first make sure that the training error is small. Once that’s achieved, you keep the training error small and fixed, at least for the next stage: testing.
These error measures are primarily the “performance measure P”.
So far we’ve touched on the experience E: the data points or examples, sitting in the design matrix along with the labels. We touched on the fact that the Optimizer changes a bunch of learned parameters using the loss score from the Loss Function, thereby touching on the performance measure P. The last piece is the task T, which is really important to know, because you shouldn’t think of machine learning as a solve-anything-and-everything magic pill. It’s really only good at a few categories of tasks.
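As a tiny preview of that Optimizer loop, here is one weight being nudged by the loss score via plain gradient descent (the single-example setup and the learning rate of 0.1 are arbitrary choices for illustration):

```python
# One learnable weight; we want w * x to match y_true.
w, lr = 0.0, 0.1
x, y_true = 2.0, 4.0

for _ in range(50):
    y_pred = w * x
    grad = 2 * (y_pred - y_true) * x  # derivative of the squared-error loss w.r.t. w
    w -= lr * grad                    # the update step that lowers the loss

print(round(w, 3))  # w converges toward 2.0, since 2.0 * 2.0 == 4.0
```

Every pass uses the loss (via its gradient) to adjust the parameter, which is the E-and-P machinery described above in miniature.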
Before we look at the categories of tasks it’s good at and the amazing applications in the field, let’s understand how the Optimizer works.