Source: Deep Learning on Medium

# Formal Introduction to Generalization

Let’s break down the stages the algorithm moves through over time:

- Training
- Testing
- Usefulness

In the training stage, we continuously output **training errors**. The **Optimizer** uses these to improve the **parameters**, or **weights**. The **Loss Function** is what measures the **training error** — in fact, it measures all the other errors as well.

In short, the **Loss Function** gives the distance between the actual output **y*** and the algorithm-generated **y**. So you can imagine that during the training stage, this error gets lower and lower.
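The idea above can be sketched in a few lines: a loss function (mean squared error here, one common choice) measuring the distance between **y*** and **y**, and an optimizer (plain gradient descent) using it to improve a single weight. All the names (`mse`, `w`, `lr`) are illustrative, not from the article.

```python
# A minimal sketch: MSE as the loss, a few steps of gradient descent
# on a toy linear model with one weight.

def mse(y_star, y):
    """Distance between the actual outputs y* and the predictions y."""
    return sum((a - b) ** 2 for a, b in zip(y_star, y)) / len(y_star)

# Toy data generated by y* = 2x
xs = [1.0, 2.0, 3.0, 4.0]
y_star = [2 * x for x in xs]

w = 0.0      # the single weight/parameter the optimizer will improve
lr = 0.01    # learning rate

losses = []
for step in range(20):
    y = [w * x for x in xs]
    losses.append(mse(y_star, y))
    # Gradient of the MSE with respect to w drives the update
    grad = sum(2 * (w * x - t) * x for x, t in zip(xs, y_star)) / len(xs)
    w -= lr * grad

# The training error gets lower and lower as training proceeds
print(losses[0], losses[-1])
```

Running this prints a large initial loss followed by a much smaller final one, which is exactly the shrinking training-error curve described above.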

Then the next stage is the **test stage**, where we just want to make sure that our algorithm really worked. Given a giant data set of thousands of training examples, you can split it into 80% training and 20% testing, or even 50% training and 50% testing — it’s all contextual. What matters is that we always test. Why? In order to get a **generalization error**, or **test error**.
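A split like the 80/20 one mentioned above can be sketched with just the standard library; the exact ratio is a free choice, as noted.

```python
# A minimal sketch of an 80/20 train/test split.
import random

random.seed(0)
data = list(range(1000))          # stand-in for thousands of examples
random.shuffle(data)              # shuffle so the split isn't ordered

split = int(0.8 * len(data))
train, test = data[:split], data[split:]

print(len(train), len(test))      # 800 200
```

The shuffle matters: without it, any ordering in the original data (by time, by class, etc.) would leak into the split.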

This **generalization error** is an indicator of how this trained algorithm is likely to behave out in the wild.

Let’s develop some intuition about these different types of errors. We know that during training, we get our training error to an acceptably low state. Then we test on the held-out data to get the generalization error. Often — in fact, almost always — the test error is bigger than the training error.
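An extreme but instructive way to see this gap: a 1-nearest-neighbor model simply memorizes its training set, so its training error is exactly zero, while on held-out data some error must appear. This is a hedged sketch with made-up data, not anything from the article.

```python
# Training error vs. test error with a model that memorizes (1-NN).
import random

random.seed(1)

def noisy_target(x):
    return 2 * x + random.gauss(0, 0.5)   # true function plus noise

data = [(x / 10, noisy_target(x / 10)) for x in range(100)]
random.shuffle(data)
train, test = data[:80], data[80:]

def predict(x):
    # 1-NN: return the label of the closest training point
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mse(split):
    return sum((predict(x) - y) ** 2 for x, y in split) / len(split)

train_error, test_error = mse(train), mse(test)
print(train_error, test_error)
```

Every training point’s nearest neighbor is itself, so `train_error` comes out to zero — yet `test_error` does not, which is the gap described above in its purest form.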

Although we use the training error to optimize the weights/parameters, the value that we really care about is the **Generalization Error**. Why do we expect to improve a value in one stage by changing things in another?

You might have intuited already that because we split our original training data into two parts, we can expect improvements made on one side to carry over to the other.

The field of Statistical Learning Theory provides some answers and helps us formalize this understanding.

We basically assume that the data generating process, or what we’re calling the real **data generating function**, produces some sort of distribution.
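The key assumption can be made concrete: both the training set and the test set are drawn independently from the same data generating distribution, which is why improvements on one side carry over to the other. The Gaussian below is just an illustrative choice of distribution, not something specified in the article.

```python
# A minimal sketch of the i.i.d. assumption: train and test examples
# come from the *same* data generating function.
import random

random.seed(42)

def data_generating_function():
    x = random.gauss(0, 1)            # input drawn from the distribution
    y = 3 * x + random.gauss(0, 0.1)  # output: true function plus noise
    return x, y

# Both splits are samples from the same underlying process
train = [data_generating_function() for _ in range(800)]
test = [data_generating_function() for _ in range(200)]

print(len(train), len(test))
```

Because both splits are samples from one distribution, their statistics (and therefore the errors measured on them) track each other — which is the bridge between training error and generalization error that Statistical Learning Theory formalizes.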