Source: Deep Learning on Medium
Fitting and Generalization
Connecting the stories behind each stone (the understanding you gained from them) into one grand story or theory is what we call fitting a model, in technical terms.
So, for instance, when you come up with a theory for why your air conditioner is not cooling the room as usual, you are actually fitting a model to the under-performance of your A/C. This fitting happens as you try to fit all the evidence you gathered from inspecting the A/C into one convincing theory, just like assembling a jigsaw puzzle. In layman’s terms, you learned the reason behind the A/C breakdown, and the interesting part is that there can be more than one explanation!
Mathematically, learning is fitting a line to your data points, or pieces of evidence. And there can be more than one possible line!
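To make "fitting a line" concrete, here is a minimal sketch using hypothetical data: noisy points scattered around a straight line, with ordinary least squares recovering the slope and intercept. The numbers are made up for illustration.

```python
import numpy as np

# Hypothetical "evidence": noisy points scattered around the line y = 2x + 1
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 2 * x + 1 + rng.normal(scale=1.0, size=x.size)

# A degree-1 polynomial fit returns the slope and intercept of the best-fit line
slope, intercept = np.polyfit(x, y, deg=1)
print(f"fitted line: y = {slope:.2f}x + {intercept:.2f}")
```

Note that the fitted slope and intercept land near, but not exactly on, the true values: the line summarizes the evidence without reproducing every point.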
The fitted line or model need not pass through every data point perfectly; that would be overfitting. This means some information from each data point is always left out, and only such a model will have good generalization performance. So what is generalization performance?
Let’s play Sherlock again. Imagine you cracked a murder case where the culprit was the one who smoked an expensive limited-edition cigarette. Now you are assigned a second murder case with two suspects who smoke: one smokes a regular cigarette and the other an expensive limited edition. You cannot conclude that the criminal is the latter, even though all the other circumstances of the crime resemble the first case you cracked. This is the moment a data scientist would say your model or theory has poor generalization performance. That is, the model you fitted, or learned, from the first case cannot be applied to the second. The first model is too closely adapted to the first case (overfitted): you went so deep into the details of each piece of evidence that the probability of another case having an identical set of details is negligible. Had you built a general enough model for the first case, you could also apply it to the second and, to a great extent, get convincing predictions about who the prime suspects should be.
What you need is a crime-pattern theory that successfully explains the first crime and can also explain future crimes to a great extent. Such a theory or model is general in nature and overfits far less.
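The same contrast can be sketched numerically. Below, on hypothetical noisy data from a straight line, a wiggly degree-9 polynomial passes through every training point (like the over-detailed crime theory) while a simple line does not; on fresh, unseen points the simple line does better. All data here is synthetic and illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical "first case": 10 noisy training points around y = 2x + 1
x_train = np.linspace(0, 10, 10)
y_train = 2 * x_train + 1 + rng.normal(scale=2.0, size=x_train.size)
# Hypothetical "second case": unseen points from the same underlying line
x_test = np.linspace(0.5, 9.5, 10)
y_test = 2 * x_test + 1 + rng.normal(scale=2.0, size=x_test.size)

line = np.polyfit(x_train, y_train, deg=1)    # a general model
wiggle = np.polyfit(x_train, y_train, deg=9)  # touches every training point

def mse(coeffs, x, y):
    """Mean squared error of a polynomial model on the given points."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

print("train error:", mse(line, x_train, y_train), mse(wiggle, x_train, y_train))
print("test error: ", mse(line, x_test, y_test), mse(wiggle, x_test, y_test))
```

The overfitted polynomial wins on the training points (near-zero error) but loses badly on the held-out points, which is exactly what "poor generalization performance" means.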
This philosophy works great if you apply it to your own learning. What you learn is that line.
The line is the physical representation of your knowledge.
This line tells us that to gain better understanding than bookworms, the overfitters, one must extract only the core ideas from each data point. You fit a line for physics when you learn physics chapters, and likewise for chemistry and biology. Thus, we can call them the physics line, the chemistry line and the biology line.
Remember, each data point is a stepping stone for your line, and every stepping stone is just part of the story, NOT the whole story.
If you focus on just a few data points and fit a line, chances are you will be highly biased, as the line would capture only part of the reality or knowledge you are seeking. This is called underfitting.
In the murder-investigation scene, underfitting would mean collecting just one or two pieces of evidence and reporting the murder as mob lynching when it was actually a one-man crime.
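Underfitting can be sketched the same way: a model that is too simple for the data misses the pattern entirely. Here, on made-up data with an obvious linear trend, a constant (degree-0) model is the "one or two pieces of evidence" theory, while a line captures the whole story.

```python
import numpy as np

# Hypothetical data with a clear linear trend and no noise: y = 3x + 2
x = np.arange(10, dtype=float)
y = 3 * x + 2

const = np.polyfit(x, y, deg=0)  # underfit: just the mean of y
line = np.polyfit(x, y, deg=1)   # captures the whole trend

def mse(coeffs, x, y):
    """Mean squared error of a polynomial model on the given points."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

print("constant-model error:", mse(const, x, y))
print("line-model error:    ", mse(line, x, y))
```

The constant model's large error is the numeric face of a biased, underfitted theory: it explains none of the trend the data clearly contains.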
Note: The process of making a line overfit less than before is known, in technical terms, as regularization. In layman’s terms, it is the process of making the line or model more general, and hence can be called generalization.
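One standard way to regularize a line fit, shown here as an illustrative sketch on hypothetical data, is ridge regression: it adds a penalty on the size of the coefficients, pulling an overly flexible polynomial back toward a more general shape. The closed form is w = (XᵀX + λI)⁻¹Xᵀy, where λ controls how strongly we regularize.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical noisy data around the line y = 2x + 1
x = np.linspace(0, 1, 12)
y = 2 * x + 1 + rng.normal(scale=0.3, size=x.size)

# Degree-8 polynomial features: flexible enough to overfit 12 points
X = np.vander(x, 9)

def ridge(X, y, lam):
    """Ridge regression via its closed form: w = (X^T X + lam*I)^-1 X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_over = ridge(X, y, lam=0.0)  # no penalty: free to fit the noise
w_reg = ridge(X, y, lam=1.0)   # penalized: smaller, more general weights

print("unregularized weight norm:", np.linalg.norm(w_over))
print("regularized weight norm:  ", np.linalg.norm(w_reg))
```

The penalty shrinks the weights, which is the numeric version of "making the line more general": the model is no longer allowed to contort itself to touch every data point.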