Original article was published on Deep Learning on Medium
Overfitting and Diversity of Inputs
Learn instead of regurgitating by generating diversified training data.
A common issue AI researchers face as machine learning methods grow increasingly complex — for example, with the rise of huge neural networks with millions of parameters — is overfitting. Overfitting occurs when a model can hold so much information that, in its desperate drive toward lower error, it simply memorizes the answer to each specific combination of inputs and retrieves it whenever that exact combination appears.
Overfitting means that the model cannot handle any variation on a problem, only the exact problem itself. More often than not, what humans believe to be ‘learning’ is disguised overfitting. From memorizing vocabulary to regurgitating facts, our learning is often not learning so much as storing a combination of inputs and its answer in short-term memory. Because the material is stored as just another dictionary entry rather than genuinely learned, it is forgotten shortly after.
To combat overfitting, researchers add more data. Eventually, the dataset grows so large that the model can no longer store it all in memory; the record-and-regurgitate strategy then produces a high error, forcing the algorithm to learn the underlying relationships and generalize. In the context of AI, millions (in certain contexts, billions or even trillions) of training examples are collected. Where data collection is difficult, data scientists generate new data by adding variations to the existing training data in ways that are reasonable and faithful to the context.
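As a minimal sketch of what this kind of data generation can look like in practice (the `augment` function and the noise-jitter strategy here are illustrative choices, not from the original article), each numeric training example is copied a few times with small random perturbations, so the model sees variations rather than one fixed input:

```python
import random

def augment(examples, n_variants=3, noise=0.05, seed=0):
    """Return the original (features, label) pairs plus noisy copies.

    Each variant perturbs every feature by a small random factor,
    a variation that stays faithful to the original example.
    """
    rng = random.Random(seed)
    augmented = list(examples)
    for features, label in examples:
        for _ in range(n_variants):
            jittered = [x * (1 + rng.uniform(-noise, noise)) for x in features]
            augmented.append((jittered, label))
    return augmented

dataset = [([1.0, 2.0], "a"), ([3.0, 4.0], "b")]
bigger = augment(dataset)
print(len(bigger))  # 2 originals + 2 examples * 3 variants = 8
```

The label is kept unchanged while the inputs vary, which is the defining property of this style of augmentation: the model can no longer match an exact input, only the relationship between inputs and the label.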
In human learning, millions of data points are hard to come by. Fortunately, human minds are also more complex, and we have the advantage of abstract concepts. Still, more data is necessary. Consider Tim, who is studying for a twenty-word vocabulary test: the study guide his teacher prepares yields exactly twenty questions and answers. In this context, it would be ridiculous to add new words in the name of ‘adding more data’ because they will not be tested. Instead, Tim should resort to data generation, or input diversification — creating additional questions from the ones currently available.
“abundant — plentiful and rich in quantity”, Tim reads from the study guide his teacher gives him. From this, and in combination with other word definitions, Tim creates additional questions:
- “There is an __________ of apples on the farm. We will have lots of cider.”
- “A synonym for plentiful is…?”
- “Unscramble udaantbn.”
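Tim's three question types can be sketched as a small generator. This is a hypothetical illustration (the `make_questions` function and its templates are assumptions, not part of the article): given one word-definition pair from the study guide, it produces a definition-recall question, a usage prompt, and an unscramble puzzle.

```python
import random

def make_questions(word, definition, seed=0):
    """Turn one (word, definition) study-guide entry into
    three diversified practice questions."""
    rng = random.Random(seed)
    letters = list(word)
    rng.shuffle(letters)
    return [
        f"What word matches the definition: '{definition}'?",
        f"Write a sentence using a form of '{word}'.",
        f"Unscramble: {''.join(letters)}",
    ]

questions = make_questions("abundant", "plentiful and rich in quantity")
for q in questions:
    print(q)
```

Running the generator over all twenty study-guide entries would turn twenty question-answer pairs into sixty varied ones, without ever adding a word that will not be tested.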
By creating additional training data, Tim gains a deeper understanding of the word. The first generated question gives him practice using different forms of the word in a sentence so that it makes sense. The second tests his knowledge of the definition. The last tests his ability to spell the word and recognize it, among the other nineteen words on the test, from an unordered list of its letters.
By using generated, diversified training data, Tim not only learns faster and has more fun than his friends (who have been staring at the study guide or using simple flashcards), but he also develops strong retention and understanding of the word.