How to squeeze out more from your data when training an AI model


All of us have heard phrases like "Data is the new oil." It's also well known that most deep learning models are pretty data hungry, and getting appropriate (labeled) data is tough and expensive. Given all this, a natural thing to do is to squeeze out as much as possible from the data you already have. I'll go over some techniques that help us do that.

Following is a list of items that I’ll cover in this blog.

  1. Augmentation
  2. Transfer Learning
  3. Semi Supervised Learning / Pseudo Labeling
  4. Simulation

Please note that NOT all of these techniques are generic; whether they apply depends on your problem area. I'll point out some use cases where each one is applicable as I dive into them.

Augmentation

This technique is very heavily used in Computer Vision. Let's take image classification as an example: a picture of a car should still be classified as a car even if we apply various transformations to the original image, such as flipping, rotating, changing the lighting, or converting to grey scale. It turns out that feeding all these variants of the input image provides incremental information to the model and improves performance.

Figure 1: Original Image
Figure 2: Augmented Images
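Here is a minimal sketch of what this looks like in practice using torchvision transforms. The specific transform parameters and the data path are illustrative assumptions, not from the original post.

```python
# Image augmentation sketch with torchvision (assumed setup, not the post's code).
import torchvision.transforms as T
from torchvision.datasets import ImageFolder

train_transforms = T.Compose([
    T.RandomHorizontalFlip(p=0.5),      # flip
    T.RandomRotation(degrees=15),       # rotate
    T.ColorJitter(brightness=0.3),      # change lighting
    T.RandomGrayscale(p=0.1),           # occasionally convert to grey scale
    T.ToTensor(),
])

# Augmentations are applied on the fly each time an image is loaded,
# so the model sees slightly different variants every epoch.
train_data = ImageFolder("data/cars", transform=train_transforms)  # hypothetical path
```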

The use of augmentation is not limited to Computer Vision (where it's a routinely used, fundamental concept); it can be applied in other areas too. For instance, in NLP some of the augmentation methods that work are:

  • Synonym Replacement: Replace some randomly chosen words in a sentence with their synonyms (see the sketch after this list).
  • Random Word Deletion: Delete some randomly chosen words from a sentence.
  • Intermediate Language Translation: Translate text in a certain language, say English, to an intermediate language, say French, and then back to the source language (English). The rationale is that the final English sentence will be worded slightly differently from the original one. This cool trick was used by some of the top teams in the Kaggle Toxic Comment Classification challenge.
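Below is a minimal sketch of the first two methods. It assumes NLTK's WordNet corpus is available (via nltk.download("wordnet")); the function names and parameters are illustrative, not from the original post.

```python
# Text augmentation sketch: synonym replacement and random word deletion.
import random
from nltk.corpus import wordnet

def synonym_replacement(sentence, n=2):
    """Replace up to n randomly chosen words with a WordNet synonym."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    for i in candidates[:n]:
        lemmas = {l.name().replace("_", " ")
                  for s in wordnet.synsets(words[i]) for l in s.lemmas()}
        lemmas.discard(words[i])
        if lemmas:
            words[i] = random.choice(sorted(lemmas))
    return " ".join(words)

def random_deletion(sentence, p=0.1):
    """Delete each word with probability p (keep at least one word)."""
    words = [w for w in sentence.split() if random.random() > p]
    return " ".join(words) if words else sentence

print(synonym_replacement("the quick brown fox jumps over the lazy dog"))
print(random_deletion("the quick brown fox jumps over the lazy dog"))
```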

Augmentation is a very powerful technique: it only requires some creative thinking about which transformations make sense in your problem area, and it has no dependency on any external data.

Transfer Learning

This concept is again heavily used in Computer Vision, is picking up in NLP thanks to the success of ELMo and BERT, and may be applicable in other areas as well. The core idea is this: rather than training a deep neural network from scratch (with the small amount of data that you have), fine-tune an existing model that was trained for a related task (using a large amount of publicly available data).

For instance, ImageNet is an openly available image dataset containing roughly 14M images spanning about 20K classes. Even if the class you are trying to classify an image into doesn't exist in the ImageNet data and you only have a few hundred images, you'll be better off using a model pre-trained on ImageNet as the starting point; by fine-tuning just the last few layers on your few hundred images you can achieve great performance.

Figure 3: Transfer Learning Process

The reason Transfer Learning works is that in a deep neural network with dozens of layers, the earlier layers learn more fundamental patterns (for example, identifying lines and edges), while the later layers learn more task-specific patterns (for example, the key characteristics that define a cat in a picture). So fine-tuning just the last few layers on your task-specific data does the trick.
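A minimal sketch of this with a torchvision ResNet pre-trained on ImageNet is shown below. The number of classes and the choice to freeze everything except the final layer are illustrative assumptions (and the weights argument assumes a recent torchvision version).

```python
# Transfer learning sketch: reuse an ImageNet-pretrained backbone,
# freeze it, and retrain only a new final layer.
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5  # hypothetical: a few hundred images spread over 5 custom classes

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the earlier layers: they already capture generic patterns
# such as lines and edges.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with one sized for our classes;
# only this new layer (and anything we explicitly unfreeze) gets trained.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
```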

Semi Supervised Learning / Pseudo Labeling

This is a pretty widely applicable technique and works well with all types of data: images, text, tabular, etc. The only wrinkle is that to apply it you need unlabeled data (which is often a lot easier and cheaper to get) in addition to some labeled data. This technique is very often used by top teams on Kaggle to squeeze out additional performance.

The way it works is as follows: you first train a (relatively smaller) model on all of your labeled data; this is called the "Teacher" model. The Teacher model is then used to make predictions on the large amount of unlabeled data that you also have. You take the high-confidence predictions, sometimes adding some noise, and use them along with your original labeled data to train a bigger "Student" model. The resulting Student model performs better than the Teacher model, which was trained only on the original labeled data. Here is a paper with more details.

Figure 4: Semi Supervised Learning
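Here is a minimal sketch of the idea using scikit-learn. The confidence threshold, the model choices, and the variable names are illustrative assumptions, not the method from the paper referenced above.

```python
# Pseudo-labeling sketch: teacher labels unlabeled data, student trains on both.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pseudo_label_training(X_labeled, y_labeled, X_unlabeled, threshold=0.9):
    # 1. Train the "Teacher" on the labeled data only.
    teacher = RandomForestClassifier(n_estimators=100, random_state=0)
    teacher.fit(X_labeled, y_labeled)

    # 2. Predict on the unlabeled data and keep only high-confidence predictions.
    probs = teacher.predict_proba(X_unlabeled)
    confident = probs.max(axis=1) >= threshold
    pseudo_X = X_unlabeled[confident]
    pseudo_y = teacher.classes_[probs[confident].argmax(axis=1)]

    # 3. Train a (typically bigger) "Student" on labeled + pseudo-labeled data.
    student = RandomForestClassifier(n_estimators=500, random_state=0)
    student.fit(np.vstack([X_labeled, pseudo_X]),
                np.concatenate([y_labeled, pseudo_y]))
    return student
```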

Simulation

For certain problems, typically in Reinforcement Learning but sometimes in other areas too, one can generate data in a simulated environment and use that for training or testing. For example, DeepMind's AlphaStar achieved Grandmaster status in StarCraft II by training on tens of thousands of years of virtual (simulated) gaming experience.

Another example where simulated environments are used for training or validation is models for self-driving cars.

There are also many frameworks that help run simulated environments, for example Habitat by Facebook, Behavior Suite for Reinforcement Learning by Google, and Gym from OpenAI.
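As a small illustration, here is a sketch of collecting simulated experience with OpenAI Gym. The CartPole environment and the random policy are just placeholders, and the reset/step signatures assume the classic (pre-0.26) Gym API.

```python
# Simulation sketch: generate experience tuples from a simulated environment.
import gym

env = gym.make("CartPole-v1")
observation = env.reset()

episode_data = []
for _ in range(200):
    action = env.action_space.sample()            # stand-in for a real policy
    observation, reward, done, info = env.step(action)
    episode_data.append((observation, action, reward))  # simulated experience
    if done:
        observation = env.reset()

env.close()
```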

Summary

While more data is almost always better from the perspective of model training, it can come at a big cost in time and money. The least one could do is to use the data one already has as efficiently as possible, and the techniques described above are some of the ways to do that.