Self-training with Noisy Student

What is self-training?

Self-training is one of the simplest semi-supervised methods. The main idea is to find a way to augment the labeled dataset with the unlabeled dataset; after all, getting labeled data is very costly, and annotating data is mundane.

Self-training first uses the labeled data to train a good teacher model, then uses the teacher model to label the unlabeled data. Since not all of the teacher's predictions on the unlabeled data will be good, classical self-training selects a subset of this unlabeled data by filtering the predictions (aka pseudo-labels) with a confidence threshold on the score. This subset is then combined with the original labeled data, and a new model, the student model, is trained on the combined data. The whole procedure can be repeated several times until convergence is reached.
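To make the loop concrete, here is a minimal sketch of classical self-training using scikit-learn. The synthetic data, the logistic-regression models, the 0.9 confidence threshold, and the three rounds are illustrative choices, not settings from the paper.

```python
# A minimal sketch of classical self-training with scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_lab, y_lab = X[:200], y[:200]   # small labeled set
X_unlab = X[200:]                 # the rest is treated as unlabeled

threshold = 0.9
teacher = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

for round_idx in range(3):        # in practice, repeat until convergence
    # Teacher pseudo-labels the unlabeled data; keep only confident predictions.
    probs = teacher.predict_proba(X_unlab)
    confidence, pseudo = probs.max(axis=1), probs.argmax(axis=1)
    keep = confidence >= threshold

    # Student trains jointly on labeled + confident pseudo-labeled data.
    X_comb = np.vstack([X_lab, X_unlab[keep]])
    y_comb = np.concatenate([y_lab, pseudo[keep]])
    student = LogisticRegression(max_iter=1000).fit(X_comb, y_comb)

    teacher = student             # the student becomes the next round's teacher
```

The only real knobs here are the confidence threshold and how many rounds you run; everything else is ordinary supervised training done twice.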

A nice approach to make use of the unlabeled data. But the paper emphasizes something else, a noisy student. What's the deal with that? Is it different from the classical approach?

Yes, it is. The authors found that for this method to work at scale, the student model should be noised during its training, while the teacher model should not be noised when it generates the pseudo-labels. Noise is such an important piece of the whole idea (we will look into it in detail in a minute) that they called the method Noisy Student.

The Algorithm

The algorithm is similar to classical self-training with some minor differences.

[Figure: the Noisy Student algorithm]

The main difference is the addition of noise to the student using techniques like dropout, stochastic depth, and data augmentation. It should be noted that the teacher is not noised when it generates the pseudo-labels.
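Here is a rough PyTorch sketch of that asymmetry. The tiny linear models, the fake 32x32 images, and the single training step stand in for the EfficientNet setup used in the paper; they are placeholders, not the paper's code.

```python
import torch
import torch.nn as nn
from torchvision import transforms

augment = transforms.RandAugment()              # input noise for the student only

teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
student = nn.Sequential(                        # higher capacity than the teacher,
    nn.Flatten(),                               # with model noise (dropout); the
    nn.Linear(3 * 32 * 32, 256), nn.ReLU(),     # paper also uses stochastic depth
    nn.Dropout(p=0.5),                          # inside the network blocks
    nn.Linear(256, 10),
)
optimizer = torch.optim.SGD(student.parameters(), lr=0.1)

images = (torch.rand(8, 3, 32, 32) * 255).to(torch.uint8)   # fake unlabeled batch

# 1) Teacher generates pseudo-labels on clean images: eval mode, no augmentation.
teacher.eval()
with torch.no_grad():
    pseudo = teacher(images.float() / 255.0).argmax(dim=-1)

# 2) Student is trained on augmented images with dropout active.
student.train()
noised = torch.stack([augment(img) for img in images]).float() / 255.0
optimizer.zero_grad()
loss = nn.functional.cross_entropy(student(noised), pseudo)
loss.backward()
optimizer.step()
```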

Are you mad? You are saying that just adding dropout, which of course no one would have thought of (LOL), produced SOTA on ImageNet? And this paper is from the Google Brain team, right?

Noise, when applied to unlabeled data, enforces smoothness in the decision function. Different kinds of noise have different effects. For example, when augmentation is used, the model is forced to classify a normal image and its augmented version into the same category. Similarly, when dropout is used, the model acts like an ensemble of models. Hence a noised student is a more powerful model.
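The dropout-as-ensemble intuition can be seen in a few lines. This is a toy illustration, not something from the paper: keeping dropout active at prediction time and averaging several stochastic forward passes behaves like averaging an ensemble of "thinned" sub-networks, whereas eval mode gives a single deterministic prediction.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
                      nn.Dropout(p=0.5), nn.Linear(64, 3))
x = torch.randn(1, 20)

model.train()                                   # dropout keeps sampling sub-networks
samples = torch.stack([model(x).softmax(-1) for _ in range(32)])
ensemble_like = samples.mean(dim=0)             # averaged "ensemble" prediction

model.eval()                                    # single deterministic pass
single = model(x).softmax(-1)
```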

Apart from noise, two more things are very important for the noisy student to work well.

  1. Even though the architecture of the teacher and student can be the same, the student model should have higher capacity. Why? Because it has to fit a much larger dataset (labeled as well as pseudo-labeled), so it should be bigger than the teacher model.
  2. Balanced data: the authors found that the student model works well when the number of unlabeled images for each class is the same (a rough sketch of this balancing follows the list). I don't see any specific reason for this other than the fact that all classes in ImageNet have a similar number of labeled examples.
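Roughly, classes with too many pseudo-labeled images keep only their most confident ones, while classes with too few have their images duplicated. The helper name, the data layout, and `images_per_class` below are my own illustrative choices, not code from the paper.

```python
import numpy as np

def balance_pseudo_labels(indices, pseudo_labels, confidences, images_per_class):
    """Return indices into the unlabeled set, `images_per_class` per class."""
    balanced = []
    for c in np.unique(pseudo_labels):
        cls_idx = indices[pseudo_labels == c]
        cls_conf = confidences[pseudo_labels == c]
        order = np.argsort(-cls_conf)                 # most confident first
        if len(cls_idx) >= images_per_class:          # too many: keep the top ones
            balanced.append(cls_idx[order[:images_per_class]])
        else:                                         # too few: duplicate images
            balanced.append(np.resize(cls_idx[order], images_per_class))
    return np.concatenate(balanced)

idx = np.arange(6)
labels = np.array([0, 0, 0, 0, 1, 1])
conf = np.array([0.99, 0.95, 0.90, 0.80, 0.97, 0.60])
print(balance_pseudo_labels(idx, labels, conf, images_per_class=3))
# -> [0 1 2 4 5 4]: class 0 keeps its 3 most confident images,
#    class 1 duplicates one of its two
```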

This is a high-level overview of what has to be done for the student model during training, but there must be many finer details about the experiments carried out, right?