Self-Supervised Learning and the Quest for Reducing Labeled Data in Deep Learning

Source: Deep Learning on Medium

I could bother you with some more examples, but I guess these 2 speak to the point I want to make.

Current deep learning is predicated on large-scale data. These systems work like a charm when their environment and constraints are met. However, they also fail catastrophically in some weird situations.

Let’s return to classification on ImageNet for a bit. To contextualize, the database has an estimated human top-5 error rate of 5.1%. On the other hand, the current state-of-the-art deep learning top-5 error rate is around 1.8%. Thus, one could reasonably argue that deep learning is already better than humans on this task. But is it?

If that is the case, how can we explain such things?

Credits: Attacking Machine Learning with Adversarial Examples

These examples, which became very popular on the internet, are called adversarial examples. We can think of creating one as an optimization task designed to fool a machine learning model. The idea is simple:

How can we change an image previously classified as a “panda” so that the classifier thinks it is a “gibbon”?

In other words, adversarial examples are inputs carefully designed to fool an ML model into making a classification mistake.

Credits: One Pixel Attack for Fooling Deep Neural Networks

As we can see, the optimization is so effective that we can’t perceive (with the naked eye) the difference between the real (left) and the adversarial (right) images. Indeed, the noise responsible for the misclassification is not any type of known signal. Instead, it is carefully designed to exploit the hidden biases in these models. Moreover, recent studies have shown that in some situations we only need to change a single pixel to completely fool the best deep learning classifiers.
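
To make the optimization view concrete, here is a minimal sketch of a gradient-sign perturbation on a toy linear classifier, written in plain Python. The weights, the input, and the step size `epsilon` are all made up for illustration. Real attacks target deep networks and use the gradient of the network's loss, so this only sketches the idea and is not the method from the works credited above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Probability of class 1 (say, "panda") under a toy logistic model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

def sign(v):
    return (v > 0) - (v < 0)

def gradient_sign_attack(weights, x, epsilon):
    """Shift every input feature by epsilon in the direction that lowers
    the class-1 score. For a linear model, the gradient of the score with
    respect to the input is simply the weight vector."""
    return [xi - epsilon * sign(w) for w, xi in zip(weights, x)]

# Hypothetical model and input, chosen only for illustration.
weights = [0.5, -0.3, 0.8, -0.6]
bias = 0.0
x = [0.5, 0.5, 0.3, 0.5]
epsilon = 0.05

x_adv = gradient_sign_attack(weights, x, epsilon)

p_clean = predict(weights, bias, x)      # slightly above 0.5: class 1
p_adv = predict(weights, bias, x_adv)    # below 0.5: the prediction flips
```

No feature moves by more than `epsilon`, yet the predicted class changes. The small per-feature changes all push in the same direction, so their effect on the score adds up.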

At this point, we can see that the problems are starting to stack on top of each other. Not only do we need a lot of examples to learn a new task, but we also need to make sure that our models learn the right representations.

Source: Fooling Image Recognition with Adversarial Examples

When we see deep learning systems fail like that, an interesting discussion arises. Obviously, we humans do not get easily fooled by examples like these. But why is that?

One can argue that when we need to grasp a new task, we don’t actually learn it from scratch. Instead, we use a lot of prior knowledge that we have acquired throughout our lives and experiences.

We understand gravity and its implications. We know that if we drop a cannonball and a bird feather from the same starting point, the cannonball will reach the ground first because air resistance affects the two objects differently. We know that objects are not supposed to float in the air. We have common-sense knowledge about how the world works. We know that if our father has another child, he or she will be our sibling. We know that if we read in a paper that someone was born in the 1900s, he or she is probably no longer alive, because we know (by observing the world) that people don’t often live more than 120 years.

We understand causality between events, and so on. Most curiously, we learn many of these high-level concepts very early in life. Indeed, we learn concepts like gravity and inertia at only 6 to 7 months of age. At this age, interaction with the world is almost nonexistent!

Early Conceptual Acquisition in Infants [from Emmanuel Dupoux]. Yann LeCun slides

In this sense, some might say it would not be “fair” to compare the performance of algorithms with that of humans.

In one of his talks on self-supervised learning, Yann LeCun argues that there are at least 3 ways to get knowledge.

  • Through observation
  • From supervision (mostly from parents and teachers)
  • From reinforcement feedback
Different sources of knowledge humans acquire through life. Learning by observation/interaction, supervision, and feedback.

However, if we consider a human infant as an example, interaction at that age is almost nonexistent. Nevertheless, infants manage to build an intuitive model of the physics of the world. Thus, high-level knowledge like gravity can only have been learned by pure observation. At least, I haven’t seen any parents teaching physics to a 6-month-old baby.

Only later in life, when we master language and start going to school, do supervision and interaction (with feedback) become more present. But more importantly, by the time we reach these stages of life, we have already developed a robust world model. And this might be one of the main reasons why humans are so much more data-efficient than current machines.

As LeCun puts it, reinforcement learning is like the cherry on a cake, supervised learning is the icing, and self-supervised learning is the cake itself!

Source: Yann LeCun

Self-Supervised Learning

“In self-supervised learning, the system learns to predict part of its input from other parts of its input.” (LeCun)

Self-supervised learning derives from unsupervised learning. It’s concerned with learning semantically meaningful features from unlabeled data. Here, we are mostly concerned with self-supervision in the context of Computer Vision.

The general strategy is to transform an unsupervised problem into a supervised task by devising a pretext task. Usually, a pretext task has a generic goal: to make the network capture visual features from images or videos.

Pretext tasks and common supervised problems share some similarities.

We know that supervised training requires labels. These, in turn, are usually collected by human annotators. However, there are many scenarios in which labels are either very expensive or impossible to get. Moreover, we also know that deep learning models are data-hungry by nature. As a direct result, the need for large-scale labeled datasets has become one of the main barriers to further advancements.

Well, self-supervised learning also requires labels for the training of pretext tasks. However, there is a key difference here. The labels (or pseudo-labels) used to learn pretext tasks have a different characteristic.

In fact, for self-supervised training, the pseudo-labels are derived solely from attributes of the data.

In other words, there is no need for human annotation. Indeed, the main difference between self-supervised and supervised learning lies in the source of the labels.

  • If the labels come from human annotators (as in most datasets), it is a supervised task.
  • If the labels are derived from the data itself, so that we can generate them automatically, we are talking about self-supervised learning.

Recent studies have proposed many pretext tasks. Some of the most common ones include:

  • Rotation
  • Jigsaw puzzle
  • Image Colorization
  • Image inpainting
  • Image/Video Generation using GANs
Credits: Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey
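
As an illustration of how pseudo-labels can come from the data alone, here is a minimal sketch of the rotation pretext task in plain Python. The tiny nested-list "image" and the helper names `rotate90` and `make_rotation_example` are made up for this example. A real pipeline would work on image tensors and feed the pairs to a network.

```python
import random

def rotate90(image):
    """Rotate a 2D grid (a list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def make_rotation_example(image, rng):
    """Return (rotated_image, pseudo_label). The label in {0, 1, 2, 3}
    counts how many 90-degree turns were applied, so it is generated
    automatically and needs no human annotation."""
    label = rng.randrange(4)
    rotated = image
    for _ in range(label):
        rotated = rotate90(rotated)
    return rotated, label

# A tiny 2x3 "image" of pixel intensities, made up for illustration.
image = [[1, 2, 3],
         [4, 5, 6]]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
batch = [make_rotation_example(image, rng) for _ in range(8)]
```

Each element of `batch` is an (input, target) pair for an ordinary 4-way classification problem, even though nobody labeled anything by hand.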

During self-supervised training, we challenge the network to learn the pretext task. Again, the pseudo-labels are automatically generated from the data itself and used as the training targets. Once training is over, we often transfer the learned visual features to a second problem: the downstream task.

In general, the downstream task can be any supervised problem. The idea is to use the self-supervised features to improve the performance of downstream tasks. Usually, downstream tasks have limited data and overfitting is a big concern. Here, we can see the similarity to common transfer learning using ConvNets pre-trained on large labeled databases like ImageNet, but with one key advantage.

With self-supervised training, we can pre-train models on incredibly large databases without worrying about human labels.

In addition, there is a subtle difference between pretext and usual classification tasks. In pure classification, the network learns representations with the goal of separating the classes in the feature space. In self-supervised learning, pretext tasks usually challenge the network to learn more general concepts.

Take the image colorization pretext task as an example. In order to excel at it, the network has to learn general-purpose features that explain many characteristics of the objects in the dataset. These include the objects’ shape and general texture, as well as the effects of light, shadows, occlusions, etc.
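
A minimal sketch of how training pairs for colorization could be built, in plain Python: the grayscale version of an image is the input, and the original color image is the target. The helper names and the 2x2 image are made up for illustration. The grayscale conversion uses the common ITU-R BT.601 luma weights, which is an assumption of this sketch; published colorization methods often work in other color spaces, such as Lab.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) pixels into rows of luminance values
    using the ITU-R BT.601 luma weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]

def make_colorization_example(rgb_image):
    """Return (grayscale_input, color_target). The target is just the
    original image, so the supervision comes for free from the data."""
    return to_grayscale(rgb_image), rgb_image

# A tiny 2x2 RGB "image", made up for illustration.
rgb_image = [[(255, 0, 0), (0, 255, 0)],
             [(0, 0, 255), (255, 255, 255)]]

gray_input, color_target = make_colorization_example(rgb_image)
```

The network only ever sees `gray_input` and has to recover `color_target`, which it cannot do well without some notion of what the objects in the image are.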

In short, by solving the pretext task, the network will learn semantically meaningful features that can be easily transferred to learn new problems. In other words, the goal is to learn useful representations from unlabeled data before going supervised.


Self-supervised learning allows us to learn good representations without using large annotated databases. Instead, we can use unlabeled data (which is abundant) and optimize pre-defined pretext tasks. We can then use these features to learn new tasks in which data is scarce.

Thanks for reading.