Learning Like Babies — Image Classification Using Semi-Supervised Learning

Source: Deep Learning on Medium

Inside AI

This project was developed by Evgenii Nikitin, @sunor.yavuz, and Manrique

Features of the visual hierarchy (left) and a simple CNN architecture for image classification (right)

Convolutional neural networks (CNNs), the family of models that has achieved the best performance in image classification, were inspired by the mammalian visual cortex. Despite dramatic progress in automated computer vision systems, most of the success of image classification architectures comes from labeled data. The problem is that most real-world data is not labeled.

According to Yann LeCun, father of CNNs and professor at NYU, the next ‘big thing’ in artificial intelligence is semi-supervised learning: a type of machine learning that makes use of unlabeled data for training, typically combining a small amount of labeled data with a large amount of unlabeled data. That is why much recent research has focused on unsupervised learning, which does not rely on large amounts of expensive supervision.

“The revolution will not be supervised” (Alyosha Efros)

Inspired by this concept, our class at NYU hosted a competition to design a semi-supervised algorithm trained on 512k unlabeled images and 64k labeled images from ImageNet.

Semi-supervised learning is in fact how infants do most of their learning. When we are born, we don’t know how the world works: we have no notion of gravity, we don’t understand depth, and we certainly don’t recognize human expressions. The human brain absorbs data mostly in an unsupervised or semi-supervised manner.

Babies learn orientation unconsciously from a very early age. This can be considered ‘semi-supervised learning’.

Method: Semi-supervised Representation Learning by Predicting Image Rotations

For the competition, our algorithm learns by recognizing the 2D rotation that is applied to the input image [2]. In a way, this resembles how babies learn to see the world through experience. For example, over time babies get used to how objects stand and how mountains lie below the sky. As it turns out, even this simple rotation-prediction task allows the network to learn features that are relevant for supervised tasks.

We used the ResNet architecture (specifically ResNet-50) for our final submission. ResNet models achieve state-of-the-art performance on many classification datasets. Moreover, ResNet residual units are invertible under certain conditions, which might help to preserve information that the early layers of the network acquire during pre-training. We used a large set of unlabeled images to pre-train a ResNet-50 model on the rotation pre-task. The goal of the model is to predict which of the four rotation angles (0°, 90°, 180°, 270°) was applied to each input image. By training the model on this auxiliary task, we expected it to learn features that would be helpful in the main classification task.
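The data side of the rotation pre-task described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions, not the competition code: the function name `make_rotation_batch`, the batch size, and the 32x32 image size are hypothetical, and images are assumed to be square so that all four rotations share one shape. Each unlabeled image produces four training examples, and the "label" is simply the index of the rotation that was applied.

```python
import numpy as np

# The four rotation classes the network has to tell apart (in degrees).
ANGLES = (0, 90, 180, 270)

def make_rotation_batch(images):
    """Build a self-labeled batch for the rotation pre-task.

    images: array of shape (N, H, W, C) with H == W (assumed square).
    Returns (rotated, labels) with shapes (4N, H, W, C) and (4N,),
    where labels[i] is the index into ANGLES of the applied rotation.
    """
    rotated, labels = [], []
    for label in range(len(ANGLES)):
        # k quarter-turns in the (H, W) plane; batch and channel axes are untouched.
        rotated.append(np.rot90(images, k=label, axes=(1, 2)))
        labels.append(np.full(len(images), label, dtype=np.int64))
    return np.concatenate(rotated), np.concatenate(labels)

# Stand-in for a batch of unlabeled images (random pixels, hypothetical sizes).
rng = np.random.default_rng(0)
unlabeled = rng.random((8, 32, 32, 3))
x, y = make_rotation_batch(unlabeled)
print(x.shape, y.shape)
```

The resulting `(x, y)` pairs can be fed to any four-way classifier. No human annotation is involved, which is what lets the unlabeled images contribute to training.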

Semi-supervised learning using image rotation (left) and data augmentation (right)

After pre-training, we trained a linear classifier on top of this feature extractor for the main task, then fine-tuned the whole network by gradually unfreezing the layers, starting from the top. We also used data augmentation. This strategy led to a significant boost in accuracy (28.8%) compared to the previous model.
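One way to picture gradual unfreezing is as a schedule that decides, for each epoch, which layer groups receive gradient updates. The sketch below is a hypothetical illustration: the group names, the `warmup_epochs` and `epochs_per_group` values, and the function `trainable_groups` are all assumptions and do not reflect the settings used in the competition.

```python
def trainable_groups(groups, epoch, warmup_epochs=5, epochs_per_group=2):
    """Return the layer groups that should be trainable at a given epoch.

    groups: layer-group names ordered from input to output; the last
            entry is the linear classifier on top of the feature extractor.
    During the warm-up only the classifier is trained. Afterwards one more
    group is unfrozen every `epochs_per_group` epochs, from the top down.
    """
    if epoch < warmup_epochs:
        n_unfrozen = 1
    else:
        n_unfrozen = 2 + (epoch - warmup_epochs) // epochs_per_group
    n_unfrozen = min(n_unfrozen, len(groups))
    return groups[-n_unfrozen:]

# Hypothetical grouping of a ResNet-style network.
GROUPS = ["stem", "stage1", "stage2", "stage3", "stage4", "classifier"]

for epoch in (0, 5, 7, 13):
    print(epoch, trainable_groups(GROUPS, epoch))
```

In a deep learning framework, the returned names would be used to switch gradient computation on or off for the corresponding parameters before each epoch. Unfreezing from the top first lets the task-specific layers adapt while the pre-trained early layers stay intact for longer.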


The figure below shows the training loss, validation loss, and accuracy of the final model over the course of training. The red line marks the switch from training only the linear classifier to fine-tuning the whole network.

Train loss, validation loss and accuracy for the final submission

The figure below shows the nearest neighbors of 4 random images from the validation dataset (first column). As can be seen, the semi-supervised rotation model clearly managed to learn meaningful representations of the images.
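A nearest-neighbor lookup of this kind can be sketched as follows. The choice of cosine similarity is an assumption made for illustration (the distance measure actually used is not stated here), the function name `nearest_neighbors` is hypothetical, and random vectors stand in for the feature vectors a trained network would produce.

```python
import numpy as np

def nearest_neighbors(features, query_idx, k=5):
    """Indices of the k rows of `features` most similar to row `query_idx`.

    features: array of shape (N, D), one non-zero feature vector per image.
    Similarity is cosine similarity (an assumption of this sketch).
    """
    # L2-normalize so that a dot product equals cosine similarity.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = normed @ normed[query_idx]
    sims[query_idx] = -np.inf  # never return the query image itself
    return np.argsort(-sims)[:k]

# Random stand-ins for the feature vectors of 100 validation images.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 64))
print(nearest_neighbors(feats, query_idx=0, k=4))
```

If the learned representation is meaningful, the images returned for a query should look semantically related to it, which is what the figure is meant to show.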

Nearest neighbors in the validation dataset: Semi-supervised model (left) and Supervised model (right)

Overall, we can conclude that semi-supervised pretraining on the unlabeled data helped to improve accuracy on the main classification task. However, it is not clear how much more beneficial semi-supervised training is as compared to pre-training on existing large labeled datasets such as ImageNet.

For more information, you may refer to our paper.


[1] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In International Conference on Computer Vision (ICCV), 2015.
[2] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. CoRR, abs/1803.07728, 2018.
[3] Aidan N. Gomez, Mengye Ren, Raquel Urtasun, and Roger B. Grosse. The reversible residual network: Backpropagation without storing activations. CoRR, abs/1707.04585, 2017.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016.
[6] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. CoRR, abs/1901.09005, 2019.
[7] Sebastian Ruder. An overview of gradient descent optimization algorithms. CoRR, abs/1609.04747, 2016.
[8] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. CoRR, abs/1605.07146, 2016.