Classifying Images without Labels

Original article was published on Deep Learning on Medium

Imagine you have a dataset of images but no labels. Can these images be grouped into clusters and classified? Can conventional clustering algorithms like K-means be applied to such images? In this article, I will explain a recently published paper which tackles these problems and produces state-of-the-art results. The paper can be found here.

Current Approaches

Applying clustering algorithms directly to images is quite difficult. For instance, consider two images of cats, one brown and one white. Their pixel values are so different that a distance computed on raw pixels says little about whether both images show a cat. So we need a feature extractor. Hand-engineered features are tedious to produce and do not perform that well in practice. Convolutional neural networks have been found to be quite capable feature extractors for images.

One solution is to train a network on a labeled dataset and fine-tune it on the unlabeled data. Such networks perform poorly and often produce unbalanced cluster assignments.

Another solution is an end-to-end approach like DeepCluster, where images are fed into a network and the network directly produces a probability distribution over clusters. Such approaches have been found to be quite sensitive to network initialization. This indicates that they are influenced by low-level features like color, which is exactly what we want to avoid, as the cat example above shows.

Semantic Clustering by Adopting Nearest Neighbors (SCAN)

The method described in the paper decouples feature representation learning from clustering, which results in state-of-the-art accuracy. The authors also add a further step, fine-tuning through self-labeling, to improve the results. Let's discuss the steps described in the paper.

Self-Supervised learning of feature representations

In self-supervised learning, we use the input data itself to generate the supervision signal. This might seem confusing, but it is quite intuitive for this problem. The main idea is that if we transform an image, whether by cropping it or changing its contrast, its high-level features should remain the same. So we can transform the images and obtain a new dataset of transformed images. We then use a convolutional neural network, Φθ, to produce feature representations of both the original and the transformed images, and optimize Φθ so that the distance between its outputs for an image and for its transformed version is minimized.
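To make the idea concrete, here is a minimal NumPy sketch of the quantity being minimized. It is not the paper's implementation: `embed` is a toy linear stand-in for Φθ, `augment` is a hypothetical transformation (random flip plus brightness shift), and the squared Euclidean distance is just one possible choice of distance. The sketch only evaluates the loss and does not train anything.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(images, weights):
    """Toy stand-in for Φθ (assumption): flatten each image and apply a linear map."""
    flat = images.reshape(len(images), -1)
    features = flat @ weights
    # L2-normalise so that distances are comparable across images.
    return features / np.linalg.norm(features, axis=1, keepdims=True)

def augment(images, rng):
    """Hypothetical transformation: random horizontal flip plus a brightness shift."""
    flip = rng.random((len(images), 1, 1)) < 0.5
    flipped = np.where(flip, images[:, :, ::-1], images)
    shift = rng.uniform(-0.1, 0.1, size=(len(images), 1, 1))
    return np.clip(flipped + shift, 0.0, 1.0)

def pretext_loss(images, weights, rng):
    """Mean squared distance between embeddings of images and their transformed versions."""
    z_original = embed(images, weights)
    z_transformed = embed(augment(images, rng), weights)
    return float(np.mean(np.sum((z_original - z_transformed) ** 2, axis=1)))

images = rng.random((8, 16, 16))          # 8 grayscale 16x16 images with values in [0, 1]
weights = rng.normal(size=(16 * 16, 32))  # untrained parameters θ
loss = pretext_loss(images, weights, rng)
```

In a real setup, `weights` would be updated by gradient descent to push `loss` down, and the network would be a deep CNN rather than a single linear layer.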

Loss function

Semantic Clustering Loss

Since we have the feature representations from Φθ, can’t we just apply a clustering algorithm like K-means? Such an approach produces degenerate results, with images often being clustered into a single class. To avoid this issue, a new network Φη is introduced, which takes the feature representation of an image from Φθ and produces a probability distribution over clusters. Φη is optimized by minimizing the following objective:
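Written out, the objective can be reconstructed roughly as follows. The notation here is my own shorthand and an assumption: p_I is the output of Φη for image I, N_I is the set of its K nearest neighbors, D is the dataset, C is the set of clusters, and λ is a weight on the entropy term.

```latex
\Lambda = -\frac{1}{|D|} \sum_{I \in D} \sum_{J \in N_I} \log \langle p_I, p_J \rangle
          + \lambda \sum_{c \in C} \bar{p}^{\,c} \log \bar{p}^{\,c},
\qquad
\bar{p}^{\,c} = \frac{1}{|D|} \sum_{I \in D} p_I^{\,c}
```

The first sum rewards consistent predictions between an image and its neighbors. The second sum is the negative entropy of the average prediction over the dataset.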

For an image I, let X be its feature representation from Φθ. First, the K images whose feature representations from Φθ are closest to X are found. Then the negative log of the dot product between the output of Φη for I and the output of Φη for each of these nearest neighbors is minimized. This term is smallest (that is, the dot product is largest) when the outputs of Φη are one-hot and I and its neighbors are assigned to the same cluster.
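The two steps in this paragraph, finding the neighbors and scoring their agreement, can be sketched in NumPy as below. The function names are hypothetical, Euclidean distance is an assumed choice of metric, and the loss is averaged over all image-neighbor pairs, which differs from a sum over neighbors only by a constant factor.

```python
import numpy as np

def nearest_neighbors(features, k):
    """Indices of the k closest rows (Euclidean distance) for each row, excluding itself."""
    sq_norms = np.sum(features ** 2, axis=1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(dists, np.inf)  # an image is not its own neighbor
    return np.argsort(dists, axis=1)[:, :k]

def consistency_loss(probs, neighbors):
    """Average of -log <p(I), p(J)> over every image I and each of its neighbors J."""
    dots = np.einsum("ic,ikc->ik", probs, probs[neighbors])
    return float(-np.mean(np.log(dots + 1e-8)))  # small constant avoids log(0)

# Toy features standing in for the outputs of Φθ: two well-separated groups.
features = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
neighbors = nearest_neighbors(features, k=1)

one_hot = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
uniform = np.full((4, 2), 0.5)

loss_one_hot = consistency_loss(one_hot, neighbors)  # close to 0
loss_uniform = consistency_loss(uniform, neighbors)  # close to log(2)
```

Confident predictions that agree with the neighbors drive the loss toward zero, while uncertain predictions are penalized even if they agree.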

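Entropy is a measure of randomness or uncertainty. The objective also includes the negative of the entropy of the cluster probabilities averaged over the whole dataset. Minimizing this negative entropy (that is, maximizing the entropy) spreads the predictions across clusters and prevents every image from being assigned to a single cluster.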
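A small NumPy sketch shows why this term discourages collapse. The function name and the toy probability matrices are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def cluster_entropy(probs):
    """Entropy of the cluster distribution obtained by averaging predictions over all images."""
    mean_probs = probs.mean(axis=0)
    return float(-np.sum(mean_probs * np.log(mean_probs + 1e-8)))

# Six images, three clusters.
collapsed = np.tile([1.0, 0.0, 0.0], (6, 1))  # every image assigned to cluster 0
balanced = np.repeat(np.eye(3), 2, axis=0)    # two images per cluster

entropy_collapsed = cluster_entropy(collapsed)  # close to 0
entropy_balanced = cluster_entropy(balanced)    # close to log(3)
```

Because the objective contains the negative of this entropy, the collapsed assignment is penalized and the balanced one is preferred, even though both are made of confident one-hot predictions.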

Fine Tuning through Self-Labeling

The authors of the paper observed that, depending on the value of K, images which were semantically different were sometimes clustered together, because not every nearest neighbor belongs to the same class. To improve the results, images whose probability of belonging to a cluster was above some threshold were selected and given that cluster as a label. These self-labeled images were then used, along with their neighbors, to optimize Φη further with a cross-entropy loss.
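The selection and the loss can be sketched as follows. This is a simplified illustration: the function names are hypothetical, the threshold of 0.95 is an arbitrary value chosen for the example, and the sketch leaves out the neighbors and any data augmentation.

```python
import numpy as np

def select_confident(probs, threshold):
    """Return indices and pseudo-labels of images whose top probability exceeds the threshold."""
    confidence = probs.max(axis=1)
    pseudo_labels = probs.argmax(axis=1)
    mask = confidence > threshold
    return np.flatnonzero(mask), pseudo_labels[mask]

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the given labels."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-8)))

# Toy outputs of Φη for four images and three clusters.
probs = np.array([
    [0.97, 0.02, 0.01],  # confident, cluster 0
    [0.40, 0.35, 0.25],  # uncertain, skipped
    [0.01, 0.01, 0.98],  # confident, cluster 2
    [0.05, 0.90, 0.05],  # below the threshold, skipped
])

indices, labels = select_confident(probs, threshold=0.95)  # hypothetical threshold
loss = cross_entropy(probs[indices], labels)
```

Only the first and third images pass the threshold here. As training makes the network more certain, more images would cross the threshold and join the self-labeled set.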


  1. Train Φθ using self-supervised training
  2. For each image, find its nearest neighbors in terms of the outputs of Φθ
  3. Train Φη using the outputs of Φθ
  4. Find images with probabilities above a certain threshold
  5. Train Φη on the self-labeled images with a cross-entropy loss


Predictions by SCAN

The approach described above has produced state-of-the-art results on CIFAR-10 and CIFAR-100, but it comes with its own set of disadvantages. There are many hyperparameters to consider, such as the number of clusters, the value of K and the transformations used for self-supervised training. The algorithm also performs less well when the number of classes is large, as can be observed in the case of ImageNet. Still, the paper paves the way for better unsupervised image classification models.