Sketch Image Clustering based on Semantic Similarity

Original article was published on Deep Learning on Medium

I have always had a huge interest in Deep Learning and its applications to Computer Vision. I was already familiar with building Deep Learning architectures for Computer Vision tasks using the Keras library. I had heard about PyTorch and its fast, simple-to-use API, but didn’t have a proper resource to learn it. Fortunately, a few weeks back I got to know about the Jovian community and that it was hosting the amazing course Deep Learning with PyTorch: Zero to GANs.

I enrolled in this course and found PyTorch to be a really great framework. I also got to know about the amazing fastai library built on top of PyTorch, explored it, and used it to build this course project for the Deep Learning with PyTorch: Zero to GANs course.


In this post we’ll see how we can use the fastai library to build an image classification model that classifies sketch images, and then cluster those images based on their semantic similarities.

We’ll divide the whole project into the following phases:

1. Data Preparation

2. Building the Image Classification model

3. Testing on test images

4. Clustering Images based on Semantic Similarity

1. Data Preparation

The dataset that we are using is the TU-Berlin Sketch Dataset. It consists of 20,000 sketch images belonging to 250 different categories (80 images per category). The images are high quality, with a size of (1111, 1111).

Load the dataset into memory

We divide the images into train, validation, and test splits of (48, 16, 16) per category, making 12,000 training images, 4,000 validation images, and 4,000 test images, by defining the split_dataframe function using sklearn’s train_test_split:
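A minimal sketch of such a split function, shown on a tiny synthetic file list instead of the real dataset (the column names `fname`/`label` are assumptions, not from the original code):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_dataframe(df, label_col="label", val_frac=0.2, test_frac=0.2, seed=42):
    """Stratified train/val/test split: every category keeps the same
    ratio in each split (48/16/16 out of 80 images per category)."""
    # First carve off the test split
    train_val, test = train_test_split(
        df, test_size=test_frac, stratify=df[label_col], random_state=seed)
    # Then split the remainder; rescale val_frac relative to what is left
    rel_val = val_frac / (1.0 - test_frac)
    train, val = train_test_split(
        train_val, test_size=rel_val, stratify=train_val[label_col],
        random_state=seed)
    return train, val, test

# Tiny synthetic frame standing in for the real file list: 80 rows per category
demo = pd.DataFrame({
    "fname": [f"cat{c}_{i}.png" for c in range(3) for i in range(80)],
    "label": [f"cat{c}" for c in range(3) for _ in range(80)],
})
train_df, val_df, test_df = split_dataframe(demo)
print(len(train_df), len(val_df), len(test_df))  # 144 48 48
```

On the full 250-category dataset the same 60/20/20 split yields the 12,000/4,000/4,000 counts above.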

Create the data for the fastai learner with an image size of (256, 256) and a batch size of 128:

2. Building the Image Classification model

Phase 1:

In this phase, first we define a model using fastai’s cnn_learner with ResNet34 as base architecture:

All the layers of the pre-trained model are frozen at first. We find the best learning rate for training using the learn.lr_find() function:

We then fit the model on the data for 7 cycles using the learning-rate suggestions from the plot produced by lr_find():

We obtain a validation accuracy of 74.85%. We now unfreeze the pre-trained model and train again using the suggestions from lr_find():
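The unfreeze-and-retrain step can be sketched as follows; the discriminative learning-rate range and the cycle count are placeholder assumptions, not values from the original:

```python
def finetune_unfrozen(learn, n_cycles=3, lrs=slice(1e-6, 1e-4)):
    """Unfreeze every layer and fine-tune with discriminative learning
    rates: smaller for early layers, larger for the new head."""
    learn.unfreeze()
    learn.lr_find()                     # re-check the learning-rate landscape
    learn.fit_one_cycle(n_cycles, lrs)
    return learn
```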

We move to a validation accuracy of 76.88% this time. This is a good result, because the dataset is difficult for the model to learn; it is difficult for humans too (human accuracy on the dataset is 73.10%).

Phase 2:

The size of each image in the dataset is (1111, 1111), while the size we used for training is (256, 256). Since we obtained a validation accuracy of over 75% at this size, we can use a larger image size to help the model learn better.

So, we create new data for the learner using an image size of (512, 512) and a batch size of 24.

But we have already trained the model on the data with size 256 and batch size 128. If we retrained the model from scratch on the new dataset, all the earlier training would be lost! Hence, we use the weights learned so far and train further on the new dataset.

We freeze the weights first and then follow the same steps as before. The progress after training for 5 cycles:

We then unfreeze all layers and retrain the model for a few more epochs:
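The whole progressive-resizing phase can be sketched like this; `dls_512` is a hypothetical name for the new (512, 512) DataLoaders, and the unfrozen cycle count and learning-rate slice are assumptions:

```python
def train_phase2(learn, dls_512, n_frozen=5, n_unfrozen=3):
    """Progressive resizing: keep the weights learned at 256px and
    continue training on the 512px DataLoaders (batch size 24)."""
    learn.dls = dls_512                 # swap in the higher-resolution data
    learn.freeze()                      # head-only training first
    learn.fit_one_cycle(n_frozen)
    learn.unfreeze()                    # then fine-tune all layers
    learn.fit_one_cycle(n_unfrozen, slice(1e-6, 1e-4))
    return learn
```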

3. Testing on test images:

We now test the model on the test dataset of 4000 images that we defined earlier while splitting the dataset:

Get Test Accuracy:

We run the model on the test dataset and compute the accuracy using the accuracy function:

Woah!! The accuracy we obtain is 78.90%, which means our model outperforms the human accuracy of 73.10%!

4. Clustering Images based on Semantic Similarities:

Now, we have the predictions on the test dataset saved in a dataframe. We now want to make clusters of semantically similar images.

For this, we convert the predicted labels of the images into high-dimensional vectors using spaCy’s word2vec model trained on an English-language corpus, and calculate the pairwise cosine similarity of all unique vectors.

We then prepare a cosine-distance matrix (cosine distance = 1 − cosine similarity) and feed it to the Agglomerative Clustering algorithm, which forms clusters based on a given distance threshold.
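The clustering step can be sketched with hand-made toy vectors standing in for spaCy’s word embeddings (the real pipeline would use something like `nlp(label).vector` per predicted label). SciPy’s average-linkage hierarchical clustering with a distance cut is used here as an equivalent of sklearn’s AgglomerativeClustering with average linkage and a distance threshold:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_by_similarity(vectors, threshold=0.43):
    """Cluster label vectors by cosine distance (1 - cosine similarity)
    using average-linkage agglomerative clustering."""
    dists = pdist(vectors, metric="cosine")   # condensed pairwise distances
    tree = linkage(dists, method="average")   # build the merge hierarchy
    # Cut the hierarchy at the given cosine-distance threshold
    return fcluster(tree, t=threshold, criterion="distance")

# Toy vectors standing in for word embeddings of the predicted labels:
# the two birds point one way, the two vehicles another.
labels = ["pigeon", "owl", "car", "truck"]
vecs = np.array([
    [1.0, 0.9, 0.0],   # pigeon
    [0.9, 1.0, 0.0],   # owl
    [0.0, 0.1, 1.0],   # car
    [0.1, 0.0, 1.0],   # truck
])
ids = cluster_by_similarity(vecs)
print(dict(zip(labels, ids)))  # birds share one cluster id, vehicles another
```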

We see that for the test dataset we get 85 clusters with a distance threshold of 0.43.

Example Clusters formed:

Cluster of birds

Some more examples of the clusters formed are:

Cluster of flying objects

The resulting clusters seem truly amazing!

Future Work:

I have used the ResNet34 architecture as the base; a deeper model such as ResNet50 could be used to obtain better results.

Also, another idea could be to add a word-embedding layer at the end of the model architecture itself, so that similar images are clustered based on the feature similarities learned by the CNN.