Source: Deep Learning on Medium
Finding visual content with semantic meaning is an important task. It has applications in e-commerce (using a camera to shop for similar furniture) and in recommendation systems (like PinSage at Pinterest or Discover Weekly at Spotify). Loc2Vec is one such algorithm: it takes a geographic location as input and outputs an embedding. In my previous blog about my Loc2Vec implementation, I briefly touched upon a wonderful library called Annoy but didn’t elaborate. It is such a good tool, however, that it is worth exploring in detail. In this blog, I will use Annoy to build a rudimentary reverse image search system using deep learning, in which a sample image formulates an image query.
Instead of using map data to illustrate this, we will use cartoons. Teasing meaning out of map tiles for similarity is hard: it would not be evident whether the model made a mistake in creating the embedding or whether the similarity query went wrong. It is far easier to develop something if we keep the moving parts to a minimum. Also, cartoons are much more fun, aren’t they? In this “toy reverse image search system”, we will use PyTorch models pre-trained on Imagenet, without any fine-tuning, to generate the embeddings / hashes. We will use Annoy to build a nearest neighbor search database for image retrieval. We will also compare how suitable various stock CNN models are for this dataset. My code implementing this is available here.
Annoy is a library that enables approximate nearest neighbor search for points in n-dimensional space. It is similar to libraries like FAISS, NMSLIB, and LSH-based approaches, and it has a very elegant API that makes it super easy to use (https://github.com/spotify/annoy#python-code-example). Erik, the creator of the library, has also blogged in detail about various usages, and about what happens under the hood; you should check it out, it is very well written.
For building this project, we will use a cartoon dataset that I downloaded. I am not sharing it, as it is not mine to share. However, you should be able to substitute any set of images and it should work. A manga comics search engine, anyone?
About the dataset:
(You can safely skip this section if you are here for just the deeplearning part of it)
Amul girl is the mascot of the Amul butter company and was created more than fifty years ago to sell more dairy products. She is a household name in India, and she still continues to be featured in cartoons depicting current events in sports, politics, international affairs, and even obituaries. She has even asked daring questions when citizens failed to ask them. She has been a source of entertainment for a billion people and is the Indian equivalent of the Got Milk? campaign, but a far more successful one.
You know an ad campaign or product is successful when its name becomes synonymous with the generic one. A few examples that come to mind are Google, Kleenex, escalator, bubble wrap, Photoshop, and Post-it. Amul, though, is synonymous not with butter but with cute: in India, people look at a cute kid and say “Amul baby”.
To me, the more inspiring part is the man behind the operation, Dr. Verghese Kurien, one of my childhood heroes. A metallurgy and nuclear physics graduate, he reluctantly went into dairy engineering and literally created a revolution, the “white revolution”. He is responsible for making India the largest producer of milk, and the best part is that he doesn’t even drink milk. He is the best kind of engineer there is: a nation builder. His birthday is celebrated as National Milk Day in India.
This cartoon dataset depicts decades of events from India and does not have any tags to index it properly, making it a fun dataset to build a search engine around.
If we want to search images at a semantic level, we cannot use classical techniques like histograms, SIFT, or HOG to reduce the image to compact dimensions. At least, not trivially: people have gotten them to work for particular domains, with a lot of feature-engineering effort. CNNs, however, are very good at this job, and we will use several vanilla CNNs pre-trained on the Imagenet data and categories.
We will use CNNs as a black box that takes an image as input and emits a vector of size 1000 (this is the number of categories in Imagenet) as illustrated in the image and function below. Each image is reduced to a vector and stored in Annoy.
At query time, the vector of a new image, or the index of an image already in the database, is provided as input, and Annoy returns a list of nearest neighbors.
The two code snippets above form the core of the image retrieval engine and are agnostic to any deep learning framework. Of course, the whole code is available here too!
Note: normally, we wouldn’t use the network without fine-tuning. For example, Imagenet does not contain anything about cricket or an elephant-headed human, both of which frequently appear in my dataset.
Also, instead of using the last layer, we can use the penultimate layer of the networks. The intuition behind this idea is that the last layer is tuned to the Imagenet categories, while the earlier layers are more generic. We could also take more data from earlier layers to improve the results. Some amount of experimentation is required to arrive at the right balance for the target dataset. These explorations are beyond the scope of this blog and are left as an exercise to the reader.
I will examine the results in two ways. First, I will take one image at a time and examine the nearest neighbor results for 5 different neural networks. I will briefly discuss how one network performs versus the others, though you may be able to see other differences that I missed. One thing I need to note is that these images are not random: I cherry-picked them for interestingness and to illustrate both success cases and failure cases.
Next, we will examine the results Kaggle style: use an ensemble of models and show just the top-ranking results. You will notice that each network operates at a different level of semantics, and that oftentimes the ensemble performs worse than individual models. Perhaps fine-tuning the networks would fix this… but the point I am trying to make is that ensembling is not a quick-fix solution for all problems, at least not without a lot of thought and work.
In the Usain Bolt example, we can see that all the results are quite good! I am not sure if it is because of the color palette of the queried image; it was an anomaly in my dataset!
In the Paul Walker example above, we see the evolution of the results from Alexnet to Inception. Alexnet fails completely, but Inception even recognizes the interior of the car and shows it as a similar image! Not bad!
The rocket launch image impressed me: while Alexnet does not give anything meaningful, VGG gives something that matches at the contour level. Densenet, however, is able to make the connection between a rocket and a cellphone tower. Inception takes it to the next level and comprehends that this is about space/Mars and the military! As we go from simpler models to complex ones, we can see the evolution in the semantic associations the models make. Unfortunately, they also make horrible mistakes: the very first result from Inception is completely irrelevant. Perhaps fine-tuning would fix this.
We see a similar kind of evolution above, but this time I prefer the results of Alexnet and Densenet instead. No single model fits all usages, especially without fine-tuning.
In the image above, however, the VGG/Resnet results are better, and Inception and Alexnet return irrelevant results, further driving home the point that each network has its own advantages and disadvantages.
Below are the results from ensembling the outputs of all the models and picking the most repeated items.
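A minimal sketch of such a voting scheme; the model names and neighbor ids below are made up for illustration, and in real use each list would come from that model's Annoy index:

```python
from collections import Counter

# Each model's ranked neighbor ids for the same query (illustrative data)
per_model_results = {
    'alexnet': [4, 9, 1, 7, 2],
    'vgg':     [9, 4, 3, 1, 8],
    'resnet':  [9, 1, 4, 6, 5],
}

# Tally how many models returned each image id
votes = Counter()
for ranked in per_model_results.values():
    votes.update(ranked)

# The most repeated ids across models win
ensemble = [img_id for img_id, _ in votes.most_common(3)]
print(ensemble)
```

A simple count like this ignores each model's ranking; weighting votes by rank (or by distance) is a natural refinement, but as noted above, naive ensembling can still underperform a single good model.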
Hope you all had fun reading this. If you want to explore further, here is a cartoon dataset that you might like: Danbooru 2017 dataset.