Source: Deep Learning on Medium
In this post, we will leverage a pre-trained convolutional neural network to extract feature vectors from images. These features serve as a compact representation of each image, such that related images have similar feature vectors.
Brief Introduction to Convolutional Neural Networks
A convolutional neural network (CNN, or ConvNet) is a class of deep neural networks most commonly applied to analyzing visual imagery. A CNN can be used to extract higher-level representations of image content. Instead of preprocessing the data to derive features like textures and shapes, a CNN takes the image’s raw pixel data as input, learns how to extract these features, and ultimately infers what object they constitute.
Leveraging Pretrained Models
Training a convolutional neural network to perform image classification typically requires a considerable amount of training data and can be very time-consuming, taking days or even weeks to complete. Instead, we can leverage existing image models such as VGG16, which are trained on enormous datasets, and adapt them for use in our own tasks.
One conventional technique for leveraging pre-trained models is feature extraction: we take the trained model and remove its last layer, the classification layer.
The final output layer of the truncated network is then the second 4096-neuron fully-connected layer, “fc2 (Dense)”, so the extracted feature is a 4096-element vector per image.
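A minimal sketch of this truncation in Keras (the layer name "fc2" matches the Keras VGG16 summary; the random input stands in for a real preprocessed image):

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model

# Load VGG16 pre-trained on ImageNet, keeping the fully-connected layers.
base = VGG16(weights="imagenet", include_top=True)

# Truncate the network at "fc2": its 4096-d activation is our feature vector.
extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def extract_features(pixels):
    """pixels: RGB array of shape (224, 224, 3) with values in [0, 255]."""
    x = preprocess_input(np.expand_dims(pixels.astype("float32"), axis=0))
    return extractor.predict(x)[0]  # shape: (4096,)
```

In practice you would load each image with a helper such as `tensorflow.keras.preprocessing.image.load_img(path, target_size=(224, 224))` before calling `extract_features`.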
These extracted feature vectors exhibit correlation and information redundancy, and higher-dimensional features lead to higher computational complexity: operating on 4096-element vectors is inefficient in terms of both memory and processing speed. We therefore adopt principal component analysis (PCA) for feature selection and dimensionality reduction. PCA reduces the dimensionality of the feature vectors from 4096 to far fewer components while keeping a description that is still faithful to the original data, by preserving the relative inter-point distances.
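A small sketch of the reduction step with scikit-learn; the random matrix stands in for the real stack of VGG16 features, and 256 components is an assumed target (in practice you would tune it against the explained variance ratio):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for the real data: 1,000 images x 4096-d "fc2" features.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 4096))

# Project down to 256 dimensions while preserving as much variance as possible.
pca = PCA(n_components=256)
reduced = pca.fit_transform(features)
print(reduced.shape)  # (1000, 256)
```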
Comparing an image with millions of others for similarity
Generating features for two images with the VGG16 model and computing their cosine similarity is fast, taking a few hundred milliseconds on a GPU instance. But computing similarity scores for a million pairs on the fly, even with pre-extracted and stored image features, has a latency on the order of minutes.
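The pairwise comparison itself is just the cosine of the angle between two feature vectors, as in this sketch:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: near 1.0 means very similar,
    # 0.0 means orthogonal, and negative values mean opposing directions.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(v, v))   # ~1.0 (identical vectors)
print(cosine_similarity(v, -v))  # ~-1.0 (opposite vectors)
```

The cost that blows up is not this function but the number of calls: scoring one query against a million stored vectors means a million dot products per query.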
There are a few popular methods for approximate nearest-neighbour (KNN) search, such as Elasticsearch over vectors, locality-sensitive hashing (LSH), and DIMSUM. Annoy is a library that fetches (approximate) nearest neighbours, and it is fast compared to the brute-force approach above.
A quick recap of the steps: Prepare our image database. Download the trained VGG16 model and remove its last layer. Convert our image database into feature vectors using our dissected VGG model; if the output layer of the dissected model produces convolutional feature maps, flatten them and concatenate them into a single vector. Compute similarities between our image feature vectors using a measure such as cosine similarity or Euclidean distance. For each image, select the images with the top-k similarity scores to build the recommendation. Link to my code in GitHub.
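The final selection step can be sketched in a few lines of NumPy; the random matrix stands in for the real (reduced) feature vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 256))               # pretend: 100 images, 256-d
feats /= np.linalg.norm(feats, axis=1, keepdims=True)  # unit-normalise

sims = feats @ feats.T             # cosine similarity matrix, shape (100, 100)
np.fill_diagonal(sims, -np.inf)    # exclude each image from its own results

k = 5
topk = np.argsort(-sims, axis=1)[:, :k]  # per-image indices of top-k matches
print(topk.shape)  # (100, 5)
```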