Image Based Product Recommendation

About the data

We obtained a rich, high-resolution fashion product data-set from Kaggle, consisting of about 44k images spanning 143 distinct classes such as T-shirts, jeans, and watches. The images are roughly 2400 x 1600 pixels, which makes this a fairly large data-set and poses a real challenge for both image pre-processing and training deep learning models.

Class-wise data-set distribution

Exploring low-level features

While building an image recommendation system, the first task is to decide which feature descriptors of an image to take into consideration; this varies from requirement to requirement. Some of the most commonly used descriptors capture the color, texture, and shape of an image. We used the following features for classification.

The HSV histogram feature captures the color distribution of the image over pixel intensity values between 0 and 255. The feature vector has dimensions 255 x 1 and represents the pixel frequency distribution across the image.
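As a sketch of how such a histogram might be computed with OpenCV (the value-channel choice and the normalization step are our assumptions, guided by the 255 x 1 dimensionality stated above):

import cv2

def hsv_histogram(image_bgr, bins=255):
    # Color-distribution feature: histogram over the HSV representation.
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    # Histogram of the value channel over the 0-255 intensity range,
    # matching the 255 x 1 feature vector described above.
    hist = cv2.calcHist([hsv], [2], None, [bins], [0, 256])
    return cv2.normalize(hist, hist).flatten()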

The edge detection feature corresponds to the edges of the image, detected using the Sobel edge detection algorithm, which returns a feature vector the size of the image.
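A minimal sketch of the Sobel step, assuming OpenCV:

import cv2
import numpy as np

def sobel_edges(image_bgr):
    # Edge feature: Sobel gradient magnitude, flattened to image size.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)  # horizontal gradient
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)  # vertical gradient
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    return magnitude.flatten()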

Low-level features visualization

The texture of an image is also an important feature for analyzing its pixel distribution. Our approach uses a Gabor filter to obtain a one-dimensional texture feature vector of image size.
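A sketch using OpenCV's getGaborKernel; the kernel parameters below are illustrative choices, not the article's exact values:

import cv2

def gabor_texture(image_bgr):
    # Texture feature: response of a single Gabor filter, flattened.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Kernel size, sigma, orientation, wavelength, and aspect ratio are
    # illustrative parameters.
    kernel = cv2.getGaborKernel(ksize=(21, 21), sigma=5.0, theta=0,
                                lambd=10.0, gamma=0.5)
    response = cv2.filter2D(gray, cv2.CV_64F, kernel)
    return response.flatten()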

The histogram of oriented gradients (HOG) plays an important role in object and shape detection. It yields a one-dimensional feature vector of size 3780 x 1.
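The 3780 x 1 size matches OpenCV's default HOGDescriptor on a 64 x 128 detection window, so a sketch could look like this (the resize step is our assumption):

import cv2

# OpenCV's default HOGDescriptor (64 x 128 window, 8 x 8 cells, 9 bins)
# produces exactly the 3780-dimensional vector mentioned above.
hog = cv2.HOGDescriptor()

def hog_feature(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, (64, 128))  # match the detection window
    return hog.compute(resized).flatten()  # shape: (3780,)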

Although these features represent an image in numerical form reasonably well, they were insufficient for retrieving the level of detail our application required.

Similarity Computation and Generating Recommendations

We generate feature vectors for each of the training and testing images. Then, for each test image, we compute its cosine similarity with the leaders of all 143 predefined clusters, where the leader of each cluster is chosen arbitrarily. We do this for every type of feature descriptor mentioned above, average the resulting cosine similarities, and pick the 5 most similar leaders. Cosine similarity is then computed between the test image and every image in those 5 clusters, again across all feature descriptors, and finally the top k most similar images are returned.
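A condensed sketch of this two-stage leader-based retrieval; the data structures and helper names here are hypothetical, not the article's code:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def recommend(test_feats, leader_feats, cluster_images, k=10):
    # test_feats:     {descriptor_name: 1-D feature vector} for the test image
    # leader_feats:   {descriptor_name: (143, d) matrix}, one row per leader
    # cluster_images: list of 143 lists of (image_id, {descriptor: vector})
    # Stage 1: mean cosine similarity to each leader across all descriptors.
    sims = np.mean([cosine_similarity(test_feats[d].reshape(1, -1),
                                      leader_feats[d])[0]
                    for d in test_feats], axis=0)
    top_clusters = np.argsort(sims)[-5:]  # the 5 most similar leaders

    # Stage 2: score every image inside the shortlisted clusters the same way.
    scored = []
    for c in top_clusters:
        for image_id, feats in cluster_images[c]:
            s = np.mean([cosine_similarity(test_feats[d].reshape(1, -1),
                                           feats[d].reshape(1, -1))[0, 0]
                         for d in test_feats])
            scored.append((s, image_id))
    return [img for _, img in sorted(scored, reverse=True)[:k]]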

The accuracy achieved with this low-level feature model did not turn out well: it reached only 51%.

Digging Deeper with Deep Learning

The results achieved with low-level features were not convincing, so we decided to explore techniques that could extract a more descriptive and distinctive feature representation of the images, one that could then be used to compute similarity scores and return the top K recommendations to the user. That search led us to deep learning techniques, which are quite powerful at extracting patterns and features from images.

Leveraging the power of pre-trained models

Before using our data-set to train any deep learning model of our own, we decided to start with pre-trained models and check how they performed on our data-set. For this purpose, we mainly used five of the most popular CNN-based architectures, namely VGG, ResNet, MobileNet, DenseNet, and Inception.

VGG features model flow-chart

Features Extraction using pre-trained models

To use pre-trained deep learning models as feature extractors, the first step was to remove the final output layer, since we did not intend to use these models as classifiers. This leaves us with the output of a convolutional layer, which is pooled and reduced using a global pooling layer and flattened into a linear feature vector for the image.

Feature vectors are generated for the test image and all training images, cosine similarity is computed between them, and based on the top K scores the corresponding top K images are returned as recommendations.

# defining ResNet and VGG pre-trained models
from tensorflow.keras.applications import ResNet50, VGG16
from tensorflow.keras.layers import GlobalMaxPooling2D
from tensorflow.keras.models import Model

def resNetModel(height, width):
    model = ResNet50(weights='imagenet', include_top=False, input_shape=(height, width, 3))
    model.trainable = False
    # Global max pooling collapses the convolutional output into a flat feature vector
    output = GlobalMaxPooling2D()(model.output)
    model = Model(inputs=model.input, outputs=output)
    model.summary()
    return model

def vggModel(height, width):
    model = VGG16(weights='imagenet', include_top=False, input_shape=(height, width, 3))
    model.trainable = False
    output = GlobalMaxPooling2D()(model.output)
    model = Model(inputs=model.input, outputs=output)
    model.summary()
    return model
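A sketch of how these extractors might then be used for retrieval; the data loading and the train_images array are assumptions:

import numpy as np
from tensorflow.keras.applications.resnet50 import preprocess_input
from sklearn.metrics.pairwise import cosine_similarity

model = resNetModel(224, 224)

# train_images: an (N, 224, 224, 3) array of training images, assumed loaded.
train_features = model.predict(preprocess_input(train_images.astype('float32')))

def top_k_similar(test_image, k=10):
    # test_image: a single (224, 224, 3) image array
    feat = model.predict(preprocess_input(test_image[np.newaxis].astype('float32')))
    scores = cosine_similarity(feat, train_features)[0]
    return np.argsort(scores)[::-1][:k]  # indices of the k most similar images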

Performance Evaluation of Pre-trained models

On testing the pre-trained models on a test set of about 1200 images, we obtained the following results.

We used precision and recall as evaluation metrics, and based on the above results we concluded that VGG and ResNet were the best-performing models on our data-set.

Weighted Ensemble Technique

Since VGG and ResNet were the best-performing models in our evaluation, we decided to work further with them to get better results.

To create a rich feature representation combining both VGG and ResNet, we used a weighted average of the two feature sets to obtain the final feature vector. Since ResNet's feature vector is larger than VGG's, we first used scikit-learn's SelectKBest feature reduction, which selects the top K features from a set of features based on the target values.
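This step might look roughly as follows; the array names, the f_classif scoring function, and the 0.5 weight are our assumptions, not the article's exact choices:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# vgg_feats: (N, d_vgg) and resnet_feats: (N, d_resnet) training features,
# with d_resnet > d_vgg; labels holds the 143 class labels (assumed names).
selector = SelectKBest(f_classif, k=vgg_feats.shape[1])
resnet_reduced = selector.fit_transform(resnet_feats, labels)

w = 0.5  # illustrative weight ratio; the article experimented with several
ensemble_feats = w * vgg_feats + (1 - w) * resnet_reduced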

Weighted Ensemble Technique flowchart

After generating the weighted feature representation, we used it to compute cosine similarity between test and train images and returned the top K recommendations based on the top K scores.

We compared the standalone VGG and ResNet models' performance with our ensemble model, and the ensemble technique performed somewhat better than the two standalone models.

Ensemble model vs stand-alone VGG and ResNet comparison

As we saw an accuracy boost with this method, we tried the technique with different weight ratios; here are the results.

After trying multiple weight combinations, we did not observe much change in accuracy, so we decided to move on to a newer technique.

CNN Classification Based Retrieval technique (CCBR)

Having experimented with many feature representations from pre-trained models, we decided to test a technique quite different from the previous ones: first classify the input image, then generate recommendations based on the predicted class.

Training CNN model

Encouraged by our success with the pre-trained models, we decided to train our own CNN classifier on the data-set and use it both for feature representation and for predicting the class of the test image. Since we already have the classes, or true labels, of all the images, we trained our CNN model on this data-set as an image classifier. We split the 44K images in an 80:20 ratio, giving about 9000 images in the test set and the rest in training.
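A sketch of the split, assuming the images sit in X and the integer labels in y; the stratify option is our addition to keep the 143 classes balanced across the split:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)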

# CNN model's architecture
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Dropout, Flatten, Dense

model = Sequential()
model.add(Conv2D(filters=128, kernel_size=(3, 3), input_shape=X.shape[1:], padding='same', activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(filters=128, kernel_size=(3, 3), padding='same', activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(filters=256, kernel_size=(3, 3), padding='same', activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(filters=256, kernel_size=(3, 3), padding='same', activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(output_size, activation='softmax'))  # output_size = number of classes
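The article does not show the compilation and training call; a minimal sketch, assuming integer class labels and the common Adam optimizer:

from tensorflow.keras.callbacks import EarlyStopping

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # integer class labels
              metrics=['accuracy'])
model.fit(X_train, y_train,
          validation_data=(X_test, y_test),
          epochs=20, batch_size=64,  # illustrative values
          callbacks=[EarlyStopping(patience=3, restore_best_weights=True)])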

Classifying input image

Before using the last layer's output as a feature representation, we use the model to classify the test image and get its class. This class tells us which training images to use when computing cosine similarity.

Generating Feature representation

For every test image, its class is predicted using the classifier. Then the final softmax layer of the model is removed and the output of the preceding layer is used as the feature vector. This is done for all test and train images.
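Putting the two steps together, a sketch of the CCBR pipeline; variable names are assumptions, and note that at inference time the final Dropout layer is a no-op, so taking the output just below the softmax layer gives the Dense(256) activations:

import numpy as np
from tensorflow.keras.models import Model
from sklearn.metrics.pairwise import cosine_similarity

# Feature extractor: the trained CNN with its softmax layer removed.
extractor = Model(inputs=model.input, outputs=model.layers[-2].output)

# train_images / train_labels assumed loaded; features computed once.
train_feats = extractor.predict(train_images)

def ccbr_recommend(test_image, k=10):
    x = test_image[np.newaxis]
    predicted_class = np.argmax(model.predict(x))       # step 1: classify
    idx = np.where(train_labels == predicted_class)[0]  # restrict to that class
    test_feat = extractor.predict(x)                    # step 2: extract features
    scores = cosine_similarity(test_feat, train_feats[idx])[0]
    return idx[np.argsort(scores)[::-1][:k]]            # top K image indices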