Applying Deep-Learning for fashion e-commerce

Source: Deep Learning on Medium

By Amit Peshwani

While learning about unsupervised learning methods, I came across clustering methods such as KMeans and hierarchical clustering, and I wanted to apply them to a real-world problem. E-commerce systems had also been on my mind for a while, and I was curious about how they work. So I decided to work on the problem described in this article.

In this blog, I will share my experiences and learnings from automating the choice of thumbnail image for fashion e-commerce (footwear). It covers the following:

  • How to generate features from image using Deep Learning for clustering process
  • How to find optimal cluster number for KMeans (Elbow Method)
  • Which architecture to choose for feature extraction

Problem Statement :

Imagine working for a fashion e-commerce company and receiving thousands of footwear images every day from vendors. It is important to choose the right image for the thumbnail so that it attracts the user to view or buy the product.

Lace-Up shoes .

The task is to identify the best-view image, which will be used as the thumbnail, from a set of images of a footwear product, and also to classify the type of footwear. In the above image, the image on the right is the best view and would be used as the thumbnail.

Why is this necessary?

  • Each product has multiple views (front, back, left, right) and only a few of them carry useful information about the product, so the best view should be chosen as the thumbnail image. Selecting it automatically replaces a step that is currently done manually.
  • Choosing the best view also helps when training a classifier to identify the type of footwear. The classifier then learns features only from the view that carries the most information, such as the front or side view. If you feed the classifier a view that shows the sole of the footwear, then it will not learn the features that matter most for classifying the type. For lace-up shoes, for example, the view that shows the laces is the most useful one for the classifier, not the view of the sole.


The data was provided by Fynd, a fashion e-commerce company. It came as a CSV file with a 'class' column and image URLs for the different views (5 views per product). I downloaded the images from the URLs and saved them in the following format: footweartype_id_viewid.jpg. Note that the view number does not tell you the angle: view_1 is not always the front view of the footwear.

There are a total of 6 classes of footwear in the dataset.
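The footweartype_id_viewid.jpg naming scheme can be handled with a small helper. This is only a minimal sketch: the names `build_name`, `parse_name` and `FootwearImage`, and the example values, are assumptions made for illustration and are not part of the original code.

```python
from typing import NamedTuple


class FootwearImage(NamedTuple):
    footwear_type: str
    product_id: str
    view_id: str


def build_name(footwear_type: str, product_id: str, view_id: str) -> str:
    """Build a file name in the footweartype_id_viewid.jpg format."""
    return f"{footwear_type}_{product_id}_{view_id}.jpg"


def parse_name(filename: str) -> FootwearImage:
    """Split a file name back into its three parts.

    rsplit is used so that a footwear type containing underscores
    (for example "hook_loop") is kept intact.
    """
    stem = filename.rsplit(".", 1)[0]
    footwear_type, product_id, view_id = stem.rsplit("_", 2)
    return FootwearImage(footwear_type, product_id, view_id)


# Hypothetical example values
name = build_name("hook_loop", "1042", "3")
print(name)
print(parse_name(name))
```

Keeping the type, product id and view id recoverable from the file name makes it easy to group all views of one product later on.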

Image Courtesy : Fynd . Types of Footwear


The data does not say which view was chosen as the thumbnail image. We are only given the different views of each footwear product and its type. So how do we train a classifier if the data is not labelled with the best view? This kind of problem belongs to unsupervised learning, and there we can use methods like clustering to label the data.

Before jumping to any unsupervised method, it helps to generalize the problem so that the approach stays simple. After observing the dataset, I noticed the following:

  • For lace-up, slip-on, buckle, hook & loop and backstrap footwear, the front or side view is the most important view.
  • For the zipper type, the view that shows the zip is the most important view.

These observations generalize our problem. Now let's move on to unsupervised methods to identify the above views for each type of footwear.

How to use clustering ?

Now the task is to cluster images with the same view together. I decided to use KMeans for clustering. To use a clustering algorithm like KMeans, we have to pass it features of the images. I first tried features like shape context, GIST, HOG and heuristics, and fed them to the clustering algorithm. After visualizing the clusters for every type of footwear, I found some bad images that did not belong in their clusters.

After all this, I decided to use deep learning methods for clustering. What? How?

The basic idea is to pass the images through a pretrained network with its top layer removed, and then pass the output of the last remaining layer to the clustering algorithm.

The benefit of using pretrained deep learning networks is that the abstractions learned in their layers capture shapes, patterns and so on, so we don't have to extract features from the images manually. It's like magic.

There are different pretrained models such as VGG16, VGG19 and ResNet50. To decide which architecture to choose, we can use the Silhouette Score to see which one gives the best clusters.

Silhouette Score measures how close each point in a cluster is to the points of its own cluster compared with the points in its neighboring cluster. It's a neat way to find the optimum value for k during k-means clustering. Silhouette values lie in the range [-1, 1]. A value of +1 indicates that the sample is far away from its neighboring cluster and very close to the cluster it is assigned to. Similarly, a value of -1 indicates that the point is closer to its neighboring cluster than to the cluster it is assigned to. A value of 0 means it is on the boundary between the two clusters. A value of +1 is ideal and -1 is least preferred. Hence, the higher the value, the better the cluster configuration.

In the above graph, as the number of clusters increased, ResNet50 performed better than VGG16, probably because ResNet50 is deeper and captures more information.
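To make the definition concrete, here is a small from-scratch sketch of the silhouette value s = (b - a) / max(a, b), where a is the mean distance from a point to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster. The function name `silhouette_values` and the toy points are assumptions for illustration; in practice a library routine such as scikit-learn's `silhouette_score` would be used.

```python
import math


def silhouette_values(points, labels):
    """Return the silhouette value s = (b - a) / max(a, b) for each point.

    a: mean distance to the other points in the same cluster.
    b: smallest mean distance to the points of any other cluster.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    values = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a point alone in its cluster gets a value of 0
            values.append(0.0)
            continue
        # The distance of the point to itself is 0, so dividing by
        # len(own) - 1 gives the mean over the other members.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        values.append((b - a) / max(a, b))
    return values


# Two tight, well-separated groups of 2-D points (toy data)
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_values(points, [0, 0, 1, 1])
bad = silhouette_values(points, [0, 1, 0, 1])

print(sum(good) / len(good))  # close to +1: compact, well-separated clusters
print(sum(bad) / len(bad))    # negative: points sit closer to the other cluster
```

The same four points score very differently under the two labelings, which is exactly what makes the score useful for comparing cluster configurations.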

Visualization of clusters

The features from ResNet are passed to the clustering algorithm to find clusters of images with the same view.

Same Views of Hook Type Closure
Same views of Laceup Closure
Same views of Backstrap closures
# Imports assume Keras 2.x and scikit-learn; module paths can differ between versions
from os import listdir

import numpy as np
from keras.applications.resnet50 import ResNet50, preprocess_input
from keras.preprocessing import image
from sklearn import metrics
from sklearn.cluster import KMeans

# Pretrained ResNet50 without its top (classification) layer
model = ResNet50(weights='imagenet', include_top=False)

# Load the images of one footwear type and return the features from the last layer
def get_vec_footwear(footwear_dir):
    resnet50_feature_list = []
    filenames = listdir(footwear_dir)
    for i, fname in enumerate(filenames):
        try:
            img = image.load_img(footwear_dir + '/' + fname, target_size=(224, 224))
            img_data = image.img_to_array(img)
            img_data = np.expand_dims(img_data, axis=0)
            img_data = preprocess_input(img_data)
            resnet50_feature = model.predict(img_data)
            resnet50_feature_np = np.array(resnet50_feature)
            # Flatten to one vector per image so KMeans gets a 2-D array
            resnet50_feature_list.append(resnet50_feature_np.flatten())
        except IOError:
            # Skip files that cannot be opened as images
            continue

    resnet50_feature_list_np = np.array(resnet50_feature_list)
    return resnet50_feature_list_np

# Feature vectors from the function above are passed to this function to get clusters
def get_clusters(a, b, resnet50_feature_list_np):
    silloute_score = []
    objects = []
    cluster_errors = []
    for i in range(a, b):
        # n_jobs is omitted: it was removed from KMeans in scikit-learn 1.0
        kmeans = KMeans(n_clusters=i, random_state=0).fit(resnet50_feature_list_np)
        silloute_score.append(metrics.silhouette_score(resnet50_feature_list_np, kmeans.labels_, metric='euclidean'))
        cluster_errors.append(kmeans.inertia_)

    return silloute_score, objects, cluster_errors

resnet50_feature_list_np = get_vec_footwear('lace_data_rgb')

silloute_score, objects, cluster_errors = get_clusters(2, 20, resnet50_feature_list_np)

In the above code, we pass the name of a directory to the get_vec_footwear function, which returns the feature vectors of the images in it. These feature vectors are then passed to the get_clusters function, which feeds them to the clustering algorithm.
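Since get_clusters returns one silhouette score per value of k in range(a, b), the k with the highest score can be read off with a few lines. The helper name `best_k_by_silhouette` is hypothetical and the scores below are made-up placeholders, not results from this article.

```python
def best_k_by_silhouette(a, b, scores):
    """Return (k, score) with the highest silhouette score.

    scores[i] is assumed to belong to k = a + i, matching range(a, b).
    """
    ks = list(range(a, b))
    if len(scores) != len(ks):
        raise ValueError("expected one score per k in range(a, b)")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return ks[best_index], scores[best_index]


# Made-up scores for k = 2..6, for illustration only
scores = [0.21, 0.34, 0.41, 0.38, 0.30]
print(best_k_by_silhouette(2, 7, scores))  # (4, 0.41)
```

With the call from the code above, this would be used as best_k_by_silhouette(2, 20, silloute_score).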

You can find further implementation here :

For each type of footwear, the optimal number of clusters was decided on the basis of elbow analysis of the Silhouette Score.

The idea of the elbow method is to run k-means clustering on the dataset for a range of values of k (k = number of clusters), and for each value of k calculate the sum of squared errors (SSE). Then, plot a line chart of the SSE for each value of k. If the line chart looks like an arm, then the "elbow" on the arm is the best value of k.

In the above graph, the optimal value of k is 5. Beyond k = 5, SSE keeps decreasing toward 0 as we increase k, but only slowly. So the goal is to choose a small value of k that still has a low SSE; the elbow usually marks the point where increasing k starts to give diminishing returns.
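The elbow is usually read off the plot by eye, but it can also be approximated in code. The sketch below uses one simple heuristic, the largest discrete second difference of the SSE curve, which is an assumption of this example and not necessarily how the value in this article was chosen. The function name `find_elbow` is hypothetical, the SSE numbers are made up for illustration, and the values of k are assumed to be evenly spaced.

```python
def find_elbow(ks, sse):
    """Return the k where the SSE curve bends most sharply.

    The bend is measured with the discrete second difference
    sse[i-1] - 2*sse[i] + sse[i+1]; the largest value marks the point
    where the drop in SSE slows down the most.
    """
    if len(ks) != len(sse) or len(ks) < 3:
        raise ValueError("need at least three (k, SSE) pairs of equal length")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(sse) - 1):
        bend = sse[i - 1] - 2 * sse[i] + sse[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Made-up SSE values for illustration only, not results from the article
ks = list(range(2, 10))
sse = [1000, 700, 420, 200, 170, 150, 135, 125]
print(find_elbow(ks, sse))  # 5
```

With the code from this article, cluster_errors returned by get_clusters(2, 20, ...) could be passed as sse together with ks = list(range(2, 20)). A heuristic like this is best treated as a sanity check next to the plot rather than a replacement for it.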

On the basis of clusters, images were separated into

  • Typeoffootwear_bestview
  • Typeoffootwear_nobestview

By following the above methods, I separated images with the same view for every type of footwear and saved them into folders named typeoffootwear_best/nobestviews (laceup_best in the above image). So we now have data labelled with best and non-best views for every type of footwear. Now we can train a classifier that not only identifies the best view but also classifies the type of footwear, turning the problem from an unsupervised into a supervised learning problem.

I trained VGG19 on the above images. Here is an example output of the model:
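The separation step can be sketched as follows. This is a hedged illustration rather than the original implementation: the function `split_by_cluster`, its parameters, the folder suffixes and the demo file names are assumptions, and the set of "best view" cluster ids is taken to be a manual decision made after looking at the clusters.

```python
import shutil
import tempfile
from pathlib import Path


def split_by_cluster(src_dir, filenames, labels, best_clusters, footwear_type, out_dir):
    """Copy each image into <type>_bestview or <type>_nobestview.

    best_clusters is the set of cluster ids that were judged, by looking at
    the clusters, to contain the best view (a manual decision).
    """
    out_dir = Path(out_dir)
    counts = {"bestview": 0, "nobestview": 0}
    for fname, label in zip(filenames, labels):
        kind = "bestview" if label in best_clusters else "nobestview"
        target = out_dir / f"{footwear_type}_{kind}"
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy(Path(src_dir) / fname, target / fname)
        counts[kind] += 1
    return counts


# Tiny demo with empty placeholder files instead of real images
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "lace_data_rgb"
    src.mkdir()
    names = ["laceup_1_1.jpg", "laceup_1_2.jpg", "laceup_2_1.jpg"]
    for n in names:
        (src / n).touch()
    labels = [0, 3, 0]  # hypothetical KMeans labels, one per file
    counts = split_by_cluster(src, names, labels, {0}, "laceup", Path(tmp) / "out")
    print(counts)
```

The labels must be in the same order as the file names. Because get_vec_footwear skips files it cannot read, any skipped file would also have to be dropped from the file name list before pairing it with kmeans.labels_.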

Hook & Loop

Final Words

I used a pretrained ResNet50, chosen on the basis of the Silhouette Score. The output of ResNet was fed directly into the clustering algorithm. The optimal number of clusters was found using elbow analysis. Then I separated the images on the basis of their cluster and labelled each cluster with a particular view. With the view and the type of footwear now labelled, I used a standard VGG architecture for the classification task.

Further improvement: instead of using a pretrained ResNet trained on the ImageNet dataset, we could use a model trained on the Fashion-MNIST dataset for clustering, because it may capture more information about the shape and texture of footwear.

Hopefully, this article has provided an overview of my implementation for solving this particular problem. Feel free to correct my implementation.

Thanks for reading,

Amit.