A Model with an ‘Eye’ for Fashion

Source: Deep Learning on Medium

A project by Anisha Alluru, Elizabeth Reid Heath, Manas Rai, Ravikiran Bobba and Vishal Ramachandran


This blog post walks you through our implementation of instance segmentation on images to identify different fashion objects in any frame. The challenges and learning involved in image instance segmentation, and the business implications of this project in the rapidly evolving fashion industry, motivated us to take it on as part of the Advanced Predictive Modelling course at the University of Texas at Austin. Please see this GitHub Repository for our implementation code.


“Playing dress-up begins at the age of 5 and truly never ends”. People are always on the lookout for the latest fashion trends so they can look their best. In this ever-changing world that we live in, where celebrities and Instagram influencers have a huge impact on the way we live our lives, it is nearly impossible to keep up with the latest fashion trends. Not only is it challenging for consumers, but also for designers and manufacturers, who might end up losing their market upon failing to identify these trends before the everyday consumer. This opens the opportunity to leverage computer vision algorithms on large datasets to achieve knowledge and expertise unattainable by individual fashion experts.

A Kaggle competition based on the same problem challenged us to develop a model that takes an important step towards automatic product detection: accurately assigning segmentation masks and attribute labels to a given fashion image, localizing the pixels where each object is present.

Through this project, we developed a deep learning model that can analyze images, videos, and eventually real-time footage to identify what a person is wearing and pick out each element of the outfit.


This dataset, taken from the Kaggle competition iMaterialist 2019, is a blend of snaps from daily life, celebrity events, and online shopping. The training set has 45,625 images while the test set has 3,200.

To capture the intricate structure of fashion objects and the ambiguity of descriptions obtained from crawling the web, the taxonomy was built with 27 main apparel objects (jackets, dresses, skirts, etc.), 19 apparel parts (sleeves, collars, etc.), and 92 fine-grained attributes. An initial exploration showed that about 40k images had over 4 apparel objects each, and that necklines, sleeves and shoes were the most represented.


A mask is an overlay on top of an image that localizes the position of each object in it. The mask annotations were in run-length encoding (RLE) format, a lossless compression scheme in which the 2-D overlay array is flattened into a 1-D list of runs from which the original mask can be perfectly reproduced.

Sample training image and mask labels after RLE decoding
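As an illustration, RLE decoding can be sketched in a few lines of numpy. The exact pixel ordering and index base are competition-specific, so treat those details as assumptions rather than a definitive parser:

```python
import numpy as np

def rle_decode(rle, height, width):
    """Decode a run-length-encoded mask string into a 2-D binary mask.

    `rle` is assumed to be a space-separated list of (start, run-length)
    pairs with 0-based starts into the flattened image; iMaterialist
    stores pixels in column-major order, hence the order="F" reshape.
    """
    flat = np.zeros(height * width, dtype=np.uint8)
    tokens = [int(t) for t in rle.split()]
    for start, length in zip(tokens[0::2], tokens[1::2]):
        flat[start:start + length] = 1
    return flat.reshape((height, width), order="F")
```

Encoding is the inverse walk: scan the flattened mask and emit a (start, length) pair for each contiguous run of ones.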


Following were the steps involved in data pre-processing:

  1. The data was transformed from being unique at an image and label (object instance) level to the image level by coalescing the labels.
  2. The images were resized to 512×512 as a trade-off between reducing computational time and retaining information from the original image.
  3. The label masks were extracted by decoding the run-length encoding into 2-D pixel arrays.
  4. Computer vision tasks rely on a plethora of data for enhanced performance and generalization. Keeping best augmentation practices in mind, we performed flipping, cropping, sharpening, Gaussian blurring and rotation on the images, preserving most of the information while adding diversity to the data.
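The geometric part of step 4 can be sketched in plain numpy (cropping, sharpening and Gaussian blurring are omitted here for brevity). The key point the sketch illustrates is that the image and its label mask must receive the identical transform so the annotations stay aligned:

```python
import numpy as np

def augment_pair(image, mask, rng):
    """Apply one random horizontal flip and 90-degree rotation to an
    image and its label mask, using the same transform for both.
    A minimal sketch of the geometric augmentations, not our full
    pipeline (which also cropped, sharpened and blurred).
    """
    if rng.random() < 0.5:                      # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    k = int(rng.integers(0, 4))                 # rotate by k * 90 degrees
    return np.rot90(image, k), np.rot90(mask, k)
```

In practice we drove such transforms through an augmentation library rather than hand-rolled numpy, but the alignment requirement is the same either way.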


To tackle real-world problems, CNNs have evolved over time to do more than a plain classification task. Beyond classification, CNN applications have extended to more meticulous computer vision tasks:

Different computer vision tasks
  • Semantic Segmentation: Predict the class of each and every pixel of the image
  • Classification and Localization: Predict the class of the (single) object in the image and locate it with a bounding box
  • Object Detection: Detect the objects in the image and localize each of them with a bounding box
  • Instance Segmentation: A combination of object detection and semantic segmentation; identifies the objects present in the image and segments them at the pixel level

Justin Johnson’s lecture on detection and segmentation talks about this in detail and explains different convolutional network architectures used for each of these applications.

For our project we chose Mask R-CNN, an instance segmentation framework. The article “Computer Vision — A journey from CNN to Mask R-CNN and YOLO -Part” gives a detailed explanation of the evolution of Mask R-CNN.

Mask R-CNN

Mask R-CNN Architecture

The block diagram above represents the Mask R-CNN architecture. A brief description of each of the steps is given below:

  1. The image is passed through a convolutional network for processing.
  2. The output of the first conv net is passed to a Region Proposal Network (RPN), which creates anchor boxes (Regions of Interest) based on the presence of the objects to be detected.
  3. The anchor boxes are sent to the RoIAlign stage (one of the key features of Mask R-CNN, preserving spatial information), which converts the ROIs to the uniform size required for further processing.
  4. This output is sent to fully connected layers, which generate the class of the object in that specific region and the location of its bounding box.
  5. The output of the RoIAlign stage is sent in parallel to conv nets in order to generate a pixel-level mask of the object.

Mask R-CNN Loss Function

Mask R-CNN uses a composite loss function calculated as the weighted sum of the losses at each stage of the model. Understanding these loss weights helped us fine-tune our hyperparameters efficiently. Below is an overview of the loss weights:

  • rpn_class_loss: This loss is assigned to handle improper classification of anchor boxes (presence/absence of any object) by Region Proposal Network, and should be increased when the model fails to capture an object.
  • rpn_bbox_loss: This corresponds to the localization accuracy of RPN. This should be tweaked if the object is detected, but the bounding box is incorrect.
  • mrcnn_class_loss: This loss is tailored to tackle improper classification of objects present in the region proposal, and should be tuned when the object is detected from the image, but misclassified.
  • mrcnn_bbox_loss: This loss penalizes inaccurate localization of the bounding box of the identified class, and should be increased if classification of the object is correct, but localization is not precise.
  • mrcnn_mask_loss: This corresponds to masks created on the identified objects. If identification at pixel level is of importance, this weight is to be increased.
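Concretely, the Matterport implementation exposes these weights as a `LOSS_WEIGHTS` dictionary on its `Config` class. The sketch below shows the idea with illustrative values, not our final tuned settings:

```python
# Illustrative loss weights in the style of the Matterport Config class;
# the names match the losses described above, but these particular
# values are examples, not our tuned configuration.
LOSS_WEIGHTS = {
    "rpn_class_loss": 1.0,
    "rpn_bbox_loss": 1.0,
    "mrcnn_class_loss": 2.0,   # raised to penalize misclassification more
    "mrcnn_bbox_loss": 1.0,
    "mrcnn_mask_loss": 1.0,
}

def total_loss(stage_losses, weights=LOSS_WEIGHTS):
    """Weighted sum of the per-stage losses, mirroring Mask R-CNN's
    multi-task objective."""
    return sum(weights[name] * value for name, value in stage_losses.items())
```

Raising one weight steers gradient descent to spend more of its effort on that stage, which is exactly the lever we pulled during tuning.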

Please refer to our comprehensive article that talks about various hyper-parameters in Mask R-CNN along with approaches for tuning based on our learnings from this project.


Our evaluation metric was mean Average Precision (mAP) over a range of IoU thresholds. IoU measures the overlap between the actual and predicted regions and equals the area of overlap divided by the area of union. The mAP score is then calculated for every ClassId, sweeping over IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05, based on the following counts:


  • TP = a predicted object matches a ground truth with an IoU above the threshold t
  • FP = a predicted object has no associated ground truth (i.e. predicting something that is not there)
  • FN = a ground truth has no associated prediction (i.e. not predicting something that is there)
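Under these definitions, the pixel-level IoU and the threshold sweep can be sketched as below. We assume the usual Kaggle convention that a matched pair whose IoU falls below a threshold counts as one FP and one FN at that threshold:

```python
import numpy as np

def mask_iou(pred, truth):
    """IoU of two binary masks: area of overlap over area of union."""
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return inter / union if union else 0.0

def sweep_precision(matched_ious, unmatched_pred=0, unmatched_truth=0):
    """Average precision over IoU thresholds 0.5 to 0.95 (step 0.05).

    `matched_ious` holds the IoU of each matched prediction/ground-truth
    pair; a pair below a threshold counts as one FP and one FN there.
    """
    thresholds = np.round(np.arange(0.5, 1.0, 0.05), 2)
    precisions = []
    for t in thresholds:
        tp = sum(iou > t for iou in matched_ious)
        misses = len(matched_ious) - tp
        fp = unmatched_pred + misses
        fn = unmatched_truth + misses
        precisions.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
    return float(np.mean(precisions))
```

A perfect mask (IoU 1.0) scores 1.0 across the sweep, while a mask with IoU just under 0.5 scores 0, which is why the metric rewards tight segmentation rather than rough localization.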


Phase 1: The Beginning and Approach

As Andrew Ng says, build quickly, then iterate; so we started with a randomly sampled dataset of 5,000 images to build the basic blocks of the model and tune the parameters.

We used Matterport's implementation of Mask R-CNN because of its comprehensive documentation and open-source user support. Please refer to their GitHub Repository to learn more about their implementation.

Leveraging transfer learning with open-source weights pre-trained on the COCO dataset, we kick-started our model development by updating and training the output layers alone. The test mAP score stagnated at 0.04, nowhere close to the average scores we observed on Kaggle. The pre-trained COCO weights did not work well on their own owing to the nature of our training data.

Phase 2: Training all layers and hyperparameter tuning

We switched to training all the layers using our data subset, which leads us to the most important part of any deep learning project.

Hyperparameter tuning to the rescue!

If the going gets tough, tougher parameters get going (pun intended). After initially wandering through heuristic tuning approaches, we badly wanted an RMSprop (pun intended again) in our lives to give better direction to our model tuning. Below is the list of parameters specific to Mask R-CNN that we tuned:

  • Loss Weights — The multi-task loss function of Mask R-CNN combines the loss of classification, localization and segmentation mask. Through the course of this project, we learnt to appreciate the impact of tuning them along with the learning rate update rules. Setting a high penalty for MRCNN class losses helped us attain higher prediction accuracy.
  • Gradient Clip Norm — It constrains the gradient values (element-wise) to a stipulated range. If we have a “bad minibatch” that would cause gradients to explode, the clipping prevents that iteration from affecting the entire model. Unfortunately, it didn’t help our cause much.
  • Detection Min Confidence — This defines the probability threshold over which the model affirms the presence of an object. We tried values between 0.5–0.8 and got the best results at the default value of 0.7.
  • Detection Max Instances — Limiting the number of instances that the algorithm predicts for an image is also a deciding factor, as it defines the trade-off between demanding too much and too little predictive power from the algorithm. Based on our problem, we set it to 50.
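The last two parameters amount to a post-filter over the detector's scored outputs. A simplified, hypothetical stand-in (the real filtering lives inside the Matterport detection layer):

```python
import numpy as np

def filter_detections(scores, min_confidence=0.7, max_instances=50):
    """Keep indices of detections scoring at least `min_confidence`,
    capped at the `max_instances` highest-scoring ones. A simplified
    sketch of what DETECTION_MIN_CONFIDENCE and DETECTION_MAX_INSTANCES
    control in the Matterport config."""
    keep = np.where(scores >= min_confidence)[0]
    order = np.argsort(scores[keep])[::-1]      # highest score first
    return keep[order][:max_instances]
```

Lowering the confidence threshold surfaces more tentative objects at the cost of false positives; the instance cap bounds how many survive per image.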

From the dozens of parameter combinations examined, we scaled the optimal one to train for 20 epochs on 15,000 images. Fusing this with extensive image augmentation and mindful learning-rate adjustment for faster convergence led to our best outcome, a test score of 0.082. Below is the configuration that gave us this result.

Configuration of the tuned model

Phase 3: Chasing Error

After getting a good prediction model which captured dominant labels such as sleeves, t-shirts, pants and shoes accurately, we decided to manually examine mistakes that our algorithm was making. This gave us insights into what to do next.

Enter error analysis!

We observed that our model was not doing a great job of capturing the classes with relatively less training data: the standard imbalanced-class problem. The classes we were missing were the underrepresented ones, like zippers, skirts, bags and belts.

Sensitivity chart based on classification results of the model — sensitivity decreases as the number of training data decreases

The miss 🙁

To improve the model's predictions for low-accuracy labels, we first trained a separate model on these class labels and took its output as the final prediction for those classes, but unfortunately this did not improve the predictions.

The hit 🙂

In the other approach, we trained our tuned model on a new sample with increased representation of underrepresented labels such as zippers, skirts and bags. After training for 10 more epochs, the improvement in the overall prediction score was small, but the improvement in the classification of the underrepresented classes was significant. The pre- and post-training sensitivity analysis in the picture below depicts this story.
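The resampling itself can be as simple as repeating images that contain rare labels when building the training list. A sketch, where the repeat factor and the choice of "rare" classes are illustrative assumptions rather than our exact recipe:

```python
import random

def oversample(image_ids, labels_by_image, rare_classes, factor=3):
    """Build a resampled training list in which every image containing
    at least one rare class appears `factor` times. A simple
    oversampling sketch; the factor of 3 is illustrative."""
    resampled = []
    for img in image_ids:
        copies = factor if labels_by_image[img] & rare_classes else 1
        resampled.extend([img] * copies)
    random.shuffle(resampled)   # avoid runs of duplicates in a batch
    return resampled
```

More elaborate schemes weight the sampling probability by inverse class frequency, but plain repetition was enough to lift our minority-class sensitivity.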


We were able to build a fairly accurate model which can predict the outfit labels in a frame and localize them precisely.

A demo video showcasing our model’s capabilities (Our “model” on the “MODEL”)


Every deep learning project is bound to pose numerous roadblocks. Here were ours:

1. Instance segmentation — Since we had minimal prior experience with deep learning, unravelling the entire concept from scattered learning resources within 3 months was a herculean task. Also, every fragment of code we gathered from GitHub and Kaggle demanded hours of uninterrupted effort to grasp.

2. Google Drive limitations — Matterport's implementation of Mask R-CNN was written for TensorFlow 1.3. Midway through the project, GCP deprecated GPU support for TensorFlow 1.3, and our tight deadlines left no time to port the implementation to TensorFlow 2.0. Turning to Google Colab as the alternative, we uploaded the 19 GB of images to Google Drive, only to hit a time-out error (when retrieving more than 15k images) that kept us from using all the data at once, forcing us to split the input into smaller chunks as a workaround.

3. Predicting the minority classes — Since deep learning models demand a myriad of data to learn, the sparse presence of certain labels was quite a setback until we resampled the data to boost their representation.


We learnt a great deal from overcoming these challenges; here are the major lessons:

  1. Start with a smaller subset: Mask R-CNN is a heavy-weight model designed for accuracy rather than memory efficiency. Training one epoch on 5,000 images took more than an hour on a Tesla K80 GPU. To be faster and more efficient, parameter tuning should be carried out on a smaller subset before scaling the best combination to the larger dataset.
  2. Transfer learning — the right pre-trained weights: Leveraging weights trained on data similar to one's problem makes transfer learning more effective. Since our data had finer-grained fashion attributes, pre-trained COCO weights had limited effect on performance.
  3. Handling prediction for imbalanced classes: Our first approach of building a separate model for underrepresented classes didn't help our cause, but training our tuned model further on oversampled minority-class data improved the overall prediction accuracy. This is more of a trial-and-error process; both approaches are worth trying.
  4. Taming the loss weights: Controlling the parameters of the optimization function aptly can produce radically better results. Identifying and understanding the proper loss weights had a huge impact on our model performance, and comprehending the mistakes through manual examination of the results is useful. For example, if the model is misclassifying a lot of objects, that translates to a need to increase the weight of mrcnn_class_loss. Similarly, the problem of not capturing all the pixels of an object can be addressed by increasing the weight of mrcnn_mask_loss.
  5. Climb up the competition leaderboard: A good understanding of the competition's evaluation metric can help improve the leaderboard score. In our data, the fine-grained attributes were present in only 3% of the images, too few for the model to detect, and the metric charged both an FP and an FN whenever we failed to predict these attributes along with the main objects. To escape this double penalization, we removed all our predictions containing fine-grained attributes, which boosted our test mAP score from 0.082 to 0.095 in the competition, a lift of ~15%.
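The leaderboard trick in point 5 is pure post-processing on the submission rows. A sketch, assuming the iMaterialist convention that attribute ids are appended to the main ClassId with underscores (treat that format as an assumption):

```python
def drop_attribute_predictions(rows):
    """Remove submission rows whose ClassId carries fine-grained
    attributes, keeping only main-object predictions. Assumes attribute
    ids are appended to the class id with underscores
    (e.g. "32_147_186" for class 32 with two attributes)."""
    return [row for row in rows if "_" not in str(row["ClassId"])]
```

Dropping rows the model almost certainly gets wrong trades a small amount of recall for a larger reduction in the FP + FN penalty, which is where the ~15% lift came from.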


  • Smooth resizing: Regular cv2 image downsizing leads to some information loss, specifically for finer objects. Other techniques like max pooling or average pooling could be applied at the resize stage to preserve more information.
  • Filtering labels using classification algorithms: A VGG-16 or Mobilenet prediction can be leveraged to filter out low confidence Mask R-CNN prediction outputs.
  • Other Architectures: More advanced instance segmentation architectures like “Cascade Mask R-CNN” or “HTC + ResNeXt-101-FPN + DCN” could be tried to enhance prediction accuracy.
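The smooth-resizing idea in the first point can be sketched with average pooling over the mask before any final interpolation. This is a sketch of the proposal, not something we shipped:

```python
import numpy as np

def avg_pool_downscale(arr, k):
    """Downscale a 2-D array by an integer factor k using average
    pooling, which retains partial coverage of thin structures (zippers,
    straps) that nearest-neighbour subsampling can drop entirely."""
    h, w = arr.shape
    arr = arr[:h - h % k, :w - w % k]           # trim to a multiple of k
    return arr.reshape(h // k, k, w // k, k).mean(axis=(1, 3))
```

A one-pixel-wide zipper line survives as fractional values in the pooled output, whereas naive subsampling keeps or loses it entirely depending on alignment.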


  • Apparel Search through a Phone App: Working on the concept behind Google Lens, shoppers can search for products using their phone camera.
  • Fast-fashion Trend Analysis: Retailers can study emerging trends in fashion and host them in their product assortment before anyone else.
  • Product Recommendation Engine: Can help retailers recommend products with attributes similar to the one a shopper is viewing. For example, they can recommend alternatives to out-of-stock products, so customers don't bounce off the website.

In the end, we would like to thank our professor, Dr. Joydeep Ghosh, who guided us through the entire process. We hope this article helps you understand how to approach an image instance segmentation problem. We would love to hear your feedback and suggestions.