Complete Pipeline for Bounding Box Detection for fashion material (YOLOv3)

Original article was published on Artificial Intelligence on Medium

Hello folks, I recently worked on bounding box detection for fashion material and learned a lot. Most of this material is available online, but I am trying to bring it all together in one place. I will also share what I learned and take you through the pipeline.

The lower the bias and variance in your dataset, the higher your model's accuracy.


  1. DeepFashion1 is a large-scale clothing dataset. It contains around 800K diverse images of 46 popular clothing categories.
  2. DeepFashion2 is a comprehensive fashion dataset. It contains 491K diverse images of 13 popular clothing categories.


The two datasets are organized differently. In DeepFashion1, each image has a single bounding box and all annotations are stored in one text file. In DeepFashion2, each image has one or more bounding boxes and its own annotation file in JSON format.

import pandas as pd

Everyone is familiar with this import. I combined the two datasets into a single dataframe. Each class is assigned an Id. The mode column records whether an image belongs to the training, test, or validation data, and the source column records whether it comes from DeepFashion1 or DeepFashion2.
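To make the combined format concrete, here is a minimal sketch of how both annotation styles could be flattened into one list of rows. The column names other than class Id, mode, and source are my own assumptions, and so are the input layouts: a DeepFashion1 line of the form `image x1 y1 x2 y2`, and a DeepFashion2 JSON file whose `item1`, `item2`, ... keys each hold a `category_name` and a `bounding_box`. Check them against your copy of the datasets.

```python
import json

# Hypothetical column layout for the combined table.
COLUMNS = ["image", "x1", "y1", "x2", "y2", "class_id", "mode", "source"]


def rows_from_deepfashion1(bbox_lines, class_id_by_image, mode_by_image):
    """DeepFashion1: one box per image, all boxes listed in one text file.

    Each line is assumed to look like 'img/a.jpg 72 79 232 273'.
    """
    rows = []
    for line in bbox_lines:
        name, x1, y1, x2, y2 = line.split()
        rows.append({
            "image": name,
            "x1": int(x1), "y1": int(y1), "x2": int(x2), "y2": int(y2),
            "class_id": class_id_by_image[name],
            "mode": mode_by_image[name],
            "source": "deepfashion1",
        })
    return rows


def rows_from_deepfashion2(image_name, annotation_json, class_ids, mode):
    """DeepFashion2: one JSON file per image, one or more boxes per image."""
    annotation = json.loads(annotation_json)
    rows = []
    for key, item in annotation.items():
        if not key.startswith("item"):
            continue  # skip image-level fields that are not garments
        x1, y1, x2, y2 = item["bounding_box"]
        rows.append({
            "image": image_name,
            "x1": x1, "y1": y1, "x2": x2, "y2": y2,
            # Map the category name onto the shared class Id.
            "class_id": class_ids[item["category_name"]],
            "mode": mode,
            "source": "deepfashion2",
        })
    return rows
```

Passing the concatenated rows to `pd.DataFrame(rows, columns=COLUMNS)` gives one dataframe with one row per bounding box.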

I’m not sharing my code because the idea behind this blog is to walk you through the pipeline. Still, if you run into problems while coding, just let me know and I will help you out.

Handling such a big dataset is hectic and time-consuming, and building the dataframe takes a long time. Dask, an open-source Python library for parallel computing, can speed this up.

Tfrecord: The tfrecord format is a simple format for storing your data as a sequence of binary records. You should have a basic idea of how to create a tfrecord. In our dataframe, several annotation rows can belong to the same image, so our task is to group them together using the Python collections library and then create one tf example per image.
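The grouping step can be sketched with `collections.defaultdict`. This is only an illustration: the row keys (`image`, `x1`, `class_id`, and so on) are hypothetical names for the dataframe columns, and the actual `tf.train.Example` construction is left out.

```python
from collections import defaultdict


def group_by_image(rows):
    """Collect all annotation rows that belong to the same image."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["image"]].append(row)
    return dict(grouped)


def to_example_fields(image, boxes):
    """Flatten one image's boxes into parallel lists, one entry per box.

    This is the shape that usually gets written into a tf example:
    one list per coordinate plus one list of labels.
    """
    return {
        "image": image,
        "xmin": [b["x1"] for b in boxes],
        "ymin": [b["y1"] for b in boxes],
        "xmax": [b["x2"] for b in boxes],
        "ymax": [b["y2"] for b in boxes],
        "label": [b["class_id"] for b in boxes],
    }
```

Each value returned by `group_by_image` is the list of boxes for one image, so `to_example_fields(image, boxes)` can be called once per image to prepare the fields of one tf example.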

Because our dataset is large, the tfrecord creation script will take time. We can use Dask for parallel computing, and we can also shard the tfrecord, i.e. split it into several files. Sharding of tfrecord files is best explained, with code, in this blog.
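As a rough sketch of what sharding means in practice, the helpers below spread the grouped images over a fixed number of output files. The file naming pattern and the round-robin assignment are my own assumptions, not a requirement of the tfrecord format.

```python
def shard_path(prefix, index, num_shards):
    """Build a shard file name such as 'train-00000-of-00004.tfrecord'."""
    return f"{prefix}-{index:05d}-of-{num_shards:05d}.tfrecord"


def assign_shards(items, num_shards):
    """Distribute items over shards round-robin so shard sizes stay balanced."""
    shards = [[] for _ in range(num_shards)]
    for position, item in enumerate(items):
        shards[position % num_shards].append(item)
    return shards
```

Each inner list can then be written to its own file by a separate worker, which is what makes sharding a natural fit for parallel tfrecord creation.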


YOLO9000 can detect more than 9,000 object categories using the improved YOLOv2 model. At 67 frames per second, YOLOv2 scored 76.8 mAP (mean average precision) on the PASCAL Visual Object Classes challenge (VOC 2007), beating methods such as Faster R-CNN.

Loss function: Sum-squared error

Classification loss: Cross-entropy
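For readers who have not met these two terms before, here is a plain Python sketch of each on a single example. It only shows the basic formulas. How YOLO weights the terms and sums them over grid cells and anchors is not covered, and the function names are my own.

```python
import math


def sum_squared_error(predicted, target):
    """Sum of squared differences, e.g. between predicted and true box values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def cross_entropy(probabilities, true_index, eps=1e-12):
    """Negative log of the probability the model gave to the true class."""
    return -math.log(max(probabilities[true_index], eps))
```

Both values shrink toward zero as the prediction approaches the target, which is what training tries to achieve.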

You will find the complete architecture in this git repository. Adapt it to your tfrecord files. The hyperparameters will also need fine-tuning.

We can train our network from scratch or we can use pre-trained weights available online. Use “eager_fit” mode while training because eager mode is great for debugging.


Evaluation metrics measure how well your model performs. Many different metrics are available, and it is important to use more than one, because a model may perform well on one evaluation metric but poorly on another.

I used mean average precision (mAP) as the evaluation metric. This blog explains mean average precision very nicely. Have a look.

Mean average precision formula

$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i$$

where AP_i is the average precision (AP) of class i and N is the number of classes in your dataset.
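The sketch below shows one common way to compute AP for a class and then average over classes. It assumes the detections of a class are already sorted by descending confidence and already marked as true or false positives. It uses the area under the interpolated precision-recall curve, which may differ in detail from what your evaluation tool does.

```python
def average_precision(is_true_positive, num_ground_truth):
    """Area under the interpolated precision-recall curve for one class.

    is_true_positive: one bool per detection, sorted by descending confidence.
    num_ground_truth: number of ground-truth boxes of this class.
    """
    if num_ground_truth == 0:
        return 0.0
    precisions, recalls = [], []
    true_positives = 0
    for rank, hit in enumerate(is_true_positive, start=1):
        if hit:
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)
    # Interpolate: each precision becomes the best precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the curve wherever recall increases.
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(ap_by_class):
    """Average the per-class AP values over the N classes."""
    return sum(ap_by_class.values()) / len(ap_by_class)
```

For example, `mean_average_precision({"vest": 0.5, "skirt": 1.0})` averages the two class scores with N = 2.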

Write a Python script that uses the trained model to predict bounding boxes for the test dataset and saves them to one folder, and save the ground truth to another folder. Clone this git repository, then feed your predictions and ground truth to its Python script. It will print the mAP value in your terminal.
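A minimal sketch of the saving step could look like this. The file layout is an assumption on my part: one text file per image, one box per line, with the class label first and a confidence score added for predictions. Check the README of the evaluation repository you use and adjust the format to what it expects.

```python
from pathlib import Path


def write_boxes(folder, image_id, boxes, with_confidence):
    """Write one text file per image, one box per line (hypothetical format).

    Prediction line:   label score x1 y1 x2 y2
    Ground-truth line: label x1 y1 x2 y2
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    lines = []
    for box in boxes:
        # Fields are space-separated, so spaces inside labels are replaced.
        fields = [box["label"].replace(" ", "_")]
        if with_confidence:
            fields.append(f"{box['score']:.6f}")
        fields += [str(box[key]) for key in ("x1", "y1", "x2", "y2")]
        lines.append(" ".join(fields))
    path = folder / f"{image_id}.txt"
    path.write_text("".join(line + "\n" for line in lines))
    return path
```

Calling it once with `with_confidence=True` for the model output and once with `with_confidence=False` for the annotations gives two folders with matching file names, one file per test image.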