How to collect your deep learning dataset

Deep learning has become the go-to method for solving many challenging real-world problems, and it is by far the best-performing approach for computer vision tasks. These deep learning machines need fuel, lots of fuel, and that fuel is data. In general, the more labelled data we have, the better our model performs. Google has even explored this idea at large scale, with a dataset of 300 million images!

Revisiting Unreasonable Effectiveness of Data in Deep Learning Era

When deploying your Deep Learning model in a real-world application, you should really be constantly feeding it more data to continue improving its performance. Feed the beast: if you want to improve your model’s performance, get some more data!

Increasing data consistently yields better performance

But where do we get all this data from? Well-annotated data can be both expensive and time-consuming to acquire. Hiring people to manually collect images and label them is not efficient at all. And in the deep learning era, data is arguably your most valuable resource.

Here I’m going to show you 3 ways to get your labelled data. These will be way more efficient than manually downloading and labeling images, allowing you to save both time and money. Once you have your base dataset going, it’s easy to snowball and build up a massive dataset to create a high-performing and robust deep learning model.

Let’s get started!

Scraping from the web

Manually finding and downloading images takes a long time simply due to the amount of human work involved. So what do we, as people who program computers, do when a task requires a lot of manual work? … We program it of course! We write code to automate the task!

We’ll use the example of collecting data for a computer vision task, such as object detection or maybe segmentation. Our task probably involves some common objects we would like to detect. The name of each object becomes the keyword for our web scraping, and it also becomes the class name for that object.
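To make the keyword-equals-class-name idea concrete, here is a minimal sketch rather than a finished scraper. It assumes you already have a list of image URLs for each keyword (how you get them depends on the site or search engine you scrape). The names `save_images_by_keyword` and `fetch_url`, and the choice to save everything with a `.jpg` extension, are assumptions made for illustration.

```python
import urllib.request
from pathlib import Path


def fetch_url(url):
    """Download the raw bytes at `url` (default fetcher, uses the network)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def save_images_by_keyword(urls_by_keyword, out_dir, fetch=fetch_url):
    """Save images into one folder per keyword; return (path, label) pairs.

    `urls_by_keyword` maps a search keyword, which doubles as the class
    name, to the image URLs found for it.
    """
    labelled = []
    for keyword, urls in urls_by_keyword.items():
        class_dir = Path(out_dir) / keyword.replace(" ", "_")
        class_dir.mkdir(parents=True, exist_ok=True)
        for index, url in enumerate(urls):
            try:
                data = fetch(url)
            except OSError:
                continue  # skip dead links and timeouts
            # Assumption: every download is stored with a .jpg extension.
            path = class_dir / f"{index:05d}.jpg"
            path.write_bytes(data)
            labelled.append((path, keyword))
    return labelled
```

Passing the fetcher in as an argument keeps the download step swappable, so you can try the folder layout with a fake fetcher before touching the network.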

This is of course very easy for a task such as image classification, where the images’ annotations are quite coarse. But what if we want to do something like instance segmentation? We need labels for every single pixel in the image! To get those, it’s best to use some of the really great image annotation tools that are already out there!

The Polygon-RNN++ paper shows how you can create a model that, given a rough set of polygon points around an object, can generate the pixel labels for segmentation. Deep extreme cut is also quite similar except they use only the four extreme points around the object. This will then give you some nice bounding box and segmentation labels! Their code on GitHub is also very easy to use.
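Neither model is reproduced here, but the relationship between the clicked points and the box label is simple enough to sketch. The snippet below is only an illustration with hypothetical helper names: given polygon points as (x, y) pairs in image coordinates, it derives the bounding box and the four extreme points that a tool like Deep Extreme Cut would take as input.

```python
def bounding_box(points):
    """Return (x_min, y_min, x_max, y_max) enclosing a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def extreme_points(points):
    """Return the left-most, top-most, right-most and bottom-most points.

    Image coordinates are assumed, so "top" means the smallest y value.
    """
    left = min(points, key=lambda p: p[0])
    top = min(points, key=lambda p: p[1])
    right = max(points, key=lambda p: p[0])
    bottom = max(points, key=lambda p: p[1])
    return left, top, right, bottom
```

Note that the bounding box of the four extreme points is the same as the bounding box of the full polygon, which is why four clicks are already enough for a box label.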

Another option is to use an existing image annotation GUI. LabelMe is a very popular one where you can draw both bounding boxes and set polygon points for segmentation maps. Amazon Mechanical Turk (MTurk) is also a cheap option if you don’t want to do it yourself!


Since data has become such a valuable commodity in the deep learning era, many startups have started to offer their own image annotation services: they’ll gather and label the data for you! All you’ll have to do is give them a description of what kind of data and annotations you’ll need.

Mighty AI is one that has been doing self-driving car image annotation and has become pretty big in the space; they were at CVPR 2018 too. Playment AI are less specialized than Mighty AI, offering image annotation for any domain. They also offer a couple more tools such as video and landmark annotations.

Pre-trained networks

Many of us already know the idea of transfer learning: start with a network pre-trained on a large dataset, and then fine-tune it on our own. Well, we can use the same idea for collecting our new dataset!

The datasets that these pre-trained networks were trained on are huge; just check out the Open Images dataset, with over 15 million bounding boxes across 600 categories! A network trained on this dataset is already going to be pretty darn good at detecting objects. So we can use it to draw some bounding boxes around the objects in our images. This cuts our work in half since all we then have to do is classify the objects in the boxes! Plus, with 600 categories, some of the objects you want to detect and classify may already be picked up at high accuracy by this pre-trained network. The TensorFlow Object Detection API actually already has a network pre-trained on Open Images if you’d like to try it out!
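In practice you probably won’t want to trust every box the detector draws. Here is a rough sketch of one way to triage its output, assuming each detection is a plain dict with `box`, `label` and `score` keys. That format, the function name and the 0.8 threshold are all made up for illustration and are not the output format of any particular library.

```python
def split_by_confidence(detections, threshold=0.8):
    """Split detections into auto-accepted pre-labels and ones needing review."""
    accepted, needs_review = [], []
    for det in detections:
        # High-scoring boxes become pre-labels; the rest go to a human.
        target = accepted if det["score"] >= threshold else needs_review
        target.append(det)
    return accepted, needs_review


# Hypothetical detector output for one image.
detections = [
    {"box": (12, 30, 80, 120), "label": "car", "score": 0.95},
    {"box": (200, 40, 260, 90), "label": "dog", "score": 0.42},
]
accepted, needs_review = split_by_confidence(detections)
```

The threshold is a trade-off: raise it and you review more boxes by hand, lower it and more wrong boxes slip into your dataset.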

The snowball effect

So now you’ve collected an initial dataset. You train your network on it and put it into your product. It performs well enough to serve your needs, but it’s not quite as accurate as you’d ideally like it to be. Well, now that you have a baseline network running, you can use that network to collect even more data! This network will perform better on your task than the general pre-trained one since you’ve fine-tuned it for your specific problem. Thus you can use it to collect more and more data even faster to make your network even better; a beautiful snowball effect!
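The loop described above can be sketched in a few lines. This is only an outline under some assumptions: `train` is a hypothetical function that takes the labelled data and returns a model, and a model is any callable that maps a sample to a (label, confidence) pair. Only confident predictions are added to the dataset in each round.

```python
def snowball(labelled, unlabelled, train, rounds=3, threshold=0.9):
    """Grow `labelled` by repeatedly pseudo-labelling confident samples.

    `train(labelled)` must return a model: a callable mapping a sample to
    a (label, confidence) pair. Both are hypothetical stand-ins here.
    """
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(rounds):
        model = train(labelled)  # retrain on everything collected so far
        remaining = []
        for sample in pool:
            label, confidence = model(sample)
            if confidence >= threshold:
                labelled.append((sample, label))
            else:
                remaining.append(sample)
        if len(remaining) == len(pool):
            break  # nothing new was confident enough; stop early
        pool = remaining
    return labelled, pool
```

It is worth spot-checking the automatically added labels by hand, since a confident mistake that gets into the training set will be learned in the next round.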


Thanks for reading! Hopefully you learned something new and useful about how to efficiently collect data to train your deep network! If you enjoyed reading, feel free to hit the clap button so other people can see this post and hop on the learning train with us!

Source: Deep Learning on Medium