Source: Deep Learning on Medium
Creating a dataset from scratch out of images from the internet is tedious: scraping scripts, noisy data and time-consuming pre-processing are the norm. That’s why, together with Jeremy Howard, we built the google_images_dataset notebook, which allows you to easily download images from Google without violating its Terms of Service.
The first version of the notebook had a step-by-step instruction guide and code to easily download the images for each category and then train a model with the fast.ai library. This was useful, but we saw that the data on the web was really noisy (no wonder!), and this negatively affected the performance of our models.
With this in mind, Zach Caceres and Jason Hendrix from the fast.ai SF Study Group developed an Image Cleaner that allows the user to delete images that do not belong to the dataset and relabel those that are incorrectly labelled.
In the meantime, I developed a duplicate detector that allows you to easily compare the most similar images in your dataset and delete those that are actually duplicates.
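The duplicate detector in the notebook is built on fast.ai, but the general idea can be illustrated without it. Here is a minimal, library-free sketch (the function names, the toy 4×4 "images" and the distance threshold are my own, not taken from the notebook): each image gets a compact "average hash" fingerprint, and pairs whose fingerprints differ by only a few bits are surfaced as likely duplicates, ranked most-similar first so a human can confirm before deleting.

```python
from itertools import combinations

def average_hash(pixels):
    """Fingerprint a grayscale image (2D list of 0-255 values):
    each bit records whether a pixel is above the mean brightness."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(h1, h2):
    """Number of differing bits between two fingerprints."""
    return sum(a != b for a, b in zip(h1, h2))

def find_duplicates(images, max_distance=2):
    """Return (distance, name_a, name_b) for every pair of images whose
    fingerprints differ by at most max_distance bits, most similar first."""
    hashes = {name: average_hash(img) for name, img in images.items()}
    pairs = []
    for a, b in combinations(hashes, 2):
        d = hamming(hashes[a], hashes[b])
        if d <= max_distance:
            pairs.append((d, a, b))
    return sorted(pairs)

# Toy 4x4 "images": two near-identical copies and one genuinely different.
images = {
    "cat_1.jpg": [[10, 10, 200, 200]] * 4,
    "cat_2.jpg": [[10, 12, 198, 201]] * 4,   # same photo, re-encoded
    "dog_1.jpg": [[200, 10, 200, 10]] * 4,
}
print(find_duplicates(images))  # → [(0, 'cat_1.jpg', 'cat_2.jpg')]
```

The same ranked-pairs output is what makes the workflow practical: instead of eyeballing every possible pair, you only review the handful of candidates the detector puts at the top of the list.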
You can find the notebook to create your dataset in my repo here and the notebook to clean your dataset here. Note that you will need to install the fastai library first to be able to run them. That’s one less excuse for getting started: what are you waiting for?