How to create a dataset in Google Colab for your Machine Learning projects



Google Colab (GC) is a recent tool by Google, released to make computation for Deep Learning highly accessible. It is free and under continuous development. You can find all the information about it on the official page (pardon, notebook). In fact, it is an IPython notebook living in a dedicated Linux environment, ready to be used. You can find plenty of resources on Medium, KDnuggets, and elsewhere, covering both how to get started and more advanced topics.

An always-free environment with IPython and GPU/TPU, ready to be used? Run and get it!

Now that you have the machine for computation, you need a dataset to play with. There are plenty of them around the web, but what if you’d like to play with a very specific one, just for Deep Learning hacking purposes? Something like “Marvel characters dataset”, or “soccer players bobbleheads” (...why?! 😳), or even “sushi dishes”. As these examples suggest, we are talking about image datasets, although you can adapt the process presented here to other common dataset types.

The main problems here are:

  1. how to automatically collect specific data
  2. how to store such data in Google Colab for the experiment

and we are going to see how to solve both.

Requirements

  • An active Google account, so that you can access your Google Drive folder.

The strategy

What I explain below refers to this Colabook (Google Colab notebook), already available online, which you can copy to your Drive to have fun with.

We will tackle the first problem (data collection) by scraping Google Images, while we’ll overcome the second one (data storage) using Colaboratory tools.

Disclaimer: the content of this notebook is for informational use only. I recommend that anyone who needs massive or frequent use consult the Google Cloud page or the Custom Search API.

The code

Our scraper will be google_images_download, a beautiful and very easy to use script. It uses Selenium, a browser automation library, to scrape from the web, but don’t worry: no other programs will open, since it runs in the background.
By the way: it is NOT an official Google package.

The steps to accomplish the mission, as followed in the Colabook:

  1. Install google_images_download
  2. Download Chromedriver
  3. Set the chromedriver path for the script
  4. Scrape and check!

Install google_images_download by running the usual pip command in a Colab cell.

!pip install google_images_download

Afterwards, download an official Chromedriver release from its web site (version 2.42 in the command below), unzip it, and get its path in Colab; everything by code, of course.

!wget https://chromedriver.storage.googleapis.com/2.42/chromedriver_linux64.zip && unzip chromedriver_linux64.zip

The Colabooks store data inside the content/ folder, so…
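
A minimal sketch of what this step could look like, assuming the archive was unzipped in /content (Colab's default working directory). The names CONTENT_DIR and chromedriver_path are illustrative and are not taken from the original notebook.

```python
import os
import stat

# Assumption: the archive was unzipped in Colab's default working
# directory, so the binary sits directly inside /content.
CONTENT_DIR = "/content"
chromedriver_path = os.path.join(CONTENT_DIR, "chromedriver")

# Make sure the binary is executable. This is skipped when the file is
# absent, for example when the cell runs outside Colab.
if os.path.isfile(chromedriver_path):
    mode = os.stat(chromedriver_path).st_mode
    os.chmod(chromedriver_path, mode | stat.S_IXUSR)

print(chromedriver_path)
```

The path is kept in a variable so it can be handed to the scraper later.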

Now let’s mount Google Drive in order to manage file storage. We will use the colabtools module by Google, a tool set still in development but already very powerful.
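
A minimal sketch of the mount step, assuming the `google.colab.drive` helper that ships with Colab. The mount point `/content/gdrive` is a common choice and an assumption here, not necessarily the one used in the original notebook. The import is guarded so the cell does nothing harmful outside Colab.

```python
# Assumption: a commonly used mount point; any empty folder under
# /content would work as well.
MOUNT_POINT = "/content/gdrive"

try:
    from google.colab import drive  # only available inside Colab
except ImportError:
    drive = None

if drive is not None:
    # Triggers the Google authorization flow, then exposes your Drive
    # files under MOUNT_POINT.
    drive.mount(MOUNT_POINT)
else:
    print("Not running in Colab; skipping the Drive mount.")
```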

When you run this code, a link and a text field will appear.

Google Authentication. Follow the link in order to get the auth code to paste in the text field under the link.

It is an authorization link by Google, so click on it. After a couple of authentication steps you will see the auth code; copy it, paste it into the text field you saw in the Colabook cell (check the image above), and finally press Enter.

Now set up the scraper and run it. The code below shows an example of usage for searching for 50 images of “dogs with hats”. Run it and keep going.
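
A minimal sketch of such a search, using the `googleimagesdownload` class and the `keywords`, `limit`, `print_urls`, `chromedriver` and `output_directory` arguments of google_images_download. The helper names `build_arguments` and `run_scraper`, as well as the two default paths, are hypothetical and chosen only for illustration.

```python
def build_arguments(keywords, limit,
                    chromedriver="/content/chromedriver",
                    output_directory="/content/downloads"):
    """Build the argument dictionary expected by google_images_download.

    Assumption: both default paths are illustrative; point them at your
    own chromedriver binary and at the folder where images should land
    (for instance a folder inside your mounted Drive).
    """
    return {
        "keywords": keywords,
        "limit": limit,
        "print_urls": False,
        "chromedriver": chromedriver,
        "output_directory": output_directory,
    }


def run_scraper(arguments):
    """Run the download; needs the google_images_download package."""
    from google_images_download import google_images_download

    response = google_images_download.googleimagesdownload()
    return response.download(arguments)


arguments = build_arguments("dogs with hats", 50)
print(arguments)
```

In a Colab cell, calling `run_scraper(arguments)` starts the actual download. It is kept out of the block above so that the configuration can be inspected first.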

Don’t you think dogs with hats are funny? 🐶

Beyond this example, you will find many other arguments for tuning the search on the official page of the project; check it out.

At this point, if no errors came up, we should have our dataset ready to be used, somewhere in the Colabook environment. Let’s get it.
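
One way to locate the downloaded files is to walk the output folder and keep only image files. This is a sketch under the assumption that the scraper wrote to /content/downloads; the helper name `list_images` and the extension list are illustrative.

```python
import os

# Extensions treated as images; extend the tuple if you need more.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp")


def list_images(root):
    """Return the sorted paths of all image files found under root."""
    found = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            if name.lower().endswith(IMAGE_EXTENSIONS):
                found.append(os.path.join(folder, name))
    return sorted(found)


# Assumption: the scraper's output directory was /content/downloads.
image_paths = list_images("/content/downloads")
print(len(image_paths), "images found")
```

If the folder does not exist, the function simply returns an empty list, which is a quick hint that the download step did not write where you expected.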

Now the last step: to be sure the code worked well, let’s pick a sample and visualize it with matplotlib.
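
A minimal sketch of this check, assuming the images are JPEG files somewhere under /content/downloads (both the folder and the helper names `pick_sample` and `show_image` are assumptions made for illustration).

```python
import glob
import random


def pick_sample(paths, seed=None):
    """Return one random path from paths, or None if the list is empty."""
    if not paths:
        return None
    return random.Random(seed).choice(paths)


def show_image(path):
    """Display the image stored at path with matplotlib."""
    import matplotlib.pyplot as plt

    plt.imshow(plt.imread(path))
    plt.axis("off")
    plt.show()


# Assumption: JPEG files under /content/downloads; adjust the pattern
# to match your own output directory and file types.
candidates = glob.glob("/content/downloads/**/*.jpg", recursive=True)
sample = pick_sample(candidates)
if sample is not None:
    show_image(sample)
else:
    print("No images found to display.")
```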

It will return this pretty guy.

Probably a rare picture of King Doggy

Voilà! Enjoy your brand new doggy pictures in Colabook for your weird experiments. Please let me know which strange dataset you’ll make! 😃

References

I found an interesting post about file management in Colabooks, for anyone who wants to go deeper.

https://medium.freecodecamp.org/how-to-transfer-large-files-to-google-colab-and-remote-jupyter-notebooks-26ca252892fa

Source: Deep Learning on Medium