Google Colab: Work with large datasets even without downloading it!!

Image from Google Images

Yeah, you read that right! By the end of this post you will be training your model on a huge dataset without downloading it to your machine, Google Drive, cloud storage, etc.

At the very beginning we all work with built-in datasets: the Keras and scikit-learn loaders, MNIST, CIFAR-10, and so on. But the real challenge is to work with messy datasets from all around the world and bring the best out of them.
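For context, here is how effortless those built-in datasets are. A single call and Keras downloads and caches CIFAR-10 for you (a minimal sketch using the standard tf.keras API):

from tensorflow.keras.datasets import cifar10

# One call: Keras fetches CIFAR-10 (~170 MB) once and caches it locally.
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape)  # (50000, 32, 32, 3)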

That sounds very pretty, but let me tell you: until now we also had to download those large datasets to our local machines or cloud services, which is very irritating for developers like us.

I am going through the Udacity Deep Learning Nanodegree. One of its projects, “Dog Breed Classification,” requires downloading a dataset of around 1.05 GB, and after that a VGG16 bottleneck-features file named DogVGG16Data.npz of around 860 MB. That is time-consuming and also increases carbon emissions.

So I decided to find a way where I don’t need to download these files separately and can train my model quickly without wasting time.

So you can follow these steps:

Step 1: Open your Jupyter Notebook

Right now I am using Google Colab to train my models, as I am a beginner.

Step 2: Download the dataset using wget

!wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip

This downloads the dataset straight into the Colab runtime in 2–3 minutes!
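If you want to confirm the download landed, a quick optional sanity check:

!ls -lh dogImages.zip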

As it is a zip file, I had to extract it inside the Colab environment. For this you can write this code:

from zipfile import ZipFile

file_name = "dogImages.zip"

# Open the archive, list its contents, and extract everything
# into the current working directory (/content on Colab).
with ZipFile(file_name, 'r') as zip_ref:
    zip_ref.printdir()
    print('Extracting all the files now...')
    zip_ref.extractall()
    print('Done!')

Easy, isn’t it? 😛
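If the archive unpacks the way the Udacity project expects (an assumption on my part: a dogImages/ folder containing train, valid, and test subfolders), you can sanity-check the result like this:

import os
# Expect something like ['train', 'valid', 'test'] if the assumption holds.
print(os.listdir('dogImages'))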

Step 3: Download into a specific directory

Here I first created a bottleneck_features folder and then downloaded my DogVGG16Data.npz file into it.

To create the folder (note that this lands in the Colab file system, not Google Drive):

import os
# Create the target folder at the root of the Colab file system
os.mkdir("/bottleneck_features")

And in the next cell I downloaded the file into it:

!wget -P /bottleneck_features https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/DogVGG16Data.npz

Step 4: Train your model
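The training code itself lives in the repo below, but here is a rough sketch of how you might load the bottleneck features and fit a classifier on top of them. Assumptions on my part: the .npz stores 'train' and 'valid' arrays as in the Udacity dog project, there are 133 breed classes, and train_targets/valid_targets are hypothetical names for the one-hot labels you build from dogImages:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense

# Load the pre-computed VGG16 bottleneck features.
bottleneck_features = np.load('/bottleneck_features/DogVGG16Data.npz')
train_VGG16 = bottleneck_features['train']  # assumed key
valid_VGG16 = bottleneck_features['valid']  # assumed key

# A small classifier head on top of the frozen VGG16 features.
# 133 = number of dog breeds in the Udacity dataset.
model = Sequential([
    GlobalAveragePooling2D(input_shape=train_VGG16.shape[1:]),
    Dense(133, activation='softmax'),
])
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# train_targets / valid_targets are hypothetical one-hot label arrays
# built from the dogImages folders.
# model.fit(train_VGG16, train_targets,
#           validation_data=(valid_VGG16, valid_targets),
#           epochs=20, batch_size=20)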

Here is the link to my GitHub repo:

Hope you enjoyed reading this blog.