Source: Deep Learning on Medium
How to solve Google Colab and Drive timeout
If you are new to handling millions of files for data science with Google Colab and Google Drive, I have some news: it is not all about algorithms, training/dev/test sets or parameter tuning. File system management matters a great deal to the success of your data science project.
I learned this the hard way while working on my thesis.
My thesis is about Natural Language Generation based on audio files, and for that I have 1.6 million files.
I also want to use Google Colab's GPU for training my model and Google Drive to store my data.
What I did first
I just uploaded all my data to the folder “Data” on my Google Drive using the following function:
Then this “for” loop uploaded every file to Google Drive.
What I found
A read timeout error like the following occurs:
This happens when you have thousands of files in a single directory, and if you cannot read your data, you cannot train your model and your project fails.
The theoretical solution
My thesis supervisor recommended creating subfolders, both to avoid straining the file system and to avoid the timeout issue on Google Colab when reading the files.
How does it work?
Imagine you have 3 files and their names are:
The idea is to create subfolders based on the first letter of their names, then the second letter and so on.
Remember that my main folder is “Data”; I have three files, and in this case I will create two levels of subfolders.
First level of subfolders:
The first level is based on the first letter of each file name; given the files I have, my directory will contain two subdirectories.
Second level of subfolders:
The second level is based on the second letter of each file name; given the files I have, each of my subfolders will now contain:
Therefore, our final directories for reading each of our files are:
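The destination subpath can be computed from the file name alone. As a minimal sketch of the idea (the function name `bucket_path` and its `depth` parameter are my own, not from the original code):

```python
def bucket_path(filename, depth=2):
    """Subfolder chain for a file: one level per leading letter.

    With depth=2, a file named "adayinthelife" ends up under Data/a/d/.
    """
    return "/".join(filename[:depth])

print(bucket_path("adayinthelife"))   # a/d
```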
The code solution
“Lista” is a helper of helpers: it retrieves the names of subfolders (and files) together with their ids. Following the example, if we apply it to “…/Data” we get the names of the subfolders (a and b) and their ids.
“Verification” tells the “Main” function how many subfolders it must create. It also returns “helper_path”, the id of the deepest subdirectory already created.
“Subdirectory” creates the subfolders and takes three parameters: “helper_path”, “helper” and “i”.
“Helper_path” is the folder id of the last subfolder created, “i” comes from the “for” loop, and the letter in position “i” becomes the name of the subfolder to be created.
“Load” uploads the file to Google Drive and has three parameters: “last_path”, “file” and “filename”. “Last_path” is the id of the destination folder, needed to put the file in the right place. “Filename” is the name under which the file will appear in the destination folder. “File” is the local path of the file on your hard disk, which the code reads to upload it. This function also checks whether the file already exists in the destination folder, so the same file is never uploaded twice.
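The real helpers call the Google Drive API, so here is only a self-contained sketch of their roles against an in-memory stand-in for Drive (the stand-in `drive` dict, the id counter and the `organize` driver are hypothetical; only the helper names come from the article):

```python
import itertools

# In-memory stand-in for Google Drive: folder ids map to their subfolders
# (name -> id) and the files they hold.
_ids = itertools.count(1)
drive = {0: {"children": {}, "files": {}}}   # id 0 plays the role of "Data"

def lista(folder_id):
    """Names and ids of the subfolders of a folder."""
    return dict(drive[folder_id]["children"])

def verification(first_path, filename, depth):
    """How many subfolder levels remain to be created, plus the id of the
    deepest already-existing subdirectory (helper_path)."""
    helper_path = first_path
    for i in range(depth):
        children = lista(helper_path)
        if filename[i] not in children:
            return depth - i, helper_path
        helper_path = children[filename[i]]
    return 0, helper_path

def subdirectory(helper_path, helper, i):
    """Create a subfolder named after the letter in position i of the name."""
    new_id = next(_ids)
    drive[new_id] = {"children": {}, "files": {}}
    drive[helper_path]["children"][helper[i]] = new_id
    return new_id

def load(last_path, file, filename):
    """'Upload' a file, skipping it if it already exists at the destination."""
    drive[last_path]["files"].setdefault(filename, file)

def organize(first_path, file, filename, depth=2):
    missing, path = verification(first_path, filename, depth)
    for i in range(depth - missing, depth):
        path = subdirectory(path, filename, i)
    load(path, file, filename)

organize(0, "/local/adayinthelife.wav", "adayinthelife")
organize(0, "/local/aca.wav", "aca")
print(sorted(lista(0)))              # ['a']
print(sorted(lista(lista(0)["a"])))  # ['c', 'd']
```

Note how the second call only creates the missing level “c”, because “verification” reports that “a” already exists.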
The “main” function uses all these helpers as follows:
1. Remove whitespace and lowercase the name. If a file is called “A day in the life”, the goal is to get “adayinthelife”, avoiding spaces and ambiguity between subfolders like “A” and “a”.
2. Load a dictionary that stores destination folder ids. The goal is to speed up the process when the destination folder already exists: in that case the code skips step 3 and goes straight to step 4.
3. Determine how many new subfolders must be created. For example, if a new file is named “aca” and “…Data/a” already exists, only “c” needs to be created.
4. Upload the data.
5. Add the new destination folder id to the dictionary (in case it was just created).
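Steps 1 and 2 can be sketched like this (the names `normalize`, `folder_ids` and `destination` are hypothetical, mine rather than the article's):

```python
def normalize(name):
    """Step 1: drop whitespace and lowercase, so "A day in the life"
    buckets under the same subfolders as "adayinthelife"."""
    return "".join(name.split()).lower()

folder_ids = {}   # step 2: cache mapping a subfolder path to its Drive id

def destination(name, depth=2):
    """Look up the destination folder id; None means it must still be
    created (step 3), otherwise the upload (step 4) can run right away."""
    key = "/".join(normalize(name)[:depth])
    return key, folder_ids.get(key)

print(destination("A day in the life"))   # ('a/d', None)
```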
The final step
Personally, I prefer to run the Organizer from another Jupyter Notebook. This extra notebook, named “Massive_uploader”, takes care of logging in to your Google account, setting up the Organizer and running a “for” loop to upload all the files you want.
First, make sure you have enabled the Google Drive API, then execute the following block of code. A new tab will open in your web browser.
Then it is time to define how many sublevels you want with “tree_depth”, and, with “first_path”, the id of the Google Drive folder where your data will be uploaded.
You can get the id (first_path) from the URL. Imagine you are already in the folder where you want to create subfolders and store all your data: the id (first_path) is the part after the last “/”.
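Assuming the usual https://drive.google.com/drive/folders/&lt;id&gt; URL shape, the id can be pulled out like this (the id in the example is made up):

```python
def folder_id_from_url(url):
    """The Drive folder id is whatever follows the last "/" in the URL."""
    return url.rstrip("/").split("/")[-1]

# hypothetical folder URL, for illustration only
print(folder_id_from_url("https://drive.google.com/drive/folders/1AbCdEfGh"))
# 1AbCdEfGh
```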
To get the data you want to upload, first define the path, i.e. where you store your data locally. In my case it is “/home/facudeza/tesis”. Then we retrieve all the files under that path.
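Retrieving the candidate files only needs the standard library; here is a sketch that demonstrates it on a throwaway directory instead of my actual “/home/facudeza/tesis”:

```python
import os
import tempfile

def local_files(path):
    """Names of the regular files directly under `path`: the upload candidates."""
    return sorted(
        f for f in os.listdir(path)
        if os.path.isfile(os.path.join(path, f))
    )

# demo on a temporary directory so the example is self-contained
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "adayinthelife.wav"), "w").close()
    print(local_files(d))   # ['adayinthelife.wav']
```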
Then make sure you have downloaded the Organizer notebook into your Jupyter directory, and call it.
Finally, execute a loop to upload your files.
You can speed up the process by creating copies of Massive_uploader and running them in parallel, but be careful not to exceed your API quota.
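The parallelism can also live inside a single notebook with a thread pool. In this sketch, `upload` is a stand-in for the real per-file Drive call, replaced with a short sleep so the example is self-contained:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def upload(filename):
    """Stand-in for the real per-file upload (the real one calls the Drive API)."""
    time.sleep(0.01)    # simulate network latency
    return filename

files = [f"file_{i}" for i in range(20)]

# Four workers play the role of four Massive_uploader copies; keep the
# worker count low enough to stay under the Drive API request quota.
with ThreadPoolExecutor(max_workers=4) as pool:
    uploaded = list(pool.map(upload, files))

print(len(uploaded))   # 20
```

`pool.map` preserves the input order, so the result list lines up with the file list even though uploads finish out of order.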
This is how I figured out how to deal with this problem and make it easier to read my files from Google Colab.
If you have any advice, recommendation or comment, they are very welcome.
Thanks for reading.