Google Colab + Drive as persistent storage for long training runs

Not everyone has a deep learning rig or enough cloud credits for hardware-accelerated computing. Google Colaboratory is a service that provides Tesla K80 GPU runtimes for free, but training deep neural networks from scratch can be a pain within its limitations. This post is about using Google Drive to store your dataset (uploading it only once) and to save checkpoints, so you can resume long training runs whenever the instance gets disconnected.

Why use Google Drive?

  • Google Colab provides a maximum GPU runtime of 8–12 hours "ideally" at a time; it may disconnect earlier than this if it detects inactivity or when there is a heavy load.
  • Drive acts as persistent storage for the Colab VM, so you won't lose your trained data.
  • Upload your dataset once and use it hassle-free whenever you reconnect to a new runtime.

Mounting Google Drive

To mount your Drive, run the two lines of code below. They print a link where you can obtain an authorization code and prompt you for it: open the link, copy the code, paste it into the prompt, press Enter, and your Drive is mounted.

from google.colab import drive
drive.mount('/content/gdrive')
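As an optional sanity check after mounting (the folder names will be whatever is in your own Drive), you can list the mounted Drive root:

!ls "/content/gdrive/My Drive"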

Accessing Files from Google Drive

  • /content/gdrive/My Drive is the parent directory for accessing the files in your Drive.
  • For example, if you have a folder called Sample containing a file named sample.csv, the path is /content/gdrive/My Drive/Sample/sample.csv, as in the sketch below.
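As a quick illustration, here is how you could read that file with pandas (assuming the Sample/sample.csv layout above exists in your own Drive):

import pandas as pd

# Paths under the mount point behave like ordinary local paths
df = pd.read_csv('/content/gdrive/My Drive/Sample/sample.csv')
print(df.head())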

Storing and Loading Dataset from Drive

  • Always upload your dataset as a zipped file: the upload is much quicker and unzipping on the VM is easy.
  • Once the Drive is mounted, you can unzip a file named sample.zip in the Sample folder with !unzip '/content/gdrive/My Drive/Sample/sample.zip'
  • Unzipped files will be in the /content/ directory.
  • So, if the zip file contained a file called train.csv, after unzipping it can be accessed at '/content/train.csv'. A sketch of the whole step follows below.
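Putting these steps together, a minimal sketch (assuming the sample.zip/train.csv layout described above):

# Extract from Drive onto the VM's local disk, which is faster to read from
!unzip -q '/content/gdrive/My Drive/Sample/sample.zip' -d /content/

import pandas as pd
train = pd.read_csv('/content/train.csv')  # the file extracted above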

Saving Checkpoints using Callbacks in Keras

from keras.callbacks import ModelCheckpoint

filepath = "/content/gdrive/My Drive/MyCNN/epochs:{epoch:03d}-val_acc:{val_acc:.3f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1,
                             save_best_only=True, mode='max')
callbacks_list = [checkpoint]
  • filepath : points to a folder called MyCNN in Drive; each checkpoint file is named with the epoch number and validation accuracy, and these files contain the weights of your neural network.
  • ModelCheckpoint : with the arguments passed above, it monitors validation accuracy and saves a checkpoint whenever a higher validation accuracy is reached than at the last checkpoint.
  • callbacks_list : it is a list so that you can append any other callbacks to it and pass it to the fit/fit_generator function while training, as shown in the code block below (every callback in the list is invoked after each epoch).
# datagen is assumed to be an ImageDataGenerator defined earlier in the script;
# every callback in callbacks_list runs after each epoch
model.fit_generator(datagen.flow(x_train, y_train, batch_size=64),
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test),
                    callbacks=callbacks_list)

For more details on callbacks and checkpoints, visit the Keras Docs.

Resuming training once disconnected

  • Mount the Drive in the new runtime.
  • Create the model with the same architecture.
  • model.load_weights('/content/gdrive/My Drive/MyCNN/epochs:047-val_acc:0.905.hdf5') loads the weights from a checkpoint where, at the 47th epoch, the model had reached a new high of 90.5% validation accuracy.
  • Then compile and fit the model, resuming from the 48th epoch, as in the sketch below.
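A minimal sketch of the resume step (the checkpoint filename is the example from above; model, datagen, the data arrays, and epochs are assumed to be rebuilt exactly as in the original training script, and the loss/optimizer here are placeholders for whatever you compiled with originally):

# Rebuild the same architecture first, then restore the saved weights
model.load_weights('/content/gdrive/My Drive/MyCNN/epochs:047-val_acc:0.905.hdf5')

# Compile with the same settings as the original run (placeholders here)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# initial_epoch tells Keras that 47 epochs are already done,
# so training resumes from the 48th epoch
model.fit_generator(datagen.flow(x_train, y_train, batch_size=64),
                    epochs=epochs,
                    initial_epoch=47,
                    verbose=1,
                    validation_data=(x_test, y_test),
                    callbacks=callbacks_list)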

Hope this helps you train on Colab without much to worry about :)