Kaggle vs. Colab Faceoff — Which Free GPU Provider is Tops?



Specs, UX, and deep learning experiments with fastai and mixed precision training

Google has two products that let you use GPUs in the cloud for free: Colab and Kaggle. They are pretty awesome if you’re into deep learning and AI. The goal of this article is to help you better choose when to use which platform.

Kaggle just got a speed boost with Nvidia Tesla P100 GPUs. 🚀 However, as we’ll see in a computer vision experiment, Colab’s mixed-precision training helps to close the speed gap.

In this article we’ll show you how to check hardware specs and explore UX differences. We’ll also compare training times on a computer vision task that uses transfer learning, mixed precision training, learning rate annealing, and test time augmentation.

Let’s get to it! 👍

Twin peaks Colab and Kaggle, side by side in the Google range

Kaggle and Colab are fairly similar products. Both Kaggle and Colab

  • offer free GPUs
  • provide Jupyter Notebooks in the browser — albeit with their own unique flavors
  • are designed to foster collaboration for machine learning
  • are Google products
  • are imperfect, but are pretty useful in many situations — particularly when you are starting out in deep learning 😄
  • don’t provide great info on their hardware specs

The last point is one we’ll dig into in a moment. Unfortunately, neither Kaggle nor Colab tells you exactly what specs you get when you use their environments. The docs that do exist often are out of date (see here as of March 11, 2019). Further, the widgets on screen tell some of the story, but differ from what I unearthed. I’ll show you common profiler commands you can use to see your environment’s specs.

First, a little background on GPUs — if this is old hat 👒 to you, feel free to skip ahead.

What’s a GPU?

GPU is short for graphics processing unit. GPUs are specialized chips that were originally developed to speed up graphics for video games. They do lots of matrix calculations quickly. This is a very handy characteristic for deep learning applications. Fun fact: GPUs are also the tool of choice for cryptocurrency mining for the same reason.

Nvidia P100 GPU

Why use a GPU?

Using a GPU with adequate memory makes training a deep learning network many times faster than using a CPU alone. Because it’s much nicer to get feedback in minutes or hours instead of days and weeks, you’ll want to use a GPU if you are into deep learning. For sure. 😃

Specs

As of early March 2019, Kaggle has upgraded its GPU chip from an Nvidia Tesla K80 to an Nvidia Tesla P100. Colab still gives you a K80. For a brief discussion of Nvidia chip types, see my article comparing cloud GPU providers here.

There are a lot of different ways to find info about your hardware. Two useful commands are !nvidia-smi for GPU info and !cat /proc/cpuinfo for CPU info. Even though you want to train your model with a GPU, you’ll also still need a CPU for deep learning.

Any time you use an exclamation point at the start of a Jupyter Notebook code line you are running a bash command. Here’s my article on bash commands, including cat, if you’d like more info about those.
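If you want to run those checks yourself, the cells look like this:

    !nvidia-smi           # GPU model, driver version, and GPU memory (reported in MiB)
    !cat /proc/cpuinfo    # CPU model, core count, and clock speed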

See this Google Sheet for the specs I compiled in the snapshot below.

Memory and disk space can be confusing to measure. The total amount isn’t all available once Colab and Kaggle install their software and start their processes. Here’s a breakdown of the memory discrepancies between the !cat /proc/meminfo profiler command and the Colab and Kaggle widgets.

Total is the total memory. Available is the observed amount of memory available after startup with no additional running processes. You can see that the profiled amounts are close, but don’t line up exactly with the amounts shown in the Colab and Kaggle widgets.
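If you want to check the numbers in your own session, here’s a minimal sketch that pulls the MemTotal and MemAvailable figures out of /proc/meminfo and converts them to GiB (the file labels its values kB, but they are actually KiB):

    def meminfo_gib():
        stats = {}
        with open('/proc/meminfo') as f:
            for line in f:
                key, value = line.split(':')
                stats[key] = int(value.split()[0])  # values are reported in KiB
        return {k: round(stats[k] / 1024**2, 2) for k in ('MemTotal', 'MemAvailable')}

    print(meminfo_gib())  # e.g. {'MemTotal': ..., 'MemAvailable': ...}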

Mouseover in Colab
Kaggle Sidebar

Here’s a Kaggle Kernel and here’s a Colab Notebook with the commands so you can see the specs in your own environment. Make sure you first enable the GPU runtime as shown at the end of this article.

Note that the GPU specs from the command profiler will be returned in Mebibytes — which are almost the same as Megabytes, but not quite. Mebibytes can be converted to Megabytes via Google search — just type in the labeled quantities to convert. Google is everywhere — aren’t they? 😄
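If you’d rather skip the search, the arithmetic is simple: a mebibyte is 2^20 bytes while a megabyte is 10^6 bytes, so a mebibyte is about 4.9% bigger. The same idea applies to gibibytes and gigabytes, which we’ll meet again below:

    MIB = 2**20   # bytes in a mebibyte
    GIB = 2**30   # bytes in a gibibyte

    print(11438 * MIB / 10**6)   # ~11,438 MiB, the kind of number nvidia-smi prints, is ~11,993 MB
    print(11.17 * GIB / 10**9)   # 11.17 GiB is about 11.99 GB, i.e. roughly 12 GB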

The Kaggle widget also shows significantly less disk space than the profiler reported. Kaggle may limit how much disk space you can use in your active work environment, regardless of how much is theoretically available.

Kaggle states in their docs that you have 9 hours of execution time. However, the kernel environment shows a max of 6 hours per session in their widget on the right side of the screen. Note that restarting your kernel restarts the clock. Kaggle also restarts your session after 60 minutes of inactivity.

Colab gives you 12 hours of execution time, but also kicks you off if you are idle for more than 90 minutes.

Let’s get to what matters most: how long it takes to do some deep learning on these platforms!

Computer Vision Speed Comparison

I compared Kaggle and Colab on a deep learning image classification task. The goal was to predict whether an image was of a cat or a dog. The dataset consisted of 25,000 images, in equal numbers of cats and dogs. The dataset was split into 23,000 images for training and 2,000 images for validation. The dataset is available on Kaggle here.

Cat and dog images from the dataset

I built a convolutional neural network using the FastAI library and trained it using transfer learning with a pretrained ResNet. The model used several tricks for training, including data augmentation and learning rate annealing. Predictions on the test set were made with test-time augmentation. The code was adapted from this FastAI example.

The Kaggle Kernel can be accessed here and the Colab notebook can be accessed here. The batch size was set to 16 and the FastAI version was 1.0.48. The times reported by FastAI’s built-in profiler for several training phases and a prediction phase were summed.
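The linked notebooks have the exact code; as a rough sketch of a fastai v1 (1.0.48-era) pipeline, it looks something like this. The data path, epoch count, and models.resnet34 backbone here are assumptions for illustration, and the folder layout is assumed to already be split into train and valid:

    from fastai.vision import *

    path = Path('data/dogscats')  # hypothetical location of the unzipped Kaggle dataset
    data = (ImageDataBunch
            .from_folder(path, train='train', valid='valid',
                         ds_tfms=get_transforms(),   # data augmentation
                         size=224, bs=16)            # bs=16, as in the experiment
            .normalize(imagenet_stats))

    learn = cnn_learner(data, models.resnet34, metrics=accuracy)  # transfer learning
    learn.fit_one_cycle(4)        # the one-cycle policy handles the learning rate annealing
    preds, targets = learn.TTA()  # predictions with test-time augmentation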

Validation set accuracy was over 99% in all cases. The mean time across three runs was 11:17 (minutes:seconds) on Kaggle and 19:54 on Colab, so Kaggle’s total run time was roughly 43% shorter than Colab’s.

Batch Size

I had to drop the batch size from 64 to 16 images to run the image classification successfully in Kaggle. The error with larger batch sizes appears to be caused by the shared memory in the Docker container being set too low. Funny enough, I raised this exact issue with Google Colab in late 2018 — they had it fixed within a week. The same issue remains open with Kaggle as of mid-March 2019.
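You can check the shared memory limit from a notebook cell; inside a Docker container, /dev/shm often defaults to a small mount (commonly 64 MB), which is what the PyTorch DataLoader workers run out of:

    !df -h /dev/shm   # size of the shared memory mount used by the DataLoader workers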

Next, I ran two iterations of the same code on Colab, but with the batch size changed to 256. This change resulted in an average run time of 18:38. Two additional iterations with a batch size of 64 in Colab gave an average time of 18:14. So Colab was somewhat faster with batch sizes larger than 16.

Nonetheless, the smaller batch size wasn’t a huge issue in this task. A wide variety of batch size parameters often works well — for a discussion, see this paper, this post, and this SO answer.

When I trained the model on Colab with a batch size of 256, a warning was raised that I was using most of my 11.17GB of GPU RAM. See below.

This warning is nice, but thanks to the profiling exercise discussed above I’d learned the difference between Gibibytes and Gigabytes. We saw earlier that Colab’s GPU has 11.17 Gibibytes of RAM, which works out to about 12 Gigabytes. So we really have roughly 12 Gigabytes of RAM to use, a bit more than the warning’s label suggests. Nonetheless, if you’re out of RAM, you’re out of RAM. 😃 So it looks like a batch size of 256 is about the max with these image sizes, the default number of workers, and 32-bit precision numbers.

Mixed Precision Training

I then tried mixed-precision training in an effort to reduce training time. Mixed precision training means using 16-bit precision numbers rather than 32-bit precision numbers in calculations when possible. Nvidia claims using 16-bit precision can result in twice the throughput with a P100.

Learn about the mixed precision FastAI module here. Note that you need to switch your FastAI Learner object to 32-bit mode prior to predicting with test-time augmentation because torch.stack doesn’t yet support half precision.
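In fastai v1 the switch is one call on the Learner; a minimal sketch, continuing from the learner above (epoch count again illustrative):

    learn = learn.to_fp16()       # train with mixed (16-bit) precision
    learn.fit_one_cycle(4)

    learn = learn.to_fp32()       # back to 32-bit before TTA, since torch.stack lacks half-precision support
    preds, targets = learn.TTA()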

By using mixed precision training on Colab, I was able to bring the average completion time down to 16:37 with a batch size of 16, tested over two runs. So mixed precision does save time on Colab.

However, mixed-precision training increased the total time on Kaggle by a minute and a half, to 12:47! No other specs were changed. Validation set accuracy remained over 99% everywhere.

I found Kaggle’s default packages include slightly older versions of torch and torchvision. Updating the packages to the latest versions that Colab was using had no effect on training time. For what it’s worth, in general, I’ve noticed that the default packages on Colab are updated more quickly than they are on Kaggle.
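If you want to try the upgrade yourself, it’s a one-liner in a kernel cell, though which versions you get depends on what pip serves that day:

    !pip install --upgrade torch torchvision   # restart the kernel afterwards so the new versions load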

The hardware differences mentioned above don’t seem likely to cause the reduced performance observed on Kaggle. The only software differences observed are that Kaggle runs CUDA 9.2.148 and cuDNN 7.4.1, while Colab runs CUDA 10.0.130 and cuDNN 7.5.0.

CUDA is Nvidia’s API that gives direct access to the GPU’s virtual instruction set. cuDNN is Nvidia’s library of primitives for deep learning built on CUDA. Kaggle’s software should give a speed boost for a P100, according to this article from Nvidia. However, as seen in the cuDNN change notes, bugs that prevent speed ups are found and fixed regularly.
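You can confirm the CUDA and cuDNN versions your own session is using directly from PyTorch:

    import torch

    print(torch.version.cuda)              # CUDA version PyTorch was built against, e.g. '10.0.130'
    print(torch.backends.cudnn.version())  # cuDNN version as an integer, e.g. 7500 for 7.5.0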

We’ll have to wait for Kaggle to upgrade CUDA and cuDNN and see if mixed precision training gets faster. For now, if using Kaggle, I still encourage you to try mixed precision training, but it may not give you a speed boost. If using Colab, mixed precision training should work with a CNN with a relatively small batch size.

Let’s look at other aspects of using Colab and Kaggle.

UX

Google is a business that would like you to pay for your GPUs, so it shouldn’t be expected to give away the farm for free. 🐖

Colab and Kaggle have aspects that can be frustrating and slow. For example, both runtimes disconnect more often than one would like. Then you need to rerun your notebooks on restart. 😦

In the past, it wasn’t always guaranteed that you would even get a GPU runtime. It appears GPU runtimes are always available now. If you find one is unavailable, please let me know on Twitter @discdiver.

Let’s look at pros and cons particular to Colab and Kaggle.

Colab

Pros

  • Can save notebooks to Google Drive.
  • You can add notes to notebook cells.
  • Nice integration with GitHub — you can save notebooks directly to GitHub repos.
  • Colab has free TPUs. TPUs are Google’s own custom accelerator chips, and for many TensorFlow workloads they can be even faster than GPUs. Unfortunately, TPUs don’t work smoothly with PyTorch yet, despite plans to integrate the two. If the experiment were written in TensorFlow instead of FastAI/PyTorch, then Colab with a TPU would likely be faster than Kaggle with a GPU.

Cons

  • Some users had low shared memory limits in Colab. It appears this issue was resolved for at least one user (discussion here).
  • Working with Google Drive is a bit of a pain. You have to authenticate every session, and you can’t unzip files in Drive very easily (see the sketch after this list).
  • Keyboard shortcuts have different bindings than in regular Jupyter Notebooks. Here’s the GitHub issue to follow if you want to see if that changes.
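For reference, the Drive workflow looks roughly like this (the zip file name is hypothetical). Mounting prompts for an authorization code each new session, and it’s usually faster to copy archives onto the local disk before unzipping:

    from google.colab import drive

    drive.mount('/content/drive')   # asks you to paste an auth code every session

    # dogscats.zip is a hypothetical file name; copy it out of Drive, then unzip locally
    !cp "/content/drive/My Drive/dogscats.zip" /content/
    !unzip -q /content/dogscats.zip -d /content/data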

Now let’s look at Kaggle pros and cons.

Kaggle

Pros

  • The Kaggle community is great for learning and demonstrating your skills.
  • Committing your work on Kaggle creates a nice history.
  • Many Jupyter Notebook keyboard shortcuts transfer exactly to the Kaggle environment.
  • Kaggle has many datasets you can import.

Cons

  • Kaggle will generally autosave your work, but if you don’t commit it and then reload your page you might find you lost it all. This is not fun. 😦
  • As discussed above, the PyTorch shared memory in the Docker container is low in Kaggle. This resulted in a RuntimeError: DataLoader worker (pid 41) is killed by signal: Bus error. for the image classification task when the batch size is greater than 16 images.
  • Kaggle Kernels often seem a little laggy.

I don’t know of other cloud providers who provide free GPU time (beyond introductory credits), so this discussion is not meant to be a criticism of Google. Thanks for the free GPUs, Google! 👍 If you know of other folks with free (not just introductory) GPU resources, please let me know.

Conclusion

Both Colab and Kaggle are great resources to start deep learning in the cloud. I find myself using both platforms. You can even download and upload notebooks between the two. 😄

It’s been exciting to see Colab and Kaggle add more resources. With a P100 GPU, Kaggle was definitely faster to train and predict than Colab GPU on the image classification task we examined. If you are running an intensive PyTorch project and want a speed boost, it could be worth developing on Kaggle.

If you want to have more flexibility to adjust your batch sizes, you may want to use Colab. With Colab you can also save your models and data to Google Drive, although the process can be a bit frustrating. If you are using TensorFlow, you might want to use TPUs on Colab.

If you need more power or more time for longer-running processes, my previous experiments suggest Google Cloud Platform is the most cost-effective cloud solution.

I hope you’ve found this comparison of Colab and Kaggle useful. If you have, please share it on your favorite social media channel so others can find it, too. 👏

I write about Python, dev ops, data science, and other tech topics. If any of those are of interest to you, check them out here.

Happy deep learning!

Head down either path you like 😄