Setting up a TPU node in Google Cloud: step-by-step instructions

Source: Deep Learning on Medium

These days some models are released with implementations that run only on TPUs (e.g. the text-to-text transformer).

This post logs my experience setting up a Google Cloud TPU node for running machine learning models.

The overall sequence of steps is:

  1. Create a Google Cloud account if we don’t have one.
  2. Create a project.
  3. Create a VM instance, choosing a single CPU or GPU, the OS, hard disk space, memory, etc.
  4. Install the tool (ctpu) used to create, manage and delete TPU instances. This step requires authorization, which can be done using gcloud (installed by default on VM instances).
  5. Set up a Google Cloud Storage bucket.
  6. After using TPUs, remember to release the TPU instance, the bucket storage (after saving output if needed), and the VM instance.

Instructions for steps 1–3 are covered in the article for setting up VM instances.

4. Installing and using ctpu

Fetch ctpu and include its location in PATH.

wget https://dl.google.com/cloud_tpu/ctpu/latest/linux/ctpu && chmod a+x ctpu
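The download above leaves ctpu in the current directory, so the shell will not find it by name until that directory is on PATH. A minimal sketch, assuming ctpu was downloaded into the current working directory (CTPU_DIR is a variable name introduced here for illustration):

```shell
# Directory holding the downloaded ctpu binary.
# Assumption: ctpu was fetched into the current working directory.
CTPU_DIR="${CTPU_DIR:-$PWD}"

# Prepend the directory to PATH only if it is not already there.
case ":$PATH:" in
  *":$CTPU_DIR:"*) ;;                      # already on PATH, nothing to do
  *) export PATH="$CTPU_DIR:$PATH" ;;
esac
```

To keep the setting across sessions, the same export line could be added to ~/.bashrc.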

The ctpu command lets us provision, manage and delete TPUs. However, it must first be authorized to perform these operations, which can be done with gcloud. Type the command below and follow the instructions to copy and paste the key from the browser.

gcloud auth application-default login

The simplest usage of ctpu is to create a TPU instance, check its status, and delete it, as shown below.

ctpu up --tpu-only
ctpu status
ctpu delete

In most cases, however, we need to specify additional options. For instance, a model may require bringing up the TPU with a specific version of TensorFlow. For the text-to-text transformer, the up command is specified with additional options:

ctpu up --name=$TPU_NAME --project=$PROJECT --zone=$ZONE --tpu-size=v3-8 --tpu-only --tf-version=1.15.dev20190821 --noconf
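The command above reads TPU_NAME, PROJECT and ZONE from the shell, so they need to be set first. A sketch with placeholder values (my-tpu, my-project and us-central1-b are illustrative and not taken from the original setup); it only prints the resulting command so the values can be checked before running it:

```shell
# Placeholder values; replace with your own TPU name, project ID and zone.
TPU_NAME="my-tpu"
PROJECT="my-project"
ZONE="us-central1-b"

# Assemble the same ctpu invocation used above and print it for review.
CTPU_UP_CMD="ctpu up --name=$TPU_NAME --project=$PROJECT --zone=$ZONE --tpu-size=v3-8 --tpu-only --tf-version=1.15.dev20190821 --noconf"
echo "$CTPU_UP_CMD"
```

Once the printed command looks right, it can be pasted into the shell to bring up the TPU.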

5. Setting up and managing bucket storage

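To run a model, we may have to copy it either to our VM instance or to a bucket. We can manage buckets with gsutil (which also comes installed by default on VM instances). The commands below make a bucket, copy a file into it, and remove the bucket along with its contents.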

gsutil mb gs://<your bucket name>
gsutil cp <source> gs://<your bucket name>
gsutil rm -r gs://<your bucket name>

6. Deleting instances

Finally, when our work is done, we can delete our TPU, bucket storage and VM instance.

ctpu delete
gsutil rm -r gs://<your bucket name>

One anomalous behavior of ctpu is that it may not show the active TPU node we provisioned when we check its status. This happens at times whether we provision TPUs from a VM instance or from Google Cloud Shell.

In that case we can delete the TPU instance using the delete option in the Google Cloud Platform web interface. The VM instance can also be deleted from the web interface.
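When ctpu status does not list the node, the gcloud CLI offers another way to find and remove it from the command line. The sketch below is an alternative to the web interface, not something this walkthrough originally used. The resource names are placeholders, and the run helper (introduced here for illustration) only prints each command unless DRY_RUN=0 is set:

```shell
# Placeholder names; replace with your own resources.
TPU_NAME="my-tpu"
VM_NAME="my-vm"
ZONE="us-central1-b"

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# List TPU nodes in the zone, including any that ctpu status missed.
run gcloud compute tpus list --zone="$ZONE"

# Delete the TPU node first, then the VM instance.
run gcloud compute tpus delete "$TPU_NAME" --zone="$ZONE"
run gcloud compute instances delete "$VM_NAME" --zone="$ZONE"
```

Printing by default is a deliberate choice here, since deleting the wrong resource is hard to undo.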

One could potentially avoid creating a VM instance in the first place by just using Google Cloud Shell, but this may not be possible in all cases due to memory/resource constraints in Cloud Shell.

Lastly, the VM instance we provision could be CPU-only or GPU-equipped, depending on the model requirements.