How to use Google Cloud TPUs?



TPU V2 (Source: Google Cloud Platform Blog)

During the summer of 2018, I participated in Deep Learning Camp Jeju 2018, a month-long deep learning research camp organized by the TensorFlow Korea Group and sponsored by Google and four other AI companies in South Korea. On top of full accommodation support and wonderful mentors, the camp provided 1,000 USD worth of Google Cloud credits with access to 5 TPUs. In the first half of the camp, my teammate and I trained our speech synthesis model on Cloud GPUs. Once we realized that training would take roughly four days on ordinary GPUs, we hurriedly moved the model to TPUs. The decision turned out to be a wise one: a single TPU was 3~4x faster than four of Nvidia’s latest Tesla V100 accelerators. So, along with sharing the results of our research on GitHub and arXiv, we decided to share our experience of using TPUs during the camp. This is a how-to guide for researchers who have gained access to Google Cloud TPUs.

Deep Learning Camp Jeju 2018 (Source: Terry’s Facebook)

What are TPUs?

“We’ve designed, built and deployed a family of Tensor Processing Units, or TPUs, to allow us to support larger and larger amounts of machine learning computation, first internally and now externally.” — Google Cloud Platform Blog

Moore’s Law is ending, said John Hennessy at Google I/O ’18. He argued that domain-specific architectures are the answer: we can “achieve higher efficiency by tailoring architecture to characteristics of domain.” In the context of deep neural network training, that domain-specific architecture is the Tensor Processing Unit, or TPU. Developed by Google, the TPU demonstrated its computational capacity through DeepMind’s AlphaGo program in its Go matches against Lee Sedol, one of the world’s top Go players.

TPU V2 Pods (Source: Google Cloud Platform Blog)

This computational prowess of TPUs comes mainly from three decisive design choices. One, TPUs drop unneeded numerical precision during training and inference. Two, TPUs perform large matrix calculations in hard-wired units, avoiding repeated memory access. Three, TPUs follow a minimal and deterministic design in which general-purpose features such as caching and branch prediction are removed. These optimized chips are deployed in TPU Pods, supercomputers built specifically for machine learning.

Nevertheless, TPUs do not guarantee high performance in every case. You should use them only when your workload is dominated by large matrix calculations; otherwise, the effort of porting your code to the TPU environment is not worth the speedup you get. Speaking of effort, using TPUs is still a tricky task: TPU support is a work in progress, and the various inconveniences and barriers send many people back to their GPU machines. The foremost barrier is converting your code. The overall structure of TPU code differs from that of GPU and CPU code, and the only practical option is to start from the example code that TensorFlow has released. If your model is not among the examples, you may have to implement structures you barely know on your own. On top of that, tracking down the cause of an error is much harder on TPUs, since the TPU workflow is hardly transparent and there are few resources on the web to turn to for help.

How to use TPUs?

Take the following steps to set up TPUs for training. Here, I will use the DCGAN example code for TPUs that TensorFlow has released.

  1. Set Google Cloud environment.
  2. Set Cloud Storage Buckets.
  3. Prepare the codes.

At the moment, using TPUs requires Google’s permission. You can check your TPU quota under [console] →[Compute Engine] →[Quotas]. Also note that each TPU device has eight cores.

Check TPU quotas (Source: Me)

1. Set Google Cloud environment.

I will assume that you have already created a GCP project and enabled billing for it. To set up the environment, you can use either the Google Cloud Shell, which is web-based and requires no further installation, or the Google Cloud SDK, which is terminal-based and can be installed here. In this post, I will go with the latter.

The first step is to create a virtual machine and connect to it via SSH. Before you do this, make sure you know your project name and the TPU zone you have access to. Now type the following into your command line.

# configure your project name and TPU zone
gcloud config set project [PROJECT_NAME]
gcloud config set compute/zone [ZONE]
# create a virtual machine
gcloud compute instances create [VM_NAME] \
--machine-type=n1-standard-2 \
--image-project=ml-images \
--image-family=tf-1-8 \
--scopes=cloud-platform

To check whether your virtual machine is successfully created, go to [console] →[Compute Engine] →[VM instances].

Check VM instances

Now connect to your VM from the terminal.

# connect to your VM
gcloud compute ssh [VM_NAME]

Unlike with Cloud GPUs, there is no need to install TensorFlow or a TPU equivalent of CUDA/cuDNN. Instead, create the TPU device from your VM’s terminal. Just as when creating the VM, you need to know your project name and TPU zone. After that configuration, you also set an IP address for your TPU. The address must follow a certain format, so keep the first two numbers as in the example: 10.240.X.X. (The create command below derives a small /29 network range from that address, 10.240.6.0/29 in this case.)

# configure your project name and TPU zone
gcloud config set project [PROJECT_NAME]
gcloud config set compute/zone [ZONE]
# set TPU's ip address
export TPU_IP=10.240.6.2
# create TPU instance
gcloud alpha compute tpus create $USER-tpu \
--range=${TPU_IP/%2/0}/29 --version=1.8 --network=default
# check TPU list
gcloud alpha compute tpus list

You can also check the TPU creation on the web console: [console] →[Compute Engine] →[TPUs]. Once the TPU has been created, you can use it from the VM right away.

2. Set Cloud Storage Buckets.

When using TPUs, you need to feed data from Cloud Storage Buckets rather than from the machine itself. This keeps the input pipeline optimized and is what sustains the TPU’s high performance. For this reason, the VM created in this tutorial has only 10 GB of storage.

You can set up Storage Buckets either from the terminal or from the web console. Here, I will introduce the latter; the former option, using gsutil (installed together with the Google Cloud SDK), is equally convenient. On the web console, go to [console] →[Storage] →[Browser]. First, create a bucket and upload your data files. Note that the location of the bucket has no relation to the TPU zone, and the price difference between locations is minute, so feel free to choose. For example, I chose [Regional] →[US-CENTRAL1].

Create buckets

Your bucket will then be referred to as ‘gs://[BUCKET_NAME]’ whenever you need its path. Treat it as a normal path: feel free to add subdirectories at the end.

The training data that you upload should be stored in the TFRecord format and read with the tf.data API; this is what keeps the input pipeline efficient enough to keep the TPU busy.
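As a concrete, purely illustrative example, the following sketch converts a small NumPy dataset into a TFRecord file that you can then copy into the bucket. The feature names, shapes, and file names are placeholders I made up for illustration, so adapt them to your own data.

import numpy as np
import tensorflow as tf

# Placeholder data: 100 flattened 28x28 "images" with integer labels.
images = np.random.rand(100, 784).astype(np.float32)
labels = np.random.randint(0, 10, size=100)

# Write one tf.train.Example per sample into a local TFRecord file.
with tf.python_io.TFRecordWriter('train.tfrecords') as writer:
    for image, label in zip(images, labels):
        example = tf.train.Example(features=tf.train.Features(feature={
            'image': tf.train.Feature(float_list=tf.train.FloatList(value=image)),
            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
        }))
        writer.write(example.SerializeToString())

# Then upload it next to the rest of your data, e.g.:
#   gsutil cp train.tfrecords gs://[BUCKET_NAME]/data/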

3. Prepare the codes.

First, get the code onto the VM: clone the TensorFlow TPU example repository (which contains the DCGAN model) with git on the VM, or copy your own code over with gcloud compute scp.

Second, the code itself has to change. In TensorFlow 1.8, TPU training is built around TPUEstimator: you write a model_fn that returns a TPUEstimatorSpec, wrap your optimizer in CrossShardOptimizer so that gradients are aggregated across the eight cores, and write an input_fn based on tf.data that reads the TFRecords from your bucket and uses the per-core batch size handed to it in params['batch_size']. The estimator is then pointed at the TPU (via its gRPC address or a TPUClusterResolver) and at a gs:// directory for checkpoints.

The resulting workflow looks like this: the graph is compiled by XLA and executed on the TPU, while the VM acts as the host that runs the input pipeline and writes checkpoints and summaries to the bucket. The train_batch_size you pass to the estimator is the global batch size, which is split evenly across the eight cores.
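To make the pattern concrete, here is a minimal sketch of a TPUEstimator training script in the style of the TF 1.8 examples. It is not the actual DCGAN code: the feature schema, the toy model, and the gs:// paths are placeholders, it assumes the TPU_IP environment variable exported earlier, and exact contrib APIs vary slightly between TensorFlow releases.

import os
import tensorflow as tf

def input_fn(params):
    # TPUEstimator passes the per-core batch size in params['batch_size'].
    batch_size = params['batch_size']

    def parse_fn(record):
        # Hypothetical schema: a flattened 28x28 image and an integer label.
        parsed = tf.parse_single_example(record, {
            'image': tf.FixedLenFeature([784], tf.float32),
            'label': tf.FixedLenFeature([], tf.int64)})
        return parsed['image'], parsed['label']

    dataset = tf.data.TFRecordDataset('gs://[BUCKET_NAME]/data/train.tfrecords')
    dataset = dataset.map(parse_fn).cache().shuffle(1024).repeat()
    # TPUs require fixed shapes, so drop the last incomplete batch.
    return dataset.apply(tf.contrib.data.batch_and_drop_remainder(batch_size))

def model_fn(features, labels, mode, params):
    logits = tf.layers.dense(features, 10)  # toy model, not the DCGAN
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    optimizer = tf.train.AdamOptimizer(1e-4)
    # CrossShardOptimizer aggregates gradients across the eight TPU cores.
    optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)

# Point the estimator at the TPU via its gRPC address (port 8470) and at the bucket.
run_config = tf.contrib.tpu.RunConfig(
    master='grpc://' + os.environ['TPU_IP'] + ':8470',
    model_dir='gs://[BUCKET_NAME]/model',
    tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100, num_shards=8))

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    config=run_config,
    use_tpu=True,
    train_batch_size=1024)  # global batch size, split across the 8 cores
estimator.train(input_fn=input_fn, max_steps=10000)

The real DCGAN example follows the same shape, with the generator and discriminator inside model_fn and the flags for the TPU name and gs:// paths documented in its README.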

4. Check out the results.

While training runs, the TPUEstimator periodically writes checkpoints and TensorBoard event files to the gs:// model directory, so your results live in the bucket rather than on the VM. You can launch TensorBoard on the VM and point its logdir at that gs:// path to watch the loss curves, and you can back up or download the checkpoints with gsutil at any time. When you are done, remember to delete the TPU and the VM; the bucket keeps everything you need, and an idle TPU still accrues charges.
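For a quick sanity check from the VM, you can list the model directory and query the latest checkpoint directly from Python; the gs:// path below is a placeholder for whatever you passed as model_dir.

import tensorflow as tf

MODEL_DIR = 'gs://[BUCKET_NAME]/model'  # hypothetical model_dir from section 3

# tf.gfile understands gs:// paths, so this works like a normal directory listing.
print(tf.gfile.ListDirectory(MODEL_DIR))
# Prints the path of the most recent checkpoint written by the estimator.
print(tf.train.latest_checkpoint(MODEL_DIR))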

Conclusion

TPUs gave us a 3~4x speedup over four Tesla V100s on our speech synthesis model, but getting there required a new workflow: quota and permissions, a VM plus a separately created TPU device, data staged in Cloud Storage as TFRecords, and code restructured around the TPU examples that TensorFlow provides. If your model is dominated by large matrix calculations and you can live with these constraints, the speedup is well worth the effort. I hope this guide saves you some of the trial and error we went through at the camp.

References

Google Cloud Platform Blog posts on Cloud TPUs (source of the quoted passage and the TPU photos above)
John Hennessy’s talk at Google I/O ’18 on the end of Moore’s Law and domain-specific architectures
TensorFlow’s Cloud TPU documentation and example models, including the DCGAN example used in this guide