Benchmarking Google’s new TPUv2

For most of us, deep learning still happens on Nvidia GPUs. There is currently no alternative with practical relevance. Google’s Tensor Processing Unit (TPU), a custom-developed chip for deep learning, promises to change that.

Nine months after the initial announcement, Google last week finally released TPUv2 to early beta users on the Google Cloud Platform. At RiseML, we got our hands on them and ran a couple of quick benchmarks. Below, we’d like to share our experience and preliminary results.

More competition in the market for deep learning hardware has long been sought after and has the potential to break up Nvidia’s monopoly. It will also shape what the deep learning infrastructure of the future looks like.

Keep in mind that TPUs are still in early beta, as Google unmistakably communicates in many places, so some of the things we discuss might change in the future.

TPUs on the Google Cloud

While the first generation of chips, TPUv1, was geared towards inference, the second and current generation focuses on speeding up training. At the core of the TPUv2, a systolic array performs the matrix multiplications that are used heavily in deep learning. According to Jeff Dean’s slides, each Cloud TPU device consists of four “TPUv2 Chips”. Each chip has 16GB of memory and two cores, each with two matrix multiplication units. Together, the two cores of a chip provide 45 TFLOPs, totalling 180 TFLOPs and 64GB of memory for the whole TPU device. To put this into perspective, the current generation of Nvidia V100 GPUs provides 125 TFLOPs and 16GB of memory.
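The device-level totals follow directly from the per-chip figures quoted above. As a quick sanity check, here is a minimal sketch of that arithmetic; the class and function names are ours and purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tpuv2Chip:
    """Per-chip figures as quoted in the text."""
    cores: int = 2
    memory_gb: int = 16
    tflops: float = 45.0  # both cores of one chip combined


def device_totals(chip: Tpuv2Chip, chips_per_device: int = 4) -> dict:
    """Scale per-chip figures up to a whole Cloud TPU device."""
    return {
        "cores": chip.cores * chips_per_device,
        "memory_gb": chip.memory_gb * chips_per_device,
        "tflops": chip.tflops * chips_per_device,
    }


totals = device_totals(Tpuv2Chip())
print(totals)
```

With four chips per device this gives 8 cores, 64GB of memory and 180 TFLOPs, matching the numbers above.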

To use TPUs on the Google Cloud Platform, you need to start a Cloud TPU (after obtaining quota to do so). There is no need (or way) to assign a Cloud TPU to a specific VM instance. Instead, discovery of the Cloud TPU from your instance happens via network. Each Cloud TPU is assigned a name and gets an IP address that you need to provide to your TensorFlow code.

Creating a new Cloud TPU. Note that a Cloud TPU has an IP address.
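Since the TPU is reached over the network, your code ultimately needs a connection string built from that IP address. The sketch below shows one way to assemble it. It assumes the TPU is addressed as a gRPC target on port 8470 and that the address is IPv4; the helper name and the example IP are made up for illustration.

```python
import ipaddress

TPU_GRPC_PORT = 8470  # assumption: the port a Cloud TPU listens on for gRPC


def tpu_master_address(ip: str, port: int = TPU_GRPC_PORT) -> str:
    """Return a gRPC target string for a Cloud TPU at the given IPv4 address."""
    ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    return f"grpc://{ip}:{port}"


# 10.240.1.2 is a made-up example address, not one from our setup.
master = tpu_master_address("10.240.1.2")
print(master)
```

Validating the address up front makes a typo fail immediately instead of surfacing later as a connection timeout.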

TPUs are currently supported only by TensorFlow version 1.6, which is available as a release candidate. Beyond that, you don’t need any drivers on your VM instance, since all of the code required for communicating with the TPU is provided by TensorFlow itself. Code that is executed on the TPU is optimized and just-in-time compiled by XLA, which is also part of TensorFlow.

To use TPUs efficiently, your code should build on the high-level Estimator abstraction. You can then drop in a TPUEstimator, which performs many of the tasks necessary for making efficient use of the TPU, e.g., it sets up data queueing to the TPU and parallelizes the computation across its cores. There is certainly a way to use TPUs without the TPUEstimator, but we are currently unaware of an example or documentation.

Once you’ve set everything up, run your TensorFlow code as usual: the TPU is discovered during start-up, and the computation graph is compiled and transferred to it. Interestingly, the TPU can also read from and write to cloud storage directly to store checkpoints or event summaries. To allow this, you need to give the service account behind the Cloud TPU write access to your cloud storage.


The interesting part is, of course, how fast TPUs really are. TensorFlow has a GitHub repository of models for TPUs that are known to work well. Below, we report on experiments with ResNet and Inception. We were also keen to see how a model that is not yet optimized for TPUs performs, so we adapted a model for text classification using LSTMs to run on TPUs. In general, Google recommends using larger models (see when to use TPUs). Ours is a smaller model, so it was especially interesting to see whether TPUs could still provide a benefit.

For all models, we compared training speed on a single Cloud TPU to a single Nvidia P100 and a single V100 GPU. We note that a thorough comparison should also include the final quality and convergence of the model, not just throughput. Our experiments are meant as a first peek, and we leave an in-depth analysis to future work.

Experiments for TPUs and P100 were run on Google Cloud Platform on n1-standard-16 instances (16 vCPUs Intel Haswell, 60 GB memory). For the V100 GPU, we used p3.2xlarge (8 vCPUs, 60 GB memory) instances on AWS. All systems were running Ubuntu 16.04. For TPUs, we installed TensorFlow 1.6.0-rc1 from the PyPI repository. GPU experiments were run with nvidia-docker using TensorFlow 1.5 images (tensorflow:1.5.0-gpu-py3) that include CUDA 9.0 and cuDNN 7.0.

TPU-optimized Models

Let’s first look at the performance of models that are officially optimized for TPUs. Below, you can see the performance in terms of images per second.

Batch sizes were 1024 for TPU and 128 for GPUs. For GPUs, we used the implementations from the TensorFlow benchmarks repository. Training data was the fake ImageNet dataset provided by Google, stored on cloud storage (for TPUs) and on local disks (for GPUs).
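The two batch sizes are less different than they look. Assuming the TPU batch is split evenly across the eight cores of a Cloud TPU (our understanding of how the batch is sharded, not something we verified in detail), each core processes the same number of examples per step as a single GPU. A minimal sketch of that calculation, with an illustrative helper name:

```python
def per_core_batch(global_batch: int, num_cores: int) -> int:
    """Size of each core's shard under an even data-parallel split (assumed)."""
    if num_cores <= 0:
        raise ValueError("num_cores must be positive")
    if global_batch % num_cores != 0:
        raise ValueError("global batch must be divisible by the number of cores")
    return global_batch // num_cores


tpu_per_core = per_core_batch(global_batch=1024, num_cores=8)
print(tpu_per_core)
```

Under that assumption, a TPU batch of 1024 corresponds to 128 examples per core, the same as the per-GPU batch size.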

On ResNet-50, a single Cloud TPU (containing 8 cores and 64GB of RAM) is ~8.4 times faster than a single P100 and ~5.1 times faster than a V100. For InceptionV3, the speedup is almost the same (~8.4 and ~4.8 times, respectively). With lower precision (fp16), the V100 gains a lot of speed.
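The two speedup figures per model also imply how the GPUs compare to each other: dividing the TPU-over-P100 factor by the TPU-over-V100 factor gives the V100-over-P100 factor. The sketch below does that with the rounded numbers from the paragraph above, so treat the results as rough; the function name is ours.

```python
def implied_ratio(tpu_over_slower: float, tpu_over_faster: float) -> float:
    """Speed of the faster device relative to the slower one."""
    return tpu_over_slower / tpu_over_faster


# (TPU vs. P100, TPU vs. V100) speedups as quoted in the text
speedups = {
    "ResNet-50": (8.4, 5.1),
    "InceptionV3": (8.4, 4.8),
}

v100_over_p100 = {
    model: round(implied_ratio(vs_p100, vs_v100), 2)
    for model, (vs_p100, vs_v100) in speedups.items()
}
print(v100_over_p100)
```

This suggests the V100 was roughly 1.65 to 1.75 times faster than the P100 in these runs.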

Clearly, beyond just speed, one has to take price into account. The table shows the performance normalized for on-demand pricing with per-second billing. The TPU still comes out clearly ahead.
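One plausible way to do such a normalization is to convert throughput into images processed per dollar. The sketch below assumes exactly that formula (throughput times seconds per hour, divided by the hourly price). The numbers in the example are invented placeholders, not our measurements or actual cloud prices.

```python
SECONDS_PER_HOUR = 3600.0


def images_per_dollar(images_per_second: float, price_per_hour: float) -> float:
    """Images processed per dollar, assuming per-second billing at an hourly rate."""
    if price_per_hour <= 0:
        raise ValueError("price_per_hour must be positive")
    return images_per_second * SECONDS_PER_HOUR / price_per_hour


# Placeholder values for illustration only: 1000 images/s at $2.00 per hour.
example = images_per_dollar(images_per_second=1000.0, price_per_hour=2.0)
print(example)
```

With per-second billing there is no rounding up to full hours, so the ratio is independent of how long the job runs.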

Custom LSTM Model

Our custom model is a bi-directional LSTM for text classification with 1024 hidden units. LSTMs are a basic building block in NLP nowadays, so this contrasts nicely with the official models, which are all computer vision based.

The original code was already using the Estimator framework, so adapting it to use TPUEstimator was very straightforward. There is one big disclaimer though: on TPUs we couldn’t get the model to converge whereas the same model (batch size, etc.) on GPUs worked fine. We think this is due to a bug that will be fixed — either in our code (if you find one, please let us know!) or in TensorFlow.

It turns out that the TPU is even faster on the LSTM model: ~12.9 times faster than a P100 and ~7.7 times faster than a V100! Given that the model is comparatively small and was not tuned in any way, this is a very promising speedup. As long as the bug is not fixed, please consider the results preliminary.


On the models we tested, TPUs compare very well, both in performance and economically, to the latest generations of GPUs. This stands in contrast to previous reports. While Google promotes TPUs as being optimal for scaling larger models, our preliminary results for a small model also look very promising. Overall, the experience of using TPUs and adapting TensorFlow code is already pretty good for a beta.

We think that once TPUs are available to a larger audience, they could become a real alternative to Nvidia GPUs.

Benchmarking Google’s new TPUv2 was originally published in RiseML Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: Deep Learning on Medium