Source: Deep Learning on Medium
What’s inside the Google Coral Edge TPU? Speed Test & Teardown
Earlier this year Google finally released TPU hardware that you can own, via their Coral brand. However, these are not the beefcake cloud TPUs that train networks like BigGAN at 100+ petaflop/s for a week, or even the cheapest 180 TFLOP/s v2 TPU that you can rent for $4.50/hour on demand. These are TPU devices meant to work “at the edge,” that is, to deliver deep learning solutions in the field on small devices without internet. So how good are the TPUs you can own?
Every TPU product you can currently buy is specced at 4 TOPS. The unit is plain operations rather than the usual 32-bit floating-point operations because the TensorFlow Lite that runs on these devices uses 8-bit fixed point. You also won’t find memory bandwidth like on cloud TPUs or GPUs at 600+ GB/s. Edge TPUs are connected via USB 3.0 or a single mPCIe lane (gen 2), so 640 or 500 MB/s. Of course, since there is only 8MB of SRAM on the Edge TPU, this means at most 16ms is spent transferring a model to the device, and for the model used in this post it took just 10ms. You won’t miss the memory because these devices are not for from-scratch training. But you can train the last layer of a model on this device! The main use case of Edge TPUs is inference, which is what I will be benchmarking.
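The 16ms figure is just size divided by bandwidth. Here is a quick sanity check of that arithmetic, using only the numbers quoted above; the function name is mine, and real transfers add protocol overhead that this ignores.

```python
# Upper bound on model transfer time: model size divided by link bandwidth.
# All figures are the ones quoted in the post; protocol overhead is ignored.
SRAM_MB = 8.0  # on-chip SRAM, so no model held on the device can be larger

LINKS_MB_PER_S = {
    "USB 3.0": 640.0,
    "mPCIe gen 2, one lane": 500.0,
}


def transfer_ms(size_mb, bandwidth_mb_per_s):
    """Milliseconds needed to move size_mb over a link of the given speed."""
    return size_mb / bandwidth_mb_per_s * 1000.0


for name, bandwidth in LINKS_MB_PER_S.items():
    print(f"{name}: {transfer_ms(SRAM_MB, bandwidth):.1f} ms")
```

The slower mPCIe link gives the 16ms worst case, and USB 3.0 comes out at 12.5ms for a full 8MB.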
The most self-contained device, the USB 3.0 “Coral Edge TPU”, came out in March of this year, and this is what I got to play around with. It’s one of Coral’s prototyping products, though on paper it has exactly the same performance as their mPCIe production boards. USB 3.0 is easy, but if I didn’t want to deal with the external dongle, the M.2 Accelerator (A+E key) could replace my wifi card internally.
So what’s inside the $75 30mm x 65 mm TPU Edge Accelerator? Could it be the $35 30mm x 22mm mPCIe Accelerator with a USB adapter and a heat sink? Can we open the plastic + metal snap safely?
The plastic shell can be pried off without too much struggle, thanks to the heat sink being very stiff. And the heat sink can be removed by undoing 4 screws.
The answer to the first question is no, not literally, but the boards are very similar.
The two chips in contact with the heat sink are probably the TPU and the memory, with the larger one being the TPU. After I put the Edge TPU back together I did a little speed testing of my own, though Google’s benchmarks claim a speedup of between 10x and 20x over a CPU.
The Coral detection example can be run on CPU and TPU alike. The computer I’m attaching the TPU to is my Acer Chromebook 11 running GalliumOS 2.1, which is close enough to Debian that there are no problems installing TensorFlow Lite and the Edge TPU runtime. The MobileNetV2 SSDLite detection models are less than 7MB and operate on 300×300 images. On the TPU a single image runs in 20ms (+10ms model copy time on the first iteration). On my laptop’s Intel Celeron 2.16GHz CPU from 2014 a single image takes 1500ms. On another PC with an Intel Xeon 2.5GHz CPU (which cpubenchmark.net predicts is 15x faster) a single image takes 130ms (at 210 watts!). The best CPU available would be 3x faster than this according to cpubenchmark.net: still slower than the Edge TPU, which is much cheaper in $$$ and power!
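To put those timings side by side, here is the speedup arithmetic spelled out. The measured values are the ones above; the "best CPU" entry is only the 3x projection from cpubenchmark.net, not something I ran.

```python
# Per-image latencies in milliseconds, as measured in this post.
TPU_MS = 20.0
CELERON_MS = 1500.0
XEON_MS = 130.0
# Projection only: the fastest listed CPU is said to be about 3x the Xeon.
BEST_CPU_MS = XEON_MS / 3.0


def speedup(baseline_ms, accelerated_ms):
    """How many times faster the accelerated run is than the baseline."""
    return baseline_ms / accelerated_ms


print(f"TPU vs Celeron:  {speedup(CELERON_MS, TPU_MS):.1f}x")
print(f"TPU vs Xeon:     {speedup(XEON_MS, TPU_MS):.1f}x")
print(f"TPU vs best CPU: {speedup(BEST_CPU_MS, TPU_MS):.1f}x (projected)")
print(f"Xeon vs Celeron: {speedup(CELERON_MS, XEON_MS):.1f}x measured, 15x predicted")
```

So the dongle is 75x faster than the Celeron it is plugged into, 6.5x faster than the Xeon, and would still be roughly 2x ahead of the projected best CPU.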
A little more exciting, I think, is detection on live streaming camera data on my laptop. On the Celeron, this example runs at less than 1 frame per second. On the TPU, it runs at 20 frames per second: realtime! And only a quarter of the time is spent on inference on the TPU; the rest is resizing images and drawing on the output, which must be done on the CPU. Though I did install the max operating frequency TPU runtime, I’m not near the operating limit of the TPU, so heat is no problem: the TPU’s heat sink doesn’t burn to the touch.
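The frame budget behind that claim works out as follows. This is a rough sketch using the 20 frames per second and the "quarter of the time" estimate from above; the split is my eyeballed figure, not a profiler trace.

```python
# Per-frame time budget for the live camera demo.
FPS = 20.0
INFERENCE_FRACTION = 0.25  # rough share of each frame spent on the TPU

frame_ms = 1000.0 / FPS
inference_ms = frame_ms * INFERENCE_FRACTION
cpu_ms = frame_ms - inference_ms  # resizing images and drawing the output

print(f"frame budget: {frame_ms:.1f} ms")
print(f"TPU inference: {inference_ms:.1f} ms")
print(f"CPU resize and draw: {cpu_ms:.1f} ms")
```

In other words, of the 50ms available per frame, about 37.5ms goes to the Celeron doing image plumbing, so the CPU is the bottleneck here and not the TPU.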
These examples were easy to use and tweak after just a few minutes of installation on Linux. All you need to get going with this device is to apt-get install the TPU runtime and pip install TensorFlow Lite, and then you can jump straight into the examples. The models in these examples were pre-compiled. A truer speed test would be to train a model from scratch, execute that on the GPU and CPU, then convert it to TensorFlow Lite, and execute that on CPU and TPU.
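If you want to repeat the CPU versus TPU timing yourself, a small harness along these lines should do. This is a minimal sketch and not the Coral example code: benchmark and make_tflite_invoke are names I made up, the model path is whatever you supply, and the second function assumes the tflite_runtime package and the Edge TPU runtime library (libedgetpu.so.1 on Linux) are installed. For the TPU you also need the Edge TPU compiled version of the model.

```python
import statistics
import time


def benchmark(invoke, warmup=1, runs=20):
    """Return the median wall-clock time of invoke() in milliseconds."""
    for _ in range(warmup):
        invoke()  # the first call may include the one-off model copy to the device
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        invoke()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def make_tflite_invoke(model_path, use_tpu):
    """Build an invoke() callable for a .tflite model (hypothetical helper).

    Imports lazily so that the harness above works without TensorFlow Lite.
    """
    from tflite_runtime.interpreter import Interpreter, load_delegate

    delegates = [load_delegate("libedgetpu.so.1")] if use_tpu else None
    interpreter = Interpreter(model_path=model_path,
                              experimental_delegates=delegates)
    interpreter.allocate_tensors()
    # Times a bare forward pass on whatever the input tensor currently holds.
    return interpreter.invoke


if __name__ == "__main__":
    # Stand-in workload so the harness can be tried without any hardware.
    print(f"{benchmark(lambda: sum(range(100000))):.3f} ms")
```

Taking the median after a warm-up run keeps the one-time model copy out of the steady-state number, which is why I quoted that 10ms separately.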
Overall I’m excited about this device. I think it is amazing that a 300×300 object detection network runs on a netbook from 2014 in realtime with a $75 modification ($35 if I got the M.2 board). Sure, the Edge TPU won’t replace any GPUs for training, or even for model evaluation. But I see a lot of potential in the 10 USD/TOP, 0.5 watt/TOP board. This is at least 10x cheaper per OP than a GPU! And much less power hungry. For traveling demos this would be great. As long as the operations you’re interested in are among those the Edge TPU supports, it could be a useful chip to program for when low power and lightweight systems are a priority.
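For the curious, here is where the per-TOP numbers come from, plus a back-of-the-envelope energy comparison with the Xeon. The 2 watt figure is just 0.5 watt/TOP times 4 TOPS, which is a peak number and an assumption on my part about the draw during this test; I did not measure power.

```python
# Prices, throughput and latencies quoted in this post.
TOPS = 4.0
USB_PRICE_USD = 75.0
M2_PRICE_USD = 35.0
WATT_PER_TOP = 0.5

usb_usd_per_top = USB_PRICE_USD / TOPS
m2_usd_per_top = M2_PRICE_USD / TOPS
tpu_watts = WATT_PER_TOP * TOPS  # assumed peak draw, not measured


def joules_per_image(watts, latency_ms):
    """Energy for one inference, assuming constant power draw."""
    return watts * latency_ms / 1000.0


tpu_joules = joules_per_image(tpu_watts, 20.0)
xeon_joules = joules_per_image(210.0, 130.0)

print(f"USB: {usb_usd_per_top:.2f} USD/TOP, M.2: {m2_usd_per_top:.2f} USD/TOP")
print(f"TPU: {tpu_joules:.2f} J/image, Xeon: {xeon_joules:.1f} J/image")
```

The roughly 10 USD/TOP figure matches the $35 board (8.75 USD/TOP); the USB version works out to 18.75 USD/TOP. Under these assumptions the Xeon spends several hundred times more energy per image.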