Google Coral Edge TPU vs NVIDIA Jetson Nano: A quick deep dive into EdgeAI performance

Source: Deep Learning on Medium


By Sam Sterckval

Recently I’ve been reading, testing, and writing a bit about edge computing (like here, and here), with the main focus on edge AI. With cool new hardware hitting the shelves recently, I was eager to compare the performance of the new platforms, and even test them against high-performance systems.

The Hardware

The main devices I’m interested in are the new NVIDIA Jetson Nano (128 CUDA cores) and the Google Coral Edge TPU (USB Accelerator). I will also be testing an i7-7700K + GTX1080 (2560 CUDA cores), a Raspberry Pi 3B+, and my own old workhorse, a 2014 MacBook Pro containing an i7-4870HQ (without CUDA-enabled cores).

The Software

I will be using MobileNetV2 as a classifier, pretrained on the ImageNet dataset. I use this model straight from Keras, with the TensorFlow backend: the floating-point weights for the GPUs, and an 8-bit quantised tflite version for the CPUs and the Coral Edge TPU. (If it is unclear to you why I don’t use an 8-bit model for the GPUs, keep on reading, I will talk about this.) First, the model and an image of a magpie are loaded. I then execute 1 prediction as a warm-up (because I noticed the first prediction was always a lot slower than all the next ones). I let it sleep for 1 s, so that all threads are certainly finished. Then the script goes for it, and does 250 classifications of that same image. By using the same image for all classifications, we ensure that it stays close to the data bus throughout the test. After all, we are interested in inference speeds, not the ability to load random data faster.
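The procedure above can be sketched as a small timing harness. This is only an illustration of the warm-up, sleep, and 250-run loop, not the actual script used for these tests. The names `benchmark` and `dummy_predict` are hypothetical, and `dummy_predict` is a stand-in for whatever call runs the real model on each platform.

```python
import time


def benchmark(predict, sample, runs=250, warmup_sleep_s=1.0):
    """Time `runs` back-to-back calls of predict(sample) after one warm-up call."""
    predict(sample)              # warm-up: the first prediction is much slower
    time.sleep(warmup_sleep_s)   # give background threads time to finish
    start = time.perf_counter()
    for _ in range(runs):
        predict(sample)          # same input every time, so data loading is not measured
    elapsed = time.perf_counter() - start
    return {"runs": runs, "total_s": elapsed, "fps": runs / elapsed}


def dummy_predict(image):
    """Hypothetical stand-in for the real classifier call."""
    return sum(image) % 1000


if __name__ == "__main__":
    result = benchmark(dummy_predict, list(range(10000)))
    print(f"{result['runs']} runs in {result['total_s']:.3f} s -> {result['fps']:.1f} FPS")
```

Swapping `dummy_predict` for the platform-specific inference call keeps the timing logic identical across devices, which is what makes the numbers comparable.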

Magpie. If you’re from GB, don’t forget to salute!

Straight to the point

Nobody likes waiting, and let’s be honest, most of you will mainly be interested in the results, so here we go:

The scoring with the quantized tflite model for CPU was different, but it always seemed to return the same prediction as the others, so I guess that’s something weird in the model, and I’m pretty sure it doesn’t affect performance.

Now, because the results are so different for different platforms, it’s kind of hard to visualise, so here are a few graphs, choose your favourite…

Linear scale, FPS
Logarithmic scale, FPS
Linear scale, inference time (250x)

Analysis

Straight away, there are 3 bars in the first graph that jump into view. (Yes, the first graph, linear scale FPS, is my favourite, because it shows the difference in the high-performance results.) Of these 3 bars, 2 were achieved by the Google Coral Edge TPU USB Accelerator, and the 3rd one was a full-blown NVIDIA GTX1080 assisted by an Intel i7-7700K. Look a bit closer, and you’ll see the GTX1080 actually got beaten by the Coral. Let that sink in for a few seconds, and then prepare to be blown away, because that GTX1080 draws a maximum of 180 W, which is absolutely HUGE compared to the Coral’s 2.5 W.


You managed to stand up again already? Ok, let’s go on:

Next thing we see is that the NVIDIA Jetson Nano isn’t scoring well at all. Although it has a CUDA-enabled GPU, it’s really not much faster than my old i7-4870HQ. But that’s the catch: ‘not much faster’ still means faster than a 50 W, quad-core, hyper-threading CPU. From a few years back, true, but still. The Jetson Nano could never have consumed more than a short-term average of 12.5 W, because that’s what I’m powering it with. That’s a 75% power reduction, with a 10% performance increase.
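For anyone who wants to check the arithmetic, here is a small sketch using only the power figures quoted in this article. These are maximum or supply-limited figures, not measured consumption, so the results are rough bounds. The function names are made up for this example.

```python
def power_reduction(baseline_w, new_w):
    """Fractional reduction in power draw when going from baseline_w to new_w."""
    return (baseline_w - new_w) / baseline_w


def perf_per_watt_gain(speedup, baseline_w, new_w):
    """How much performance per watt improves for a given speedup and power change."""
    return speedup / (new_w / baseline_w)


# Jetson Nano (12.5 W supply limit) vs. i7-4870HQ (50 W), roughly 10% faster
jetson_reduction = power_reduction(50.0, 12.5)
jetson_gain = perf_per_watt_gain(1.10, 50.0, 12.5)

# GTX1080 (180 W maximum) vs. Coral USB Accelerator (2.5 W)
gtx_to_coral_power_ratio = 180.0 / 2.5

print(f"Jetson Nano power reduction: {jetson_reduction:.0%}")
print(f"Jetson Nano performance per watt gain: {jetson_gain:.1f}x")
print(f"GTX1080 draws up to {gtx_to_coral_power_ratio:.0f}x the Coral's power")
```

So a 10% speedup at a quarter of the power works out to about 4.4 times the performance per watt, under the assumption that both devices actually run at those power figures.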

Clearly, the Raspberry Pi on its own isn’t anything impressive, not with the floating-point model, and still not really anything useful with the quantised model. But hey, I had the files ready anyway, and it was capable of running the tests, so more is always better, right? And it is still kind of interesting, because it shows the difference between the ARM Cortex-A53 in the Pi and the A57 in the Jetson Nano.

Source:NVIDIA

NVIDIA Jetson Nano

So the Jetson Nano isn’t pumping out impressive FPS rates with the MobileNetV2 classifier, but as I already stated, that doesn’t mean it isn’t a great piece of useful engineering. It’s cheap, it doesn’t need a shitload of energy to run, and maybe the most important property is that it runs TensorFlow-gpu (or any other ML platform) like any other machine you’ve always been using before. As long as your script isn’t diving too deep into CPU architectures, you can run the exact same script you would on an i7+CUDA GPU, also for training! I do still feel like NVIDIA should preload L4T with TensorFlow, but I’ll try not to rage about this any longer. After all, they have a nice explanation on how to install it (don’t be fooled though, TensorFlow 1.12 is not supported, only 1.13.1).

Coral USB Accelerator

Google Coral Edge TPU

OK, I have a big love for nicely engineered, highly efficient, purpose-built electronic devices, so I’m maybe not perfectly objective. But this thing… It’s a thing of absolute beauty!

Penny for scale, source:Google

The Edge TPU is what we call an “ASIC” (Application-Specific Integrated Circuit), which means that it has a combination of small electronic parts such as FETs and capacitors etched directly into the silicon, in such a way that it does exactly what it needs to do to speed up inference.

Inference, yes: the Edge TPU is not able to perform backpropagation, so it cannot be used for training.

The logic behind this sounds more complex than it is though. (Actually creating the hardware, and making it work, is a whole different thing, and is very, very complex. But the logic functions are much simpler.) The next image shows the basic principle around which the Edge TPU has been designed.

Source:Google

A net like MobileNetV2 consists mostly of convolutions followed by activation layers. A convolution is stated as:

Convolution

Which means nothing more than sliding the kernel over the image, multiplying each element (pixel) of the image patch under it with the corresponding element of the kernel, and then adding these results up to create one pixel of a new ‘image’ (feature map). That is exactly what the main component of the Edge TPU was meant for: multiplying everything at the same time, then adding it all up at insane speeds. There is no ‘CPU’ behind this, it just does that whenever you pump data into the buffers on the left. If you’re really interested in how this works, look up “Digital Circuit” and “FPGA”, and you’ll probably find enough information to keep you busy for the next few months. It is sometimes rather complex to start with, but really, really interesting!
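To make the multiply-and-add idea concrete, here is a deliberately naive sketch in plain Python. It is not how the Edge TPU or any real framework implements it (those do all the multiplications in parallel), and like most neural network libraries it does not flip the kernel, so strictly speaking it is a cross-correlation. The function name `conv2d_valid` and the tiny example image are made up for illustration.

```python
def conv2d_valid(image, kernel):
    """Naive 2D convolution without padding: slide the kernel, multiply, add up."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    feature_map = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            acc = 0
            # Multiply-accumulate over the patch currently under the kernel
            for i in range(kh):
                for j in range(kw):
                    acc += image[y + i][x + j] * kernel[i][j]
            row.append(acc)
        feature_map.append(row)
    return feature_map


image = [
    [1, 2, 3, 0],
    [4, 5, 6, 0],
    [7, 8, 9, 0],
]
kernel = [
    [1, 0],
    [0, 1],
]
print(conv2d_valid(image, kernel))  # [[6, 8, 3], [12, 14, 6]]
```

Every output pixel here costs four multiplications and additions done one after another. The hardware described above performs that whole inner loop, for many pixels, at once.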

But this is exactly why the Coral is in such a different league when comparing performance/Watt numbers: it is a bunch of electronics designed to do exactly the bitwise operations needed, with basically no overhead at all.

Internal schematic of a Google Cloud TPU — Source:Google

Why no 8-bit model for GPU?

A GPU is inherently designed as a fine-grained parallel float calculator, so using floats is exactly what it was created for, and what it is good at. The Edge TPU has been designed to do 8-bit stuff, and CPUs have clever ways of being faster with 8-bit stuff than with full bit-width floats, because they have to deal with this in a lot of cases.
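If “8-bit stuff” sounds abstract, the sketch below shows the general idea of affine 8-bit quantisation, where a float is approximated as scale × (q − zero_point) with q an integer between 0 and 255. This is the scheme TensorFlow Lite describes for its quantised models, but the helper names and the example range of -1.0 to 3.0 are assumptions for illustration, not values taken from the MobileNetV2 model used here.

```python
def quant_params(x_min, x_max, qmin=0, qmax=255):
    """Pick scale and zero point so that [x_min, x_max] maps onto [qmin, qmax]."""
    x_min = min(x_min, 0.0)  # the range must contain 0 so that 0.0 is exactly representable
    x_max = max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Float -> 8-bit integer, clamped to the representable range."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """8-bit integer -> approximate float."""
    return scale * (q - zero_point)


scale, zero_point = quant_params(-1.0, 3.0)  # assumed example range
for x in (-1.0, 0.0, 0.1, 1.5, 3.0):
    q = quantize(x, scale, zero_point)
    print(f"{x:+.3f} -> q={q:3d} -> {dequantize(q, scale, zero_point):+.4f}")
```

Each value is stored in one byte instead of four, at the cost of a rounding error of at most half a scale step for values inside the range. Integer multiply-and-add on those bytes is exactly what the Edge TPU is built for.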

Why MobileNetV2?

I could give you a lot of reasons why MobileNetV2 is a good model, but the main reason is that it’s one of the pre-compiled models that Google made available for the Edge TPU.

What else is available on the Edge TPU?

It used to be just MobileNet and Inception in their different versions, but as of the end of last week, Google pushed an update which allows us to compile custom TensorFlow Lite models. But the limit is, and will probably always be, TensorFlow Lite models. That is different with the Jetson Nano: that thing runs anything you can imagine.

Raspberry Pi + Coral vs the rest

Why does the Coral seem so much slower when connected to a Raspberry Pi? The answer is simple and straightforward: the Raspberry Pi has only USB 2.0 ports, while the rest have USB 3.0 ports. And since we can see the i7-7700K is faster with the Coral than the Jetson Nano is, but still doesn’t seem to score as well as the Coral Dev Board did when NVIDIA tested it, we can conclude the bottleneck is data rate, and not the Edge TPU.
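A back-of-the-envelope calculation gives a feel for the USB difference. The sketch below assumes one 224×224 RGB image with one byte per channel per inference (the usual MobileNetV2 input size, not something measured here) and uses the nominal signalling rates of USB 2.0 High Speed (480 Mbit/s) and USB 3.0 SuperSpeed (5 Gbit/s). Real throughput is lower and per-transfer latency is ignored, so treat these as best-case numbers.

```python
USB2_BITS_PER_S = 480e6  # USB 2.0 High Speed, nominal signalling rate
USB3_BITS_PER_S = 5e9    # USB 3.0 SuperSpeed, nominal signalling rate


def transfer_time_ms(num_bytes, bits_per_s):
    """Best-case time to push num_bytes over a link, in milliseconds."""
    return num_bytes * 8 / bits_per_s * 1000.0


input_bytes = 224 * 224 * 3  # assumed input: 224x224 RGB, 1 byte per channel

usb2_ms = transfer_time_ms(input_bytes, USB2_BITS_PER_S)
usb3_ms = transfer_time_ms(input_bytes, USB3_BITS_PER_S)

print(f"USB 2.0: {usb2_ms:.2f} ms per image at best")
print(f"USB 3.0: {usb3_ms:.2f} ms per image at best")
```

Under these assumptions, just moving the input image takes about 2.5 ms over USB 2.0 versus about 0.24 ms over USB 3.0, before the Edge TPU has done any work at all.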

Source:NVIDIA

Fading away

Ok, I’m the last one left in the office by now, I think this has been long enough for me, and probably for you as well. I have been absolutely blown away by the power of the Google Coral Edge TPU. But to me, the most interesting setup here was the NVIDIA Jetson Nano in combination with the Coral USB Accelerator. I will most certainly use that setup, it feels like a dream to work with.

I hope you had an interesting read. If there are any remarks or questions, do not hesitate to contact me. As usual, this is also where I tell you I will probably write something new soon, so yeah, keep your eyes open and all that. Cheers!