The 5-petaflop Nvidia DGX A100 hopes to run your AI workloads


While HGX will do it for the cloud

By Sebastian Moss, 14 May 2020

What do you get if you take eight of Nvidia’s new A100 GPUs, two 64-core AMD Rome CPUs, six NVSwitches, 15TB of Gen 4 NVMe SSD storage, nine Mellanox 200Gbps network interfaces, and package them all together?

Well, a bill for $199,000 – but also a lot of AI performance. Nvidia’s latest DGX reference architecture is the company’s preferred approach to shipping its highest-performance chips.

The DGX A100, as the most recent iteration is named, is capable of five petaflops of FP16 performance, 2.5 petaflops at TF32, and 156 teraflops at FP64. It also hits 10 petaops (ops, not flops) at INT8.
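Those headline figures line up with eight times Nvidia’s published per-GPU peaks for the A100. A quick back-of-the-envelope sketch (note the tensor-core numbers assume Nvidia’s 2x structured-sparsity speedup, something the spec sheet states but this article doesn’t spell out):

```python
# Back-of-the-envelope check: DGX A100 totals from Nvidia's per-GPU A100 peaks.
# The FP16, TF32, and INT8 figures include the 2x structured-sparsity speedup
# Nvidia quotes; FP64 is the tensor-core peak without sparsity.
A100_PEAKS = {
    "FP16 tensor (sparse)": 624,   # TFLOPS per GPU
    "TF32 tensor (sparse)": 312,   # TFLOPS per GPU
    "FP64 tensor": 19.5,           # TFLOPS per GPU
    "INT8 tensor (sparse)": 1248,  # TOPS per GPU
}
GPUS_PER_DGX = 8

for fmt, per_gpu in A100_PEAKS.items():
    print(f"{fmt}: {per_gpu * GPUS_PER_DGX:g} tera(fl)ops per DGX A100")

# FP16: 8 x 624  = 4,992 TFLOPS  ~ 5 petaflops
# TF32: 8 x 312  = 2,496 TFLOPS  ~ 2.5 petaflops
# FP64: 8 x 19.5 =   156 TFLOPS
# INT8: 8 x 1248 = 9,984 TOPS    ~ 10 petaops
```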

[Image: A DGX A100 being installed. © Argonne/Nvidia]

AI ready

“Nvidia DGX A100 is the ultimate instrument for advancing AI,” Jensen Huang, the ebullient company CEO, said as he unveiled the product during Nvidia’s now-virtual GTC.

“Nvidia DGX is the first AI system built for the end-to-end machine learning workflow – from data analytics to training to inference. And with the giant performance leap of the new DGX, machine learning engineers can stay ahead of the exponentially growing size of AI models and data.”

Among the first customers of the DGX, which has 320GB of GPU memory for training on large AI datasets, is Argonne National Laboratory. Rick Stevens, associate laboratory director at the Department of Energy facility, said that the system would be used “in the fight against COVID-19.”

He added: “The compute power of the new DGX A100 systems coming to Argonne will help researchers explore treatments and vaccines and study the spread of the virus, enabling scientists to do years’ worth of AI-accelerated work in months or days.”

Nvidia has also released a version of the DGX on steroids: the DGX SuperPOD reference architecture. It’s 140 DGX A100 systems all clustered together, capable of 700 petaflops of ‘AI computing power.’

So far, the SuperPOD has just one customer: Nvidia. The company plans to install four of the pods as part of its internal Saturn V supercomputer, adding 2.8 exaflops of AI computing power, for a total of 4.6 exaflops. 
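The cluster arithmetic follows directly from the numbers quoted here; a minimal sketch (the roughly 1.8 exaflops of pre-existing Saturn V capacity is implied by the article’s totals rather than stated):

```python
# Cluster arithmetic for the SuperPOD and Saturn V figures quoted above.
DGX_PFLOPS = 5         # FP16 petaflops per DGX A100
SYSTEMS_PER_POD = 140  # DGX A100 systems per SuperPOD
PODS_ADDED = 4         # SuperPODs Nvidia plans to add to Saturn V
TOTAL_EFLOPS = 4.6     # quoted Saturn V total after the upgrade

pod_pflops = DGX_PFLOPS * SYSTEMS_PER_POD      # 700 petaflops per SuperPOD
added_eflops = PODS_ADDED * pod_pflops / 1000  # 2.8 exaflops added
existing_eflops = TOTAL_EFLOPS - added_eflops  # ~1.8 exaflops already installed

print(f"{pod_pflops} PF per pod; +{added_eflops:.1f} EF added; "
      f"~{existing_eflops:.1f} EF pre-existing")
```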

For cloud computing companies like Amazon Web Services, Google, and Microsoft Azure, there’s a slightly smaller option: the HGX A100, which features four A100s instead of the DGX’s eight.

Moving further down the power scale is the EGX A100, with just one GPU and a Mellanox ConnectX-6 SmartNIC, targeting the edge market.