Source: Deep Learning on Medium
NVIDIA RTX 20 Series: A Deep Dive into the Next-Gen GeForce Architecture with Ray Tracing and AI
Ray tracing, marketed as RTX, is one of the core features of NVIDIA’s RTX 20 series, which is built on the Turing GPU architecture. Although there have been other improvements to the graphics core, the addition of the new RT Cores alongside the Tensor Cores is what really stands out.
For the first time, NVIDIA has enabled real-time ray tracing in PC gaming, at least for lighting and reflections. Let’s find out how this was achieved and look at the other core features of the Turing architecture.
The TU102: Exploring the Turing Flagship
When a new generation is planned, there are usually just 2–3 main GPUs in the pipeline. Each has a codename, and their various cut-down versions form the consumer lineup, with the full die usually reserved for the flagship model. For the Pascal family (GTX 10 series), this was the GP102, powering the GTX 1080 Ti and the Titan X/Xp. The top-end GPU of the Turing lineup is the TU102. Here are its specs:
- CUDA Cores: 4,608
- RT Cores: 72
- Tensor Cores: 576
- Texture Units: 288
- Memory Config: 384-bit (32-bit x 12)
- ROPs: 96
- L2 Cache: 6144 KB (512KB x 12)
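The figures in this list are internally consistent, and a quick check makes the per-unit layout visible. The sketch below is only an illustration: it assumes one RT core per SM (so the SM count equals the RT core count) and takes the twelve memory controllers from the "x 12" entries above. The names `TU102`, `per_unit`, `per_sm` and `per_controller` are made up for this example.

```python
# Full TU102 figures as listed above.
TU102 = {
    "cuda_cores": 4608,
    "rt_cores": 72,
    "tensor_cores": 576,
    "texture_units": 288,
    "rops": 96,
    "l2_cache_kb": 6144,
    "memory_bus_bits": 384,
}

# Assumption: one RT core per SM, so the SM count equals the RT core count.
SM_COUNT = TU102["rt_cores"]
# From the list: twelve 32-bit memory controllers.
MEMORY_CONTROLLERS = 12


def per_unit(total: int, units: int) -> int:
    """Split a die-wide total evenly across units; fail loudly if it does not divide."""
    if total % units != 0:
        raise ValueError(f"{total} does not divide evenly across {units} units")
    return total // units


# Resources that scale with the number of SMs.
per_sm = {
    key: per_unit(TU102[key], SM_COUNT)
    for key in ("cuda_cores", "tensor_cores", "texture_units")
}

# Resources that scale with the number of memory controllers.
per_controller = {
    key: per_unit(TU102[key], MEMORY_CONTROLLERS)
    for key in ("rops", "l2_cache_kb", "memory_bus_bits")
}

if __name__ == "__main__":
    print("Per SM:", per_sm)
    print("Per memory controller:", per_controller)
```

Under these assumptions, every total divides evenly: the shader-side resources scale with the SM count, while the ROPs, L2 cache and bus width scale with the memory controllers.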
Like previous generations, it’s the Titan that includes the full-scale x102 die. The x80 Ti is a cut-down variant that sells for a lower price and targets enthusiast gamers. In this case, that would be the RTX 2080 Ti. Although it’s not exactly cheap at $1,000, it doesn’t feature the entire TU102 die. That honor goes to the Titan RTX.
The 2080 Ti loses four SMs along with the CUDA cores, Tensor Cores and RT Cores they contain. One memory controller is also disabled, along with eight render output units (ROPs). You can compare the cut-down TU102 die used in the Ti to the Quadro RTX 6000 in the specs tables above and below.
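To see what those cuts add up to, here is a rough sketch that scales the full-die figures down by four SMs and one memory controller. The per-SM and per-controller ratios are derived from the TU102 list (for example 4,608 / 72 = 64 CUDA cores per SM) and are assumed to hold for the cut-down part; `die_config` is a hypothetical helper, not anything from NVIDIA.

```python
# Per-unit ratios implied by the full TU102 list (4,608 / 72, 576 / 72, and so on).
PER_SM = {"cuda_cores": 64, "tensor_cores": 8, "rt_cores": 1, "texture_units": 4}
PER_CONTROLLER = {"rops": 8, "l2_cache_kb": 512, "memory_bus_bits": 32}


def die_config(sms: int, memory_controllers: int) -> dict:
    """Return die-wide totals for a given number of enabled SMs and memory controllers."""
    config = {name: count * sms for name, count in PER_SM.items()}
    config.update(
        {name: count * memory_controllers for name, count in PER_CONTROLLER.items()}
    )
    return config


# Full die: 72 SMs and 12 memory controllers.
full_tu102 = die_config(sms=72, memory_controllers=12)

# RTX 2080 Ti: four SMs and one memory controller disabled.
rtx_2080_ti = die_config(sms=72 - 4, memory_controllers=12 - 1)

if __name__ == "__main__":
    for name in full_tu102:
        print(f"{name:16} full={full_tu102[name]:5}  2080 Ti={rtx_2080_ti[name]:5}")
```

With these ratios, disabling one memory controller accounts for the eight missing ROPs and narrows the bus from 384 to 352 bits.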
The Turing SM (Streaming Multiprocessor)
With every new GPU microarchitecture, NVIDIA always redesigns (or rather rearranges) its SM or Streaming Multiprocessor. The core count per block is increased or decreased, the arrangement of the Load/Store Units and the Special Function Units is changed and at times the cache configuration is overhauled. This time all three have been changed:
The Turing SM is vastly different from the Pascal SM. Like the latter, it is partitioned into four blocks. Each block contains three independent pipelines: INT32 (integer add, subtract, compare), FP32 (floating-point add, multiply and fused multiply-add, or FMA) and the Tensor Cores for AI and deep learning workloads.
Have a look at the Pascal SM. The integer and FP32 cores aren’t differentiated. As a result, each path could execute either an integer or a floating-point operation at any given time, but not both. While a floating-point operation was executing, integer work had to wait, and vice versa.
The Turing SM has three different paths or pipelines per block, one for each type of operation. This allows integer instructions to execute in parallel with floating-point arithmetic. NVIDIA estimates that games issue roughly 36 integer instructions for every 100 floating-point instructions, so moving the integer work to its own pipeline can free up to about 36% more floating-point throughput.
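A toy model helps show where a figure like 36% can come from. The sketch below is a deliberately simplified illustration, not a description of the real scheduler: it assumes each pipeline retires one instruction per cycle, ignores dependencies and memory stalls, and assumes an instruction mix of about 36 integer instructions per 100 floating-point instructions. All function names are made up for this example.

```python
def cycles_shared_pipe(fp_ops: int, int_ops: int) -> int:
    """Pascal-style model: integer and FP instructions take turns on one path."""
    return fp_ops + int_ops


def cycles_separate_pipes(fp_ops: int, int_ops: int) -> int:
    """Turing-style model: integer and FP instructions run side by side,
    so the longer of the two streams sets the total time."""
    return max(fp_ops, int_ops)


def speedup(fp_ops: int, int_ops: int) -> float:
    """Ratio of shared-pipe cycles to separate-pipe cycles for one instruction mix."""
    separate = cycles_separate_pipes(fp_ops, int_ops)
    if separate == 0:
        raise ValueError("need at least one instruction to compare")
    return cycles_shared_pipe(fp_ops, int_ops) / separate


if __name__ == "__main__":
    # Assumed mix: 36 integer instructions for every 100 FP instructions.
    fp_ops, int_ops = 100, 36
    print("shared pipe cycles:   ", cycles_shared_pipe(fp_ops, int_ops))
    print("separate pipes cycles:", cycles_separate_pipes(fp_ops, int_ops))
    print(f"speedup: {speedup(fp_ops, int_ops):.2f}x")
```

In this model the gain is capped by the instruction mix: a pure floating-point workload sees no benefit at all, which is why the figure is best read as an upper bound rather than a typical in-game uplift.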