GPU and its journey towards Deep Learning

Source: Deep Learning on Medium

By Shubham Gupta

I have always been a gaming enthusiast, and now I am a Data Scientist. This week I was curious about how the GPU made its way into the Deep Learning space, and why it has played a more crucial role in Deep Learning than ever since the start of this decade. How did a processing unit designed only for rendering graphics suddenly become a boon for Deep Learning engineers and researchers? Let's take a look at its journey.

GPU and its complex relation with CPU

Wikipedia defines GPU as:

A graphics processing unit (GPU) is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device.

So the GPU's intended function is to accelerate the creation of images for an output display. In early iterations of gaming consoles, developers faced frame buffer limitations because RAM on the chipsets was expensive. Dedicated parallel processing units, known as video graphics cards, were introduced to handle rendering so that there would be less need for large buffers. Rendering usually follows this pipeline:

  1. Vertex Processing: computing the geometry of the scene relative to the camera
  2. Primitive Processing: collecting the vertex processing output and composing it into simple primitives (like lines, points or triangles)
  3. Rasterization: each primitive is broken down into discrete elements called fragments
  4. Fragment Generation: here a fragment shader assigns a color, a depth value and a stencil value to each fragment
  5. Pixel Generation: the fragments are composed into the final pixels in the frame buffer
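
To make the five steps concrete, here is a minimal software sketch of the pipeline for a single triangle. It is only an illustration in plain Python, not how a real GPU or graphics API is implemented: the function names (project, to_screen, edge, rasterize, render), the 8x8 frame size and the simple pinhole projection are all assumptions made for this example.

```python
def project(vertex, focal=1.0):
    """Step 1, vertex processing: perspective-project a camera-space point."""
    x, y, z = vertex
    return (focal * x / z, focal * y / z, z)


def to_screen(point, width, height):
    """Map coordinates in [-1, 1] to pixel coordinates (y grows downwards)."""
    x, y, z = point
    return ((x + 1) * 0.5 * (width - 1), (1 - y) * 0.5 * (height - 1), z)


def edge(a, b, p):
    """Signed area term telling on which side of edge a->b the point p lies."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(tri, width, height):
    """Step 3, rasterization: yield (x, y, depth) fragments inside a triangle."""
    a, b, c = tri
    area = edge(a, b, c)
    if area == 0:  # degenerate triangle covers no pixels
        return
    for y in range(height):
        for x in range(width):
            p = (x, y)
            w0, w1, w2 = edge(b, c, p), edge(c, a, p), edge(a, b, p)
            # The pixel is inside when all three terms share the sign of area.
            if all(w * area >= 0 for w in (w0, w1, w2)):
                # Interpolate depth with barycentric weights w / area.
                depth = (w0 * a[2] + w1 * b[2] + w2 * c[2]) / area
                yield x, y, depth


def render(vertices, color, width=8, height=8):
    """Run one triangle through all five steps and return the frame buffer."""
    # Steps 1-2: process the vertices and assemble them into one primitive.
    tri = [to_screen(project(v), width, height) for v in vertices]
    depth_buffer = [[float("inf")] * width for _ in range(height)]
    frame = [[None] * width for _ in range(height)]
    for x, y, depth in rasterize(tri, width, height):  # step 3
        # Step 4: each fragment carries a color and a depth value.
        if depth < depth_buffer[y][x]:
            depth_buffer[y][x] = depth
            frame[y][x] = color  # step 5: write the final pixel
    return frame


frame = render([(-1, -1, 1), (1, -1, 1), (0, 1, 1)], "red")
for row in frame:
    print("".join("#" if pixel else "." for pixel in row))
```

Every pixel in the double loop of rasterize is tested independently of the others, which is exactly the kind of work a GPU spreads across many threads.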

All of the operations in steps 1 to 4 require massive vector operations. While the CPU and GPU perform similar functions in their respective ALUs (Arithmetic Logic Units), the CPU works against a larger but slower pool of RAM, whereas a GPU carries a smaller amount of more expensive memory that is much faster to access. Transferring portions of the data set to GPU memory therefore speeds things up, which is why data is packed into suitable 2D and 3D formats before being moved to GPU memory for processing. Another big advantage of modern GPUs is their architecture of many cores and threads, compared with a CPU's few cores and larger caches. A parallel processing platform with many cores and small, fast caches is a big reason to use GPUs for data-intensive operations.

The advantages of this hardware accelerator were well known, but programming GPUs required specialized skills and was time consuming.

GPGPU in mainstream

General-purpose computing on GPUs (GPGPU) means taking advantage of GPUs to execute computations that are usually handled by CPUs. The scientific community had long been working on using GPUs for scientific computation, but the bigger revolution came when NVidia introduced CUDA, which allowed developers to run intense computational tasks on a GPU without worrying about graphics processing. CUDA was later followed by Microsoft's DirectCompute and Apple/Khronos Group's more generic OpenCL. Together these allowed developers and researchers to exploit the underlying computational power of GPUs in the late 2000s.

CUDA Architecture (Credits: Wikipedia)

The GPU is optimized for high computational power and high throughput, both of which are needed for graphics processing. The basic building block of a GPU is the Streaming Multiprocessor (SM). Each SM consists of many ALUs, called CUDA cores in NVidia GPUs. An SM executes threads in warps (bundles of 32 threads), and every thread in a warp performs the same operation. NVidia calls this model Single Instruction Multiple Threads (SIMT), with each thread using its own thread id to decide which memory location to access.
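
The SIMT idea can be sketched in a few lines of plain Python. This is a sequential simulation, not CUDA code: the names vector_add_kernel and launch are hypothetical, and the inner loop over 32 lanes stands in for what the hardware would run in lockstep.

```python
WARP_SIZE = 32  # threads per warp, as described above


def vector_add_kernel(thread_id, a, b, out):
    """The same instructions run in every thread; only thread_id differs."""
    if thread_id < len(out):  # threads past the end of the data do nothing
        out[thread_id] = a[thread_id] + b[thread_id]


def launch(kernel, n_threads, *args):
    """Simulate a kernel launch warp by warp and return the warp count."""
    n_warps = (n_threads + WARP_SIZE - 1) // WARP_SIZE  # round up
    for warp in range(n_warps):
        # On a GPU these 32 lanes would execute together; here we loop.
        for lane in range(WARP_SIZE):
            kernel(warp * WARP_SIZE + lane, *args)
    return n_warps


a = list(range(70))
b = [10] * 70
out = [0] * 70
warps = launch(vector_add_kernel, len(out), a, b, out)
print(warps, out[:5])
```

With 70 elements the launch needs three warps, and the last warp contains threads whose id falls outside the data, which is why the kernel carries a bounds check.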

The execution model of a GPU can also be described as a Bulk Synchronous Parallel (BSP) model. An important part of analyzing a BSP algorithm is quantifying the synchronization and communication it needs, while launching more threads than there are hardware resources available (known as parallel slackness) so that the hardware always has work to schedule.
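
As a rough illustration of the BSP idea, the sketch below sums a list in supersteps. Each superstep has a compute and communicate phase followed by a barrier, and the number of supersteps is the synchronization cost you would count in an analysis. The function name bsp_sum is hypothetical and the code runs sequentially; it only mimics the structure of a parallel tree reduction.

```python
def bsp_sum(values):
    """Sum values in BSP-style supersteps; return (total, supersteps)."""
    data = list(values)
    supersteps = 0
    stride = 1
    while stride < len(data):
        # Compute and communicate: each active "processor" i reads the value
        # held by its partner at i + stride. All reads happen on the old
        # state, so no processor sees a half-updated array.
        updates = {
            i: data[i] + data[i + stride]
            for i in range(0, len(data) - stride, 2 * stride)
        }
        # Barrier: the results become visible to everyone at the same time.
        for i, value in updates.items():
            data[i] = value
        stride *= 2
        supersteps += 1
    return (data[0] if data else 0), supersteps


print(bsp_sum(range(1, 9)))
```

For eight values the sum finishes in three supersteps instead of seven sequential additions, at the price of a barrier after every step.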

Map, reduce, stream filtering, scan, scatter, gather, sort and search, along with data structures such as dense arrays, are the most commonly used operations on GPUs. Having mentioned the operations, it is only fair to list some applications of GPGPU, which include:

  • Newtonian Physics Simulator
  • Ray Tracing
  • Statistical Physics
  • Digital Signal Processing
  • Fuzzy Logic
  • kNNs
  • Weather Simulations and Forecasting
  • Cryptography and Cryptanalysis
  • Blockchain Verification
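
Coming back to the data-parallel operations named before this list, here is a small sketch of what map, reduce, scan, stream filtering, gather and scatter each compute. It uses only the Python standard library and runs sequentially; gather and scatter are hypothetical helper names written for this example, and a GPU library would run the same operations across many threads.

```python
from functools import reduce
from itertools import accumulate
import operator


def gather(src, indices):
    """Read src at the given indices: out[k] = src[indices[k]]."""
    return [src[i] for i in indices]


def scatter(values, indices, size, fill=0):
    """Write values to the given indices: out[indices[k]] = values[k]."""
    out = [fill] * size
    for value, i in zip(values, indices):
        out[i] = value
    return out


data = [3, 1, 4, 1, 5]

squared = list(map(lambda x: x * x, data))   # map: same function per element
total = reduce(operator.add, data)           # reduce: combine to one value
prefix = list(accumulate(data))              # scan: running (inclusive) sums
evens = [x for x in data if x % 2 == 0]      # stream filter: keep a subset
picked = gather(data, [4, 0, 2])             # gather: indexed reads
placed = scatter([7, 8], [3, 0], size=5)     # scatter: indexed writes

print(squared, total, prefix, evens, picked, placed)
```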

Deep Learning with GPUs

While training a model or network for Deep Learning, heavy computational power is required to evaluate the objective function and run the optimizer. The many forward and backward propagation passes need a lot of compute as well. This is where developers capitalized on the large amount of parallelism that GPUs offer. The most common kernels are matrix multiplication, convolution, and neuron computations with no data dependencies between them.
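
To show why matrix multiplication dominates, here is a minimal sketch of one training step for a single linear layer with a mean squared error loss, in plain Python. The names matmul, transpose and train_step, the toy data and the learning rate are assumptions for this example. Both the forward and the backward pass reduce to a matrix multiply, and every output cell of a matrix multiply can be computed independently of the others.

```python
def matmul(a, b):
    """c[i][j] needs only row i of a and column j of b, so every cell
    could be computed by its own GPU thread."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def transpose(m):
    return [list(col) for col in zip(*m)]


def train_step(x, w, y, lr=0.1):
    """One forward and backward pass; returns (updated weights, loss)."""
    n = len(x)
    pred = matmul(x, w)  # forward pass: X @ W
    err = [[p - t for p, t in zip(pr, tr)] for pr, tr in zip(pred, y)]
    loss = sum(e * e for row in err for e in row) / n
    grad = matmul(transpose(x), err)  # backward pass: X^T @ err
    # Gradient of the mean squared error is (2 / n) * X^T @ err.
    new_w = [
        [wv - lr * 2 * g / n for wv, g in zip(w_row, g_row)]
        for w_row, g_row in zip(w, grad)
    ]
    return new_w, loss


# Toy data generated from the weights [[2.0], [-1.0]].
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [[2.0], [-1.0], [1.0]]
w = [[0.0], [0.0]]
for _ in range(200):
    w, loss = train_step(x, w, y)
print(w, loss)
```

A real network stacks many such layers and uses far larger matrices, which is where handing the multiplications to thousands of GPU cores pays off.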

A simple DAG for detecting MNIST

Frameworks like TensorFlow, Theano, and CNTK keep improving the user experience by letting users run tensor/vector operations on GPUs without worrying about the underlying principles. And libraries like Keras, PyTorch, H2O and gensim make engineers' lives much easier by enabling faster implementation.
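
As an example of how little GPU knowledge these libraries ask for, here is a minimal sketch using PyTorch. It assumes PyTorch may or may not be installed and falls back to the CPU either way; pick_device and matmul_on_best_device are hypothetical helper names, while torch.cuda.is_available, torch.device and torch.randn are standard PyTorch calls.

```python
def pick_device():
    """Return "cuda" if PyTorch is installed and sees a GPU, else "cpu"."""
    try:
        import torch
    except ImportError:  # PyTorch not installed: nothing to run on a GPU
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def matmul_on_best_device(n=256):
    """Multiply two random n x n matrices on the chosen device."""
    import torch  # requires PyTorch to be installed

    device = torch.device(pick_device())
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    return (a @ b).shape  # same code path for CPU and GPU


print(pick_device())
```

The matrix multiply is written once, and the device argument decides where it runs, so no CUDA code has to be written by hand.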

On average, the standard neural networks used in industry have hundreds of thousands of parameters and weights, and often many more, that need to be computed during training, which is a mammoth task. For comparison, Google's multi-million dollar Google Brain project ran on about 1,000 CPU servers (roughly 16K CPU cores). NVidia reported that the same work was replicated using only 3 GPU-accelerated servers with about 18K GPU cores, costing around $33K.

Design credit: xkcd


Today there are multiple libraries for writing reusable code that exploits GPUs, but there are a few limitations as well. Entry-level GPUs have limited on-board memory. With newer architectures and IaaS (Infrastructure as a Service), multiple high-capacity GPUs can be used together, but that again is heavy on the pocket.

These days companies and manufacturers also provide other forms of hardware accelerators for deep learning on the go: AI-powered chipsets in mobile phones, low-power Atom-class chipsets integrated on the processor die, deep learning hardware for self-driving cars, and powerful image processing units for better picture quality in cameras and displays. These are low-power components that are only called on when a heavy amount of processing is required. Google and NVidia have also introduced dedicated hardware for tensor operations (Google's TPU and NVidia's Tensor Cores), which has set the computational bar really high.

The basic operations of Deep Learning apply to all kinds of input data, which makes these networks robust and a natural fit for the parallel computation GPUs offer. We can fairly say that GPUs are not going anywhere soon.