All Things GPU: Part 2


OpenCL

By contrast, the Open Computing Language (OpenCL) was released by the Khronos Group in late 2008, with the first implementations shipping in 2009. It generally relies on external host code which manages buffers and JIT-compiles kernels at runtime. Instead of CUDA's warps, OpenCL implementations (AMD's in particular) speak of wavefronts. Instead of threads, OpenCL uses work-items and work-groups. Kernels are usually JIT compiled, but a pre-compiled intermediate representation called SPIR-V is available for speed. OpenCL drivers are available for many devices including GPUs and CPUs. The kernel language is based on a subset of C99; a related Khronos standard called SYCL layers single-source C++11 on top. SYCL, and the OpenCL C++ kernel language available in version 2.2+, are out of scope here (hopefully a later addition). Have a look at the above saxpy sample converted to OpenCL:

kernel void saxpy(int n, float a, global float* x, global float* y)
{
    const int i = get_global_id(0);  // unique index of this work-item
    if (i < n)
        y[i] = a*x[i] + y[i];        // y = a*x + y
}

Anything performed via OpenCL requires a context. An OpenCL context is analogous to a container: it holds all of our buffers and command queues and can utilize one or more devices within a single platform. One of the tricky bits of OpenCL is context configuration. A context also can't be shared between processes, making IPC tricky. Since it's an open standard, vendor support is uneven in quality. You'll also find examples that use different host languages: some samples are written in Python, some in C, some in C++, and so on. All of these options require you to know two languages, one to perform host-side setup on the CPU and one to actually execute on the GPU as OpenCL code. Most host code is also stateless, consisting of the following steps (a condensed C sketch follows the list):

1. Set up OpenCL context with selected device(s).
2. Create a command queue.
3. Load and compile a CL program.
4. Allocate and load required buffers.
5. Enqueue your kernel(s) from the program.
6. Read results.
7. Destroy buffers, queue, and context.
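Here is a minimal sketch of those seven steps in C, assuming a single GPU device and the saxpy kernel above. Error checking is collapsed for brevity; real code should test every returned cl_int.

/* Minimal OpenCL host sketch: context -> queue -> program -> buffers ->
   kernel -> read -> teardown. Assumes one platform with one GPU device. */
#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "kernel void saxpy(int n, float a, global float* x, global float* y)\n"
    "{ const int i = get_global_id(0); if (i < n) y[i] = a*x[i] + y[i]; }\n";

int main(void)
{
    enum { N = 1024 };
    float a = 2.0f, x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* 1. Set up a context with the first GPU on the first platform. */
    cl_platform_id plat;  clGetPlatformIDs(1, &plat, NULL);
    cl_device_id dev;     clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);

    /* 2. Create a command queue. */
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* 3. Load and JIT-compile the CL program. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "saxpy", NULL);

    /* 4. Allocate buffers and copy host data in. */
    cl_mem bx = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_COPY_HOST_PTR, sizeof x, x, NULL);
    cl_mem by = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof y, y, NULL);

    /* 5. Set arguments and enqueue the kernel. */
    int n = N;
    clSetKernelArg(k, 0, sizeof n, &n);
    clSetKernelArg(k, 1, sizeof a, &a);
    clSetKernelArg(k, 2, sizeof bx, &bx);
    clSetKernelArg(k, 3, sizeof by, &by);
    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* 6. Read the result back (blocking read). */
    clEnqueueReadBuffer(q, by, CL_TRUE, 0, sizeof y, y, 0, NULL, NULL);
    printf("y[0] = %f\n", y[0]);  /* expect 4.0 */

    /* 7. Destroy buffers, queue, and context. */
    clReleaseMemObject(bx); clReleaseMemObject(by);
    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}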

Most of the overhead is in repeated setup, buffer writing/reading, and destruction; for short-lived workloads this can negate the performance gain of the GPU, which highlights one of OpenCL's adoption pains. Take a look at this complex but basic vector addition example from Oak Ridge Labs: https://www.olcf.ornl.gov/tutorials/opencl-vector-addition/

Address Space

Note the special keywords on kernel parameters. Inside the kernel, what looks like a pointer into GPU memory is actually a pointer mapped only for the duration of that kernel's execution. In fact, if you store a handle/pointer into a buffer for use in later kernels, the stored pointer will be invalid, since mappings get reassigned on each run. Don't store buffer pointers! Instead, store simple offsets, or use kernel pipes, which are an advanced topic.
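As an illustration, here is a hypothetical pair of kernels (the names and the selection condition are made up) that persist a location across launches by storing an index instead of a pointer:

// Launch 1: remember WHERE something happened, as an offset.
kernel void find_peak(global const float* data, global int* saved_idx)
{
    const int i = get_global_id(0);
    if (data[i] > 100.0f)
        *saved_idx = i;          // store the index, not &data[i] (last writer wins)
}

// Launch 2: rebuild a valid pointer from the stored offset this run.
kernel void clear_peak(global float* data, global const int* saved_idx)
{
    if (get_global_id(0) == 0)
        data[*saved_idx] = 0.0f; // data + offset is valid within this launch
}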

Constant address space is reserved for read-only memory and has very fast access, since each Compute Unit can cache a local copy without worrying about synchronization issues. Unfortunately, constant address space on devices is fairly limited: about 64 KB on my devices. Most read/write buffers tend to be global, which isn't best for performance but helps with atomic operations and synchronization. You'll sometimes see that our OpenCL 1.2 code needs to define multiple copies of the same function, because the address space is part of a pointer's type. OpenCL 2.0 added the generic address space, which lets the compiler produce variants for all address spaces automatically. To keep things simple as you learn, use global address space until you get comfortable.
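For example, under OpenCL 1.2 a tiny helper has to be duplicated per address space it reads from (the function names here are hypothetical):

// OpenCL 1.2: the address space is baked into the pointer type, so this
// helper must exist once per space we want to read from.
float sum3_global(global const float* v) { return v[0] + v[1] + v[2]; }
float sum3_const(constant float* v)      { return v[0] + v[1] + v[2]; }

kernel void demo(global const float* a, constant float* b, global float* out)
{
    out[get_global_id(0)] = sum3_global(a) + sum3_const(b);
}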

Architecture and Features

A GPU won't run standard x86, obviously. Each device has its own architecture and specification, including endianness and register width. If you exchange binary data with a device that has an endian mismatch, hold off on publishing that dissertation. Certain devices also handle different native types, such as fp16 and fp64, and, most importantly, each device may have its own alignment rules. Where you think you could just cast a buffer to a struct pointer from the CPU, you may find that alignment shifts the fields, often with catastrophic results or segfaults. We'll explore this all in depth in section 4: The Ugly.
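A classic example is OpenCL C's float3, which is sized and aligned like float4 (16 bytes), while a naive host struct of three floats is only 12 bytes. A sketch, using a hypothetical Particle struct:

/* Device side (OpenCL C): float3 is sized and aligned like float4.
   typedef struct { float3 pos; float mass; } Particle;  // mass at offset 16 */

/* Host side (C): a naive mirror shifts every later field. */
typedef struct { float x, y, z; float mass; } WrongParticle; /* mass at offset 12! */

/* Safer: the cl_* types from CL/cl_platform.h carry the device's
   16-byte alignment, so both sides agree on every offset. */
#include <CL/cl.h>
typedef struct { cl_float3 pos; cl_float mass; } Particle;   /* mass at offset 16 */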

Mixed architectures mean we may need to jury-rig buffers into alignment before crunching that particle accelerator data.

Device Partitioning and CGroups

A context by default can utilize the entirety of the devices allocated to it. What if you need to share without hogging all resources? Device partitions, or sub-devices, introduced in OpenCL 1.2, allow you to request an upper limit on your context's resources. This can be used to prevent someone from claiming all of your device's memory or compute units. It is especially useful if you're using your primary display driver for long-running kernels. You may by accident schedule an infinite loop or a long-running kernel that can't be stopped with a simple SIGHUP; the GPU may not have enough time to service your 60 Hz refresh and will instead freeze the screen. Inevitably, most drivers will reset the GPU after a predefined timeout to prevent locking up the system completely. The clinfo command can be used to find out how many partitions or sub-devices your device supports (a sub-device creation sketch follows the output):

$ clinfo | egrep "Device Name|sub-devices"
  Device Name                  Intel(R) HD Graphics
  Max number of sub-devices    0
  Device Name                  Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  Max number of sub-devices    8
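Here is a minimal sketch of carving out a sub-device, assuming a device like the CPU above that reports sub-device support (the 4-compute-unit count is an arbitrary example):

/* Sketch: partition a device into one sub-device with 4 compute units,
   leaving the rest for other tenants. Requires OpenCL 1.2+ and a device
   whose max sub-devices is > 0 (the CPU above; the GPU reports 0). */
#include <CL/cl.h>
#include <stddef.h>

cl_device_id make_subdevice(cl_device_id parent)
{
    const cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_BY_COUNTS, 4,         /* one sub-device, 4 CUs */
        CL_DEVICE_PARTITION_BY_COUNTS_LIST_END, 0 /* end of list, terminator */
    };
    cl_device_id sub = NULL;
    cl_uint count = 0;
    cl_int err = clCreateSubDevices(parent, props, 1, &sub, &count);
    /* 'sub' can now be used anywhere a cl_device_id is expected,
       e.g. when calling clCreateContext. */
    return (err == CL_SUCCESS && count > 0) ? sub : NULL;
}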

Device partitions will be further enhanced by a new kernel patch in Linux ~5.3+ that allows GPU resource limits using standard CGroups. This will be an instant win for container users, and for Nomad in particular, which already uses limited CGroups for exec jobs in addition to containers.