Parallelizing Heavy GPU Workloads via Multi-Queue Operations


Achieving 2x+ speed improvements on large GPU-intensive workloads by leveraging multi-queue operation parallelism using Vulkan and Vulkan Kompute

Operation Execution in Parallel through Multiple Family Queues (Image by Author)

GPUs have proven extremely useful for highly parallelizable data processing use-cases. The computational paradigms found in machine learning and deep learning, for example, fit extremely well to the processing architecture GPUs provide. Recent improvements in GPU architectures are now opening the door to further optimizations, particularly for heavy GPU workloads that can be run concurrently. That is, multiple workloads can run at the same time on the GPU itself by leveraging the underlying hardware queues available. Practical examples that would benefit from this include model parallelism and data parallelism techniques in machine learning.

This article provides a conceptual and practical deep dive on how it is possible to leverage the multiple hardware queues and queue families provided by GPUs to submit workloads that can run at the same time. We will show how we can achieve a 2x speed improvement on a synchronous example by simply submitting the workload across two queue families. This is an important point, as recent announcements around NVIDIA's Ampere GA10x architecture, which enables 3x speed improvements, make it clear that this trend will only continue to bring further optimization opportunities in this area.

We will be using Vulkan and the Vulkan Kompute framework. More specifically we will cover:

  • Disambiguation of “synchronous” and “parallel” in GPU processing
  • A base synchronous example that we will build upon
  • Steps to extend the example for asynchronous workload submission
  • Steps to extend the example for parallel multi-queue GPU processing

You can find the full code and run it here — instructions on how to run the full suite using CMake can be found in the main Kompute repository build section.

If you are interested in learning more about the frameworks used in this post, you can also check out the following articles:

Asynchronous vs Parallel Processing

Before diving into the code, it is important to disambiguate two concepts — asynchronous workload submission and parallel workload processing.

Simplified Vulkan Architecture (Image by Author)

Without going into too much detail, the way parallel workloads are submitted for processing when using the Vulkan SDK is through GPU queues. This can be visualised in the simplified Vulkan architecture image above (pipeline and descriptor components were left out for simplicity).

Asynchronous Workload Submission

Asynchronous processing refers to the ability of the CPU host to do other work while the GPU is processing the workload. Other work can include calling other C++ functions, or even submitting further work to the same or other GPU queues. When the CPU wants to check whether the workload is finished, it can use a Vulkan "Fence", a synchronization resource that allows the CPU to be notified when a GPU workload finishes.

The important point to note is that even if multiple workloads are submitted from multiple C++ threads, as long as they are submitted to the same queue the expected execution ordering will still be sequential.

Parallel Workload Processing

Parallel workload processing consists of the GPU processing two or more workloads at the same time. More specifically, if you had two tasks that each take 10 seconds to process, the theoretical parallel execution would still take 10 seconds in total, as both would be carried out at the same time.

First and foremost, parallel workload processing has to be supported by the underlying GPU. This is important because even if you were to submit workloads across different GPU queues, the processing may still be done sequentially by the underlying hardware due to its limitations.

Base Sequential Processing Example

Let's have a look at the code that we will be using throughout this article. Initially the code is written in a sequential way; we will then convert it into asynchronous code, and finally into parallel code. The workload will consist of the following steps:

  1. Creating a Kompute Manager to orchestrate all GPU work
  2. Create the Kompute Tensors in CPU host that will be used to process data
  3. Map the Kompute Tensors into GPU Device memory
  4. Define compute shader which will keep the GPU busy for a few 100s ms
  5. Run compute shader in the GPU using the Tensors for data processing
  6. Map results of the Kompute Tensors back into CPU Host memory
  7. Verify that the operation was successful

For measuring time we will be using <chrono> from the standard library, mainly calculating the difference between a start and an end time retrieved with std::chrono::high_resolution_clock::now() as follows:
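
As a rough sketch, the timing logic looks something like the following (the variable names here are just illustrative):

```cpp
#include <chrono>

auto startTime = std::chrono::high_resolution_clock::now();

// ... submit the GPU workload(s) and wait for completion ...

auto endTime = std::chrono::high_resolution_clock::now();
auto durationMs = std::chrono::duration_cast<std::chrono::milliseconds>(
        endTime - startTime).count();
```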

You can find the runnable code in this file, which is part of the Vulkan Kompute test suite.

1. Creating a Kompute Manager to orchestrate all GPU work

First we have to create the Kompute Manager, which performs all the required memory management and, in our case, creates all required Vulkan resources. By default the Kompute Manager will pick GPU device 0, but you are able to pass the specific device index you would prefer to initialise with, and if preferred you can pass your own Vulkan resources if you already have a Vulkan application.
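
A minimal sketch of this step, assuming the Kompute Manager API described in this article (exact constructor overloads may vary across Kompute versions):

```cpp
// Default manager: picks GPU device 0 and creates the required Vulkan
// resources (instance, device, queue) internally
kp::Manager mgr;

// Alternatively, an explicit device index can be passed, e.g. device 1:
// kp::Manager mgr(1);
```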

2. Create the Kompute Tensors in CPU host that will be used to process data

We will now be able to create a set of Kompute Tensors. We first initialise the data in the CPU host, consisting of an array of 10 zeros. We will be using two tensors as we'll be running two algorithm executions. We will be able to check these Kompute Tensors at the end to confirm that the execution has been successful.
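
A sketch of this step, assuming a kp::Tensor can be constructed from a std::vector<float> as in the Kompute version used here:

```cpp
// Two tensors, each holding 10 zero-initialised floats in CPU host memory
auto tensorA = std::make_shared<kp::Tensor>(std::vector<float>(10, 0.0));
auto tensorB = std::make_shared<kp::Tensor>(std::vector<float>(10, 0.0));
```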

3. Map the Kompute Tensors into GPU Device memory

Stanford CS149 Course 2019 Slides

We are now able to copy the host data of the Kompute Tensors into the GPU Device memory.

This is an important step, as by default the Kompute Tensors use device-only-visible memory, which means a GPU operation is needed to copy the data across through a staging tensor.

Vulkan Kompute allows us to create the buffer and GPU memory block, as well as perform the copy with a staging buffer, through the kp::OpTensorCreate operation.
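
With the manager and tensors from the previous steps, this is roughly a single operation submission:

```cpp
// Creates the Vulkan buffers and device memory for both tensors, and copies
// the host data into device-only-visible memory through a staging tensor
mgr.evalOpDefault<kp::OpTensorCreate>({ tensorA, tensorB });
```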

4. Define compute shader which will keep the GPU busy for a few 100s ms

The compute shader that we create has a relatively large loop to simulate an “expensive computation”. It basically performs a unit addition for 100000000 iterations and adds the result to the input Tensor.
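
One possible version of such a shader, embedded as a C++ raw string so it can be compiled at runtime (the exact shader in the original test may differ; the atomicAdd on shared memory is mainly there to keep the compiler from folding the loop away):

```cpp
static std::string shader = R"(
    #version 450

    layout (local_size_x = 1) in;
    layout (set = 0, binding = 0) buffer b { float pb[]; };

    shared uint sharedTotal[1];

    void main() {
        uint index = gl_GlobalInvocationID.x;

        sharedTotal[0] = 0;

        // "Expensive computation": one unit addition per iteration
        for (int i = 0; i < 100000000; i++) {
            atomicAdd(sharedTotal[0], 1);
        }

        pb[index] += float(sharedTotal[0]);
    }
)";
```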

5. Run compute shader in the GPU using the Tensors for data processing

Now we are able to submit the compute shader for execution through the kp::OpAlgoBase operation. This basically allows us to perform a submission of the shader with the respective tensor. This initial implementation runs the execution synchronously, so it will first run the execution of the shader with tensorA, and then the execution of the same shader with tensorB.
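
A sketch of the two sequential submissions, assuming kp::OpAlgoBase accepts the shader source as a std::vector<char> (template parameters and overloads vary across Kompute versions):

```cpp
// Shader source in the format expected by the operation
std::vector<char> shaderData(shader.begin(), shader.end());

// Sequential execution: the second submission only starts once the first
// one has finished and returned control to the CPU
mgr.evalOpDefault<kp::OpAlgoBase<>>({ tensorA }, shaderData);
mgr.evalOpDefault<kp::OpAlgoBase<>>({ tensorB }, shaderData);
```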

6. Map results of the Kompute Tensors back into CPU Host memory

Finally we want to retrieve the results from the GPU device memory into the CPU host memory so we can access it from C++. For this we can use the kp::OpTensorSync operation.
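
A sketch, assuming the local-sync variant of the operation (kp::OpTensorSyncLocal), which copies data from the GPU device back to the host:

```cpp
// Copies the device-only-visible tensor data back into host-visible memory
mgr.evalOpDefault<kp::OpTensorSyncLocal>({ tensorA, tensorB });
```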

7. Verify that the operation was successful

Finally we can just check that both resulting kp::Tensors contain the expected value of 100000000.
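
In the test suite this is just a couple of assertions; outside of a test the vectors could be compared manually:

```cpp
// Each element should now hold the 100000000 unit additions from the shader
EXPECT_EQ(tensorA->data(), std::vector<float>(10, 100000000));
EXPECT_EQ(tensorB->data(), std::vector<float>(10, 100000000));
```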

Extending for Asynchronous Workload Submission

The steps required to extend this example for asynchronous submission are quite minimal. The only thing we need to do is substitute the evalOpDefault function for the evalOpAsyncDefault function, and then use evalOpAwaitDefault(<timeInNanoSecs>) to wait until the job is finished. This basically looks as follows:
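
A sketch using the same shaderData as before (the timeout value here is just an illustrative 10 seconds expressed in nanoseconds):

```cpp
// Submit both workloads without blocking the CPU host thread
mgr.evalOpAsyncDefault<kp::OpAlgoBase<>>({ tensorA }, shaderData);
mgr.evalOpAsyncDefault<kp::OpAlgoBase<>>({ tensorB }, shaderData);

// ... the CPU is free to do other work here ...

// Block until the submitted workload signals its fence
mgr.evalOpAwaitDefault(10000000000);
```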

As you can see we are able to submit two tasks for processing asynchronously, and then wait until they are finished with the Await function.

One thing to mention is that in the implementation above we are only waiting for the second submitted operation, as each call to evalOpAsyncDefault creates a new managed sequence. The proper way to deal with this is covered in the next section, and involves explicitly created sequences.

Extending for Parallel Workload Processing

Now that we know how to execute multiple workloads asynchronously, we can extend this to leverage the multiple queues in the GPU and achieve parallel execution of workloads.

Running on an NVIDIA 1650 Video Card

In order to show a useful example, we will dive into how this would be achieved on an NVIDIA 1650 video card. You are able to try this yourself by checking the device report of your own video card, namely the queue families and parallel processing capabilities available.

Conceptual Overview of Queues in NVIDIA 1650 (Image by Author)

The NVIDIA 1650 GPU has 3 queue families. Using G for GRAPHICS, T for TRANSFER and C for COMPUTE capabilities, it has a G+T+C family on familyIndex 0 with 16 queues, a T family on familyIndex 1 with 2 queues, and a T+C family on familyIndex 2 with 8 queues.

As of today, NVIDIA does not support parallel processing of workloads submitted to multiple queues of the same family. However, it does support parallelization across queue families. This means that workloads submitted to graphics and compute queues can be parallelized, and we will be using this knowledge in our implementation.

Implementation of Parallel Workload Execution

So far we have been submitting all GPU workloads to a single queue, namely queue index 0 of the GRAPHICS family (familyIndex 0). As mentioned briefly above, on the NVIDIA 1650 we will be able to achieve parallel processing if we submit workloads across the GRAPHICS family and the COMPUTE family. The diagram below should provide an intuition of what we will be doing.

Operation Execution in Parallel through Multiple Family Queues (Image by Author)

In order for us to do this, we will need to modify three key things:

  1. Kompute Manager is initialised with the respective queues available
  2. We create two Kompute Sequences with each respective queue allocated
  3. We run the operations on each respective queue

We will dive into each of these three points.

1. Kompute Manager is initialised with the respective queues available

When initialising a manager we are able to pass an array containing the queues that we would like to fetch. In this case we only fetch one graphics queue and one compute queue; however, based on the hardware specs of the NVIDIA 1650, we would be able to request up to 16 graphics queues (familyIndex 0), 2 transfer queues (familyIndex 1), and 8 compute queues (familyIndex 2).
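
A sketch, assuming the manager constructor takes the device index followed by the list of queue family indices to fetch (the exact parameter order may differ across Kompute versions):

```cpp
// Device 0, requesting one queue from the GRAPHICS+TRANSFER+COMPUTE family
// (familyIndex 0) and one queue from the TRANSFER+COMPUTE family
// (familyIndex 2) of the NVIDIA 1650
kp::Manager mgr(0, { 0, 2 });
```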

2. We create two Kompute Sequences with each respective queue allocated

Now we are able to explicitly initialise two managed sequences, each allocated to a different queue, referencing the index of the array we passed in the previous step.
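
A sketch of the two named sequences, where the second parameter references the index in the queue array passed to the manager above (the exact function name and signature may differ between Kompute versions):

```cpp
// Sequence bound to the graphics queue (index 0 of the array above)
mgr.createManagedSequence("queueOne", 0);
// Sequence bound to the compute queue (index 1 of the array above)
mgr.createManagedSequence("queueTwo", 1);
```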

3. We run the operations on each respective queue

Now we are able to run operations in each respective queue. In this case both of the GPU workloads are submitted in parallel.
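
A sketch of the parallel submission, assuming async/await overloads that accept a named sequence (again, exact signatures may vary across Kompute versions):

```cpp
// Each workload goes through its own sequence, and hence its own queue
// family, so the NVIDIA 1650 can process them at the same time
mgr.evalOpAsync<kp::OpAlgoBase<>>({ tensorA }, "queueOne", shaderData);
mgr.evalOpAsync<kp::OpAlgoBase<>>({ tensorB }, "queueTwo", shaderData);

// Wait for both submissions to finish
mgr.evalOpAwait("queueOne");
mgr.evalOpAwait("queueTwo");
```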

Parallel Workload Execution Results

When running the code above, we can see a 2x speed improvement in execution time thanks to submitting the workloads across two queue families. You can also see that if we were to add extra queues from the GRAPHICS or COMPUTE families, we would not see any further speed improvements, as parallelization across multiple queues of the same family is not supported on this NVIDIA 1650 card.

You can find the full code and run it in this file — instructions on how to run the full suite using CMake can be found in the main Kompute repository.

This is a particularly important result: based on the recent announcement from NVIDIA coming together with the release of their RTX 30-series video cards, the Ampere GA10x architecture introduces improvements that allow two compute workloads to run simultaneously. Relative to the example above, this means that we could see a 3x improvement if we were to use one GRAPHICS queue and two COMPUTE queues (together with the extra performance of using the TRANSFER queue for transfer operations).

Next Steps

Congratulations, you've made it all the way to the end! Although this post covered a broad range of topics, many concepts were only skimmed through. These include the underlying Vulkan concepts, GPU computing fundamentals, and more advanced Vulkan Kompute concepts. Luckily, there are resources online to expand your knowledge on each of these. Here are some links I recommend for further reading: