Fast GPU based PyTorch model serving in 100 lines of Python

Source: Deep Learning on Medium

Suppose you need to create an inference service to support the following applications:

  1. Image classification for an app on a smartphone.
  2. Predictive text for chats using a pre-trained BERT model.
  3. Parallel rollouts of a policy in reinforcement learning.

In these example scenarios, requests will have small batch sizes (often 1) and arrive asynchronously. You could do CPU inference, but this will incur higher latency and much lower throughput for the same cost of cloud hardware[1]. So for high-throughput, low-latency applications we often want to use a GPU.

The questions I’ll attempt to answer in this blog post are:

  1. How do we effectively use a GPU even when dealing with asynchronous requests and small batches?
  2. How do we serve a PyTorch model from a REST API so the service can be used?

Test setup

All tests and benchmarks were conducted on Google Compute Engine.

Instance type: n1-standard-4 (4 vCPUs, 15 GB memory)

GPU: Nvidia K80

PyTorch Model: Torchvision Resnet18 Pretrained Model in Eval mode.

Throughput & Latency vs Batch Size

To characterise the hardware and model configuration I measured the throughput and latency for batch sizes from 1 to 192; the GPU ran out of memory for batch sizes larger than 192. The results were averaged across 1024 examples for each batch size tested. The throughput can be seen in Figure 1 and the latency in Figure 2.
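
A measurement of this kind can be sketched roughly as follows. This is a minimal, framework-agnostic sketch and not the code used for the figures: `benchmark` and `fake_model` are hypothetical names, and a real GPU benchmark would also need to synchronise with the device (for example with `torch.cuda.synchronize()`) before reading the clock.

```python
import time


def benchmark(run_batch, batch_size, n_examples=1024):
    """Return (mean latency per batch in seconds, throughput in examples/second)."""
    n_batches = max(1, n_examples // batch_size)
    start = time.perf_counter()
    for _ in range(n_batches):
        run_batch(batch_size)
    elapsed = time.perf_counter() - start
    latency = elapsed / n_batches
    throughput = (n_batches * batch_size) / elapsed
    return latency, throughput


def fake_model(batch_size):
    # Hypothetical stand-in for a forward pass: fixed overhead plus a per-example cost.
    time.sleep(0.001 + 0.0001 * batch_size)


if __name__ == "__main__":
    for bs in (1, 16, 64):
        latency, throughput = benchmark(fake_model, bs, n_examples=128)
        print(f"batch={bs:3d} latency={latency * 1000:.2f}ms throughput={throughput:.0f}/s")
```

Swapping `fake_model` for a function that runs the real model on a preallocated input gives the per-batch-size numbers that the two figures plot.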

Figure 1: Throughput vs Batch Size
Figure 2: Latency vs Batch Size

From the plot in Figure 1, the throughput has an interesting spike at 16 and increases up to a batch size of 64 before falling off. Similarly, the latency in Figure 2 increases linearly before undergoing a small positive step change at 64 and increasing linearly again. A batch size of 1 has a throughput of 128 Examples / Second, while a batch size of 64 is above 400 Examples / Second, an increase of more than 3x.

This is important for our use case because once the GPU is 100% busy (temporally speaking, so the throughput is at its maximum for the batch size in use, whether that is 1 or 64), any new request simply adds its inference time to the latency of all requests queued behind it, which is bad. By using a maximum batch size of 64 instead of 1 we can serve 3x the inference load before entering this saturation region.

Asynchronous GPU execution

It’s common knowledge that Python code is effectively limited to a single CPU core at a time because of the somewhat infamous Global Interpreter Lock. This can be a problem when trying to write high-performance CPU code, but when using the GPU as the primary compute device PyTorch offers a solution: PyTorch CUDA execution occurs in parallel to CPU execution[2].

Here’s a concrete example:

y = cuda_model(x)    # Perform forward pass with cuda tensor x.
time.sleep(0.1)      # Wait 100ms.
y_cpu = y.to("cpu")  # Move output cuda tensor y to cpu.

In this example the GPU will be computing cuda_model(x) while the CPU is executing time.sleep(0.1), before they are forced to synchronise by moving the result to the CPU. This is illustrated in Figure 3.

Figure 3: PyTorch Asynchronous GPU Execution

What if we want to keep doing work on the CPU until the GPU is finished? This can be achieved using the torch.cuda.Event class, as in the example below.

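y = cuda_model(x)  # Perform forward pass with cuda tensor x.
cuda_event = torch.cuda.Event()
cuda_event.record()  # Mark this point in the CUDA stream, after the forward pass.
while not cuda_event.query():  # True once the GPU has reached the recorded point.
    time.sleep(0.001)  # Wait 1ms.
y_cpu = y.to("cpu")  # Move output cuda tensor y to cpu.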

In this case, we can repeatedly re-run `time.sleep(0.001)` while we wait for the GPU execution to finish by monitoring the cuda_event.

All this means that as long as we keep the CPU execution time shorter than the GPU one, we can keep the GPU fully utilised by preparing the next batch of data in place of time.sleep. This is how training in PyTorch can fully utilise the GPU even without using the multiprocessing library, and we can use the same feature for asynchronous batch collection.

Asyncio and grouping concurrent requests

To keep the client code simple I wanted to keep the REST API synchronous, i.e. you request inference with the input data x and the response comes back with the result, like a function call. That means that server-side we need some way of handling multiple requests concurrently.

To achieve this I decided to try Asyncio and the Asyncio web framework aiohttp. Having never used Asyncio I found it a little tricky, and from what I have read that seems to be a common experience. It does, however, solve the concurrency problem at a higher abstraction level than, say, threading. I found looking at example applications and the official documentation [3] to be very helpful in the learning process.

What we want from Asyncio is to capture multiple HTTP requests asynchronously, perform batch inference synchronously, and return the results asynchronously. This, together with the asynchronous GPU execution, should give us the foundation we need to build the inference server.

The best pattern (it might be terrible) I found was creating an asyncio.Future for each request and placing them in an asyncio.Queue to be handled by an inference coroutine. The inference coroutine can remove multiple futures from the queue, perform batch inference and set the result for each of the futures in the batch which will allow their respective “awaiting” coroutines to resume.
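
The pattern can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the server's code: `handle_request`, `inference_loop`, `fake_batch_inference` and `MAX_BATCH_SIZE` are hypothetical names, the "model" simply doubles its inputs, and the aiohttp request handling is left out.

```python
import asyncio

MAX_BATCH_SIZE = 4  # hypothetical value for the demo


def fake_batch_inference(inputs):
    # Stand-in for the model: doubles every input in the batch.
    return [2 * x for x in inputs]


async def handle_request(queue, x):
    """Per-request coroutine: enqueue the input with a future, then await the result."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((x, future))
    return await future


async def inference_loop(queue):
    """Take up to MAX_BATCH_SIZE pending requests, run them as one batch, resolve futures."""
    while True:
        batch = [await queue.get()]  # wait until at least one request is pending
        while len(batch) < MAX_BATCH_SIZE and not queue.empty():
            batch.append(queue.get_nowait())
        outputs = fake_batch_inference([x for x, _ in batch])
        for (_, future), y in zip(batch, outputs):
            future.set_result(y)  # wakes the coroutine awaiting this future


async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(inference_loop(queue))
    results = await asyncio.gather(*(handle_request(queue, i) for i in range(10)))
    worker.cancel()
    return results


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Each request coroutine sees an ordinary awaitable result, while the single inference coroutine decides how requests are grouped, which is what keeps the REST API looking synchronous to the client.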

Input/Output Tensors

I chose NumPy ndarrays rather than PyTorch Tensors as the inputs and outputs of the server, simply because of how ubiquitous and well supported they are. This means that clients of the server don’t need to know about torch or have it installed, just NumPy. There are also more options available for serialisation and deserialisation of NumPy ndarrays than of PyTorch Tensors.
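
For illustration only, the round trip that a client and server need can be sketched with NumPy's own `.npy` format. The choice of `np.save` and `np.load` here is an assumption made for the example, not the serialisation the server uses, and `array_to_bytes` and `bytes_to_array` are hypothetical helper names.

```python
import io

import numpy as np


def array_to_bytes(arr):
    """Serialise an ndarray (dtype and shape included) to bytes for an HTTP body."""
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    return buf.getvalue()


def bytes_to_array(payload):
    """Inverse of array_to_bytes."""
    return np.load(io.BytesIO(payload), allow_pickle=False)


if __name__ == "__main__":
    x = np.zeros((3, 256, 256), dtype=np.float32)  # one input without batch dimension
    y = bytes_to_array(array_to_bytes(x))
    print(y.shape, y.dtype)
```

Nothing on the client side depends on torch: the payload is plain bytes and the only import beyond the standard library is NumPy.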

Serialisation & Deserialisation with PyArrow

To maximise the utilisation of the GPU we want to make the CPU operations as fast as possible, so we have time to batch as many requests together as possible. The bulk of the CPU time in the server is consumed by the serialisation and deserialisation of the input and output tensors before sending them over the network, so I spent some time speed testing different options.

In the end, I chose to use PyArrow because it was the fastest overall, both at serialisation and especially at deserialisation, which is a zero-copy operation and therefore very fast.


I used Prometheus to monitor the code as it is both straightforward and open source. It also allows dashboards to be built on top using Grafana.


To reiterate, the basic principle of operation is to buffer inference requests while GPU inference is in progress.

Practically this is achieved by buffering new inference requests until:

  1. Work has been submitted to the GPU and that work has finished.
  2. A new request is ready and no GPU inference is in progress.
  3. The buffer is full up to the max batch size and another request has been received.

I found this simple logic gave a good tradeoff between throughput & latency while only requiring max batch size to be set.
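
One possible reading of these rules, written as a pure decision function. The name `should_dispatch` and its arguments are hypothetical, and the treatment of rule 3 (a full buffer forces a dispatch, which in practice means waiting for the GPU first) is an interpretation rather than the server's actual code.

```python
def should_dispatch(buffered, gpu_busy, max_batch_size):
    """Decide whether the buffered requests should be sent to the GPU now.

    buffered: number of requests currently waiting in the buffer.
    gpu_busy: True while a previously submitted batch is still running.
    """
    if buffered == 0:
        return False  # nothing to run
    if not gpu_busy:
        return True  # rules 1 and 2: the GPU is idle, so run what we have
    # Rule 3, as interpreted here: the buffer cannot grow any further, so the
    # caller has to wait for the GPU and dispatch instead of buffering more.
    return buffered >= max_batch_size


if __name__ == "__main__":
    print(should_dispatch(buffered=1, gpu_busy=False, max_batch_size=64))
    print(should_dispatch(buffered=10, gpu_busy=True, max_batch_size=64))
    print(should_dispatch(buffered=64, gpu_busy=True, max_batch_size=64))
```

Under this reading the batch size adapts by itself: a lightly loaded server dispatches single requests immediately, while a busy one accumulates requests for as long as the previous batch is running.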

Inference Server code


--max-batch-size: specifies the maximum batch size to buffer inputs up to.

--model-definition: Python code instantiating the "model" object.

--input-shape: input shape without the batch dimension.

Example call

pytorch_inference_server --max-batch-size 64 --input-shape 3 256 256 --model-definition "import torchvision\nmodel = torchvision.models.resnet18(pretrained=True).eval()"
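
How the server parses these arguments is not shown here, so the following is only a sketch of how such a command line could be handled with `argparse`. The functions `parse_args` and `load_model` are hypothetical, and the handling of the literal `\n` in the model definition is an assumption. The demo uses `math.sqrt` as a stand-in "model" so that it runs without torchvision.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(prog="pytorch_inference_server")
    parser.add_argument("--max-batch-size", type=int, required=True,
                        help="Maximum number of inputs to buffer into one batch.")
    parser.add_argument("--input-shape", type=int, nargs="+", required=True,
                        help="Input shape without the batch dimension.")
    parser.add_argument("--model-definition", type=str, required=True,
                        help='Python code that defines a variable named "model".')
    return parser.parse_args(argv)


def load_model(definition):
    # Assumption: a literal backslash-n typed on the command line separates statements.
    source = definition.replace("\\n", "\n")
    namespace = {}
    exec(source, namespace)  # executes arbitrary code, so only safe for trusted input
    return namespace["model"]


if __name__ == "__main__":
    args = parse_args(["--max-batch-size", "64", "--input-shape", "3", "256", "256",
                       "--model-definition", "import math\\nmodel = math.sqrt"])
    print(args.max_batch_size, tuple(args.input_shape), load_model(args.model_definition))
```

Passing the model as code keeps the server independent of any particular architecture, at the price of executing whatever string it is given.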


Python 3.7.









To evaluate the performance of the server we need to establish a baseline of the best possible performance. After playing around for some time this is the fastest implementation I could come up with. Note that x is preallocated to simulate ideal batch preparation.

The results were:

  • Batch size 64: 153ms latency, 421 Examples / Second throughput.
  • Batch size 1: 7.9ms latency, 127 Examples / Second throughput.
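
The two figures in each line are related: if one batch is processed at a time, throughput is roughly the batch size divided by the batch latency. A quick check with the numbers above (`throughput` is a hypothetical helper name):

```python
def throughput(batch_size, latency_seconds):
    """Examples per second when batches are processed one after another."""
    return batch_size / latency_seconds


print(round(throughput(64, 0.153)))   # 418
print(round(throughput(1, 0.0079)))   # 127
```

Both values agree closely with the measured 421 and 127 Examples / Second; the small gap for batch size 64 is presumably due to rounding in the quoted latency.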


I manually measured the test results using the same server arguments as the above example call. I ran parallel synchronous request processes to increase the number of concurrent requests. See the test code below:

The results are averaged across 100 iterations for batch sizes of 2 and below and 100 for batch sizes above 2. I didn’t collect results for the fields marked NULL, as the GPU was saturated, as described earlier in the Throughput & Latency vs Batch Size section. Saturation manifests itself as doubling latency and constant throughput when doubling the request batch size or the number of concurrent requests along each axis respectively. In the plots, I applied exactly this extrapolation to give them the correct shape. The measured results can be found in Tables 1 & 2 and the surface graphs are in Figures 4 & 5.

Overall I’m super happy with the results. The server seems to do a commendable job at utilising the GPU regardless of what sort of requests you throw at it.

Table 1: Throughput (Examples / Second) test results
Table 2: Latency (seconds) test results
Figure 4: Throughput test results
Figure 5: Latency test results

Possible future improvements

  1. gRPC. I wanted to do this as a gRPC implementation since it sounds like that might be faster and a more future-proof design. However, I’m not sure how to manage the concurrency, serialisation or deserialisation, and it isn’t clear to me if it can be made as performant.
  2. I am considering wrapping this in Kubernetes Knative to allow for horizontal scaling to multiple GPUs.
  3. Parameter updates. I plan to use this server for reinforcement learning, so there will need to be a parameter server and a mechanism for frequent parameter updates.
  4. Increase the complexity of the batching algorithm for better performance.