Deep Learning Inference at Scale




Dashcams are an essential tool in a trucking fleet, both for the truck drivers and the fleet managers. Video footage can exonerate drivers in accidents and give fleet managers opportunities to coach drivers. However, with a continuously running camera, there is simply too much footage to examine. When a KeepTruckin dashcam is paired with one of our Vehicle Gateways, the camera only uploads the footage immediately preceding a driver performance event (DPE), an anomalous and potentially dangerous driver-initiated event (e.g. hard braking or swerving). Even so, with all of the videos uploaded each day, fleet managers need to sift through the incoming data so that they can direct their attention to the most important videos for safety analysis. And for the videos they do select, they need video overlays that make it easier to understand what happened. These overlays are computed by running deep learning models on the video frame images.

In this post, we’ll discuss the set of video overlays we provide, along with the requirements and architecture of the system we have built to power these overlays.

Video Overlays

We provide the following video overlays to ease the burden for fleet managers:

Close Following

Road-facing dashcam videos are overlaid with lane markers, vehicle bounding boxes, and distance and time estimations.

Lane markers provide clear indication of vehicles making unsafe lane changes, especially when the physical lane lines themselves are not easily visible. Distance and time estimations allow fleet managers to quantitatively understand the context around a DPE. For example, a raw video (without overlays) could tell the fleet manager that another vehicle cut in front of the driver, whereas the video overlaid with time estimations would show that the vehicle did so in violation of the 2-second following distance rule.
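As a sketch of the 2-second rule check described above (function and constant names are hypothetical, not from our codebase): given an estimated gap to the lead vehicle and the ego vehicle's speed, the time headway is simply distance divided by speed.

```python
def time_headway_seconds(distance_m: float, speed_mps: float) -> float:
    """Seconds until the ego vehicle reaches the lead vehicle's position.

    distance_m: estimated gap to the lead vehicle, in meters
    speed_mps:  ego vehicle speed, in meters per second
    """
    if speed_mps <= 0:
        return float("inf")  # stationary vehicle: no following-distance violation
    return distance_m / speed_mps

# A 20 m gap at ~65 mph (~29 m/s) gives ~0.69 s of headway -- well under
# the 2-second rule, so an overlay would flag this moment for the manager.
FOLLOWING_THRESHOLD_S = 2.0
is_violation = time_headway_seconds(20.0, 29.0) < FOLLOWING_THRESHOLD_S
```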

Driver Distraction

Driver-facing dashcam videos are overlaid with 3D head pose bounding boxes.

By viewing an overlaid video, a fleet manager could see if the driver was distracted. If so, the fleet manager can take this opportunity to coach the driver.

System Requirements

Our system for producing video overlays has the following requirements:

It must have high throughput and low latency

We process a total of over 40 million video frames a day. Each video must be processed within an SLA of 5 minutes. Our video postprocessing is compute-intensive and takes several minutes, which only leaves a budget of approximately 20 seconds for the deep learning inference step. This means that we need to consistently process videos at a rate of 17 frames per second.

It must be cost-effective and scalable

While these video overlays provide value to our customers, we need to ensure that we are delivering them cost-effectively. This is challenging because, unlike a typical web service, deep learning inference runs on GPU instances, which are significantly more expensive [0] per unit of time. Our video traffic also varies greatly by time of day: 70% of our videos arrive between 4am and 1pm PST. Therefore, we need to scale our infrastructure automatically, and the high cost of GPU resources means that we need to be precise in doing so. With these tight cost constraints, we also cannot rely on overprovisioning to meet our SLA.


System Architecture

Here is the lifecycle of a video within the inference system:

  1. Our fleet dashboard web application inserts an entry into our SQS queue. The entry consists of an S3 video URL and a webhook callback endpoint.
  2. One of the application workers pulls an entry from the queue. The application worker downloads the video from S3 and breaks it apart into frame images.
  3. The application worker uses our inference client library to make calls in parallel to the necessary model inference endpoint(s).
  4. After collecting the inference results from the inference endpoints, the application worker runs some post-processing logic and then creates a local copy of the video that has overlays drawn onto it.
  5. The application worker submits the video for transcoding.
  6. Once the video is transcoded, the application worker uploads the transcoded overlaid video to S3 and calls the webhook callback endpoint. This notifies the fleet dashboard web application that a new overlaid video is ready for viewing.
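The lifecycle above can be sketched as a pipeline in the application worker. This is a minimal illustration, not our production code: the step implementations (S3 download, frame splitting, gRPC inference, overlay drawing, transcoding, upload, webhook) are injected as hypothetical callables so the control flow stands on its own.

```python
import json

def handle_message(message_body: str, steps) -> str:
    """Run one SQS queue entry through the video overlay lifecycle.

    `steps` bundles the real implementations; injecting them keeps
    the pipeline itself small and testable.
    """
    entry = json.loads(message_body)            # {"video_url": ..., "callback_url": ...}
    video = steps.download(entry["video_url"])  # step 2: fetch video from S3
    frames = steps.split_frames(video)          # step 2: break video into frame images
    results = steps.infer(frames)               # step 3: parallel gRPC inference calls
    overlaid = steps.draw_overlays(video, results)  # step 4: post-process, draw overlays
    transcoded = steps.transcode(overlaid)      # step 5: submit for transcoding
    url = steps.upload(transcoded)              # step 6: upload overlaid video to S3
    steps.notify(entry["callback_url"], url)    # step 6: webhook back to the dashboard
    return url
```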

Next, we’ll discuss the inference related components of the system in more detail.

Application Worker

Our application worker is written in Python due to its extensive library support for machine learning and image processing. In particular, the application worker makes heavy use of opencv and numpy. The application worker runs as a service in our Kubernetes cluster, using normal EC2 CPU instances.

Inference Client Library

Since our application worker is written in Python, we also built our inference client library in Python. The client library uses multiprocessing to make parallel gRPC inference requests to the appropriate TensorFlow Serving endpoints. Ideally, we would have used asyncio to make the parallel requests, since they are I/O bound. However, at the time of writing, the Python gRPC library does not yet support asyncio.
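The fan-out pattern looks roughly like the following sketch. The actual gRPC/TensorFlow Serving call is stubbed out here (`infer_frame` is a placeholder, not our client API) so that only the multiprocessing structure is shown.

```python
from multiprocessing import get_context

def infer_frame(frame_bytes: bytes):
    # Placeholder for the real call: build a TensorFlow Serving
    # PredictRequest, send it over a gRPC channel, and parse the
    # response tensors. Here we just echo the frame size so the
    # fan-out logic is runnable on its own.
    return len(frame_bytes)

def infer_frames_parallel(frames, workers: int = 4):
    """Fan frame-level inference requests out across processes.

    The requests are I/O bound, so a process pool mostly waits on the
    network; it stands in until gRPC's Python library supports asyncio.
    """
    with get_context("fork").Pool(processes=workers) as pool:
        return pool.map(infer_frame, frames)  # preserves input order
```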

Endpoint Registry

The endpoint registry is a service discovery component that the inference client library queries to determine which TensorFlow Serving endpoint to use for a given (model, version) pair. This saves the application worker from having to maintain the endpoint mappings itself. The dynamic lookup is also useful for A/B testing, where the endpoint returned for a given (model, version) pair is randomized; enforcing that randomization requires that clients not maintain their own endpoint mappings anyway.
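A minimal sketch of such a lookup (the registry contents, endpoint names, and weights below are invented for illustration): each (model, version) pair maps to one or more serving endpoints, and the registry picks one at random according to its experiment weight.

```python
import random

# Hypothetical registry state: (model, version) -> [(endpoint, weight), ...]
_ENDPOINTS = {
    ("close_following", "v2"): [("tf-serving-cf-a:8500", 0.9),
                                ("tf-serving-cf-b:8500", 0.1)],
}

def resolve_endpoint(model: str, version: str) -> str:
    """Return a TensorFlow Serving endpoint for (model, version).

    When several endpoints are registered, one is chosen at random by
    its experiment weight, so clients cannot pin themselves to a
    single A/B variant.
    """
    candidates = _ENDPOINTS[(model, version)]
    hosts = [host for host, _ in candidates]
    weights = [weight for _, weight in candidates]
    return random.choices(hosts, weights=weights, k=1)[0]
```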

Inference Endpoints

Our inference endpoints are TensorFlow Serving containers that run as services in our Kubernetes cluster. We run our deep learning endpoints in a Kubernetes namespace dedicated to GPU instances. By splitting the application worker and the inference endpoints across namespaces like this, we can maximize our GPU utilization: the only computations running on our GPU instances are deep learning inferences, plus the overhead that TensorFlow Serving itself introduces. No preprocessing or postprocessing occurs on the GPU instances; those steps happen in the application worker.

Scheduled scaling

We initially tried scaling the application worker on memory usage and the inference endpoints on CPU usage. However, this did not work well for us for various reasons.

Scaling the application worker on memory usage led to erratic changes in the number of pods, because the memory footprint of the application worker varies over the lifecycle of processing a video. One pod finishing a video would shrink its own memory footprint, which in turn reduced the average memory utilization across the entire deployment. The resulting scale-down could then terminate a different pod that was midway through processing a video, which hurts our SLA far more than it would a typical web service serving short-lived requests. The key observation is that we need to scale on our application workload, not on system-level metrics. In other words, we want to scale on the number of input items in the queue. System-level metrics can be volatile, and volatility is undesirable when the cost of retrying a request is high.

Scaling the inference endpoints on CPU usage was problematic in a similar way. Our first attempt at improving scaling was to scale on GPU usage instead. This made intuitive sense because GPU inference is, after all, the primary workload for the endpoints. Surprisingly, we found that scaling on GPU usage was actually worse than scaling on CPU usage. This is because a significant amount of CPU work occurs before the GPU receives its first byte: the TensorFlow Serving web server needs to pull incoming data from network buffers, deserialize the tensor payloads, and then load the payloads onto the GPU. By scaling on GPU usage, we were scaling on the resource that comes second in that sequence. By the time GPU utilization ramps up, CPU utilization has already saturated; essentially, we are "too late." We then approached the scaling problem from an application workload perspective. We empirically determined the maximum throughput of a given TensorFlow Serving endpoint, and with that, we were able to compute an appropriate ratio of queue items : application worker pods : inference pods.
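As a sketch of workload-based scaling (the capacities below are illustrative assumptions, not our measured production numbers), both replica counts can be derived directly from the queue depth while holding the pod ratio fixed:

```python
import math

# Illustrative, empirically measured capacities (assumed numbers):
VIDEOS_PER_WORKER_POD = 2          # queued videos one worker pod can absorb in time
WORKER_PODS_PER_INFERENCE_POD = 2  # worker pods one TF Serving pod can keep fed

def pods_for_queue(queue_depth: int):
    """Scale on the application workload, not system metrics: derive
    worker and inference replica counts from the number of queued videos."""
    worker_pods = max(1, math.ceil(queue_depth / VIDEOS_PER_WORKER_POD))
    inference_pods = max(1, math.ceil(worker_pods / WORKER_PODS_PER_INFERENCE_POD))
    return worker_pods, inference_pods
```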

Armed with this ratio, we looked at our counts for incoming videos throughout the day. Our workload looked periodic, and so we deployed a schedule that scaled the application worker and inference endpoint pods proportionally to the expected traffic at each hour, while maintaining the aforementioned pod ratios.
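A scheduled-scaling policy of this shape can be sketched as follows. The traffic profile and per-pod throughput here are invented for illustration (they roughly match the 70%-between-4am-and-1pm pattern mentioned earlier, not our real counts); the point is that each hour's replica counts track expected traffic while preserving the worker : inference pod ratio.

```python
import math

# Illustrative hourly traffic profile (videos/hour, PST): traffic is
# front-loaded, with ~70% of videos arriving between 4am and 1pm.
HOURLY_VIDEOS = {hour: 3500 if 4 <= hour < 13 else 900 for hour in range(24)}

VIDEOS_PER_WORKER_HOUR = 100       # assumed worker-pod throughput
WORKER_PODS_PER_INFERENCE_POD = 2  # assumed fixed pod ratio

def schedule_for(hour: int):
    """Replica counts to deploy at a given hour of the day,
    proportional to expected traffic at that hour."""
    worker_pods = max(1, math.ceil(HOURLY_VIDEOS[hour] / VIDEOS_PER_WORKER_HOUR))
    inference_pods = max(1, math.ceil(worker_pods / WORKER_PODS_PER_INFERENCE_POD))
    return worker_pods, inference_pods
```

In practice a cron-style job would apply these counts to the two deployments at the top of each hour.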

What’s next

This system has been running in production and processing videos for the past month. We have met our SLA all throughout, and we have done so within our cost budget. That said, we on the Machine Learning Platform team still have a lot of items on our roadmap:

  • A/B testing
  • Support for more frameworks
  • Model training system

We will also be writing a follow-up blog post about our prior experience using AWS SageMaker to build this inference system.

Interested in building machine learning infrastructure? Come work for us!

Thanks to our Platform team for helping us build out GPU nodes on our Kubernetes cluster.


[0] At the time of writing, we use p3.2xlarge GPU instances for inference and m5.12xlarge CPU instances for the application worker. These have an on-demand hourly cost of $3.06 and $2.304, respectively. However, we are able to pack 8 application worker pods onto a single m5.12xlarge, whereas a single inference pod uses an entire p3.2xlarge. The comparison is therefore really between $3.06 and $2.304 / 8 = $0.288, a difference of over 10x. We have plans to pack more models onto a single GPU instance, which should reduce this multiplier. Another way of comparing the two instance types is by cost per core-hour: $3.06 / 8 = $0.3825 vs $2.304 / 48 = $0.048, a ~8x difference.