Implementing request-based autoscaling for machine learning workloads

Original article can be found here (source): Artificial Intelligence on Medium

Machine learning inference workloads can be particularly challenging to scale, for a few reasons:

  • Models can be huge. Modern models routinely exceed 5 GB, meaning they require larger—and more expensive—instance types.
  • Concurrency is a pain. It’s not uncommon for a single prediction to utilize 100% of an instance. As a result, some workloads require several instances to serve fewer than 10 concurrent users.
  • Latency can be expensive. For many use cases, GPU inference is the only way to achieve acceptable latency, as we documented in our benchmark of GPT-2. And GPU instances, comparatively, are not cheap.

With Cortex v0.14, we’ve implemented a new request-based autoscaling system. It allows Cortex to autoscale inference workloads more dynamically and precisely, resulting in clusters with just enough instances to serve predictions with minimal latency and no unused resources.

This article goes in depth on our many attempts to build this new autoscaler, and how it ultimately came together.

First, why not use EC2’s out-of-the-box autoscaling?

Cortex currently only supports AWS (though other clouds are on our short-term roadmap), and AWS EC2 supports instance autoscaling using CloudWatch metrics. Why not just use that?

EC2’s autoscaler has a major limitation which makes it suboptimal for Kubernetes clusters: It monitors at the instance level.

The basic problem with this is that EC2 doesn’t know what is actually running on each instance; it only knows the overall CPU usage for the entire VM. Within a Cortex cluster, however, a given instance may host replicas of multiple APIs.

So, to take an intentionally simple example, if an instance has:

  • One replica of a text generating API
  • One replica of a sentiment classifying API

And each replica is allocated half of the available resources on the instance, what happens if the text generating API uses 100% of its resources while the sentiment classifying API uses 0%?

From EC2’s perspective, the instance is at 50% utilization and doesn’t need to autoscale—even though the text generating API clearly needs more replicas, while the sentiment classifying API can be scaled down.
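To make the blind spot concrete, here’s a toy calculation (hypothetical utilization numbers, for illustration only) showing how instance-level averaging hides a saturated replica:

```python
# Hypothetical per-replica CPU utilization on one instance,
# where each replica is allocated half the instance's CPU.
text_gen_util = 1.00   # text generating API: fully saturated
sentiment_util = 0.00  # sentiment classifying API: idle

# EC2 only sees the instance-wide average:
instance_util = (text_gen_util + sentiment_util) / 2
print(instance_util)  # 0.5 -> below a typical scale-up threshold

# A replica-aware autoscaler would instead look at each replica:
needs_more_replicas = text_gen_util >= 0.9  # the text generator is starved
can_scale_down = sentiment_util <= 0.1      # the classifier is wasting resources
```

The two replica-level checks reach opposite conclusions from the instance-level average, which is exactly the problem described above.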

EC2’s standard approach of tying autoscaling to CPU usage, however, seemed on the right track to us—it just needed more insight.

Idea #1: Autoscaling based on replica CPU utilization

Our initial solution was to maintain the utilization-based approach to autoscaling, but on the replica level—not the instance level.

Kubernetes’s Horizontal Pod Autoscaler comes built in with the ability to autoscale replicas based on CPU utilization, meaning all we had to do was pass it CPU utilization metrics from individual replicas. To gather that information, we deployed the Kubernetes Metrics Server, which exposes replica-level CPU usage via its Resource Metrics API.
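For reference, the Horizontal Pod Autoscaler’s core scaling rule is a simple ratio — desired replicas equals current replicas scaled by how far the observed metric is from the target. A minimal sketch:

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_utilization: float,
                         target_utilization: float) -> int:
    """Kubernetes HPA scaling rule:
    desired = ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_utilization / target_utilization)

# 3 replicas averaging 90% CPU against a 50% target -> scale to 6
print(hpa_desired_replicas(3, 0.90, 0.50))  # 6
```

With the Metrics Server feeding per-replica CPU numbers into this rule, the autoscaler reacts to each API’s replicas individually rather than to the instance as a whole.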

With some wrangling, we had a functional autoscaling system. When replicas were consuming too much CPU, the pod autoscaler spun up new ones. When there was no room on an instance for new replicas, the cluster autoscaler launched a new instance. Simple and clean.

We quickly, however, ran into issues.

The first and most obvious was that many Cortex users run GPUs for inference. For APIs that did most of their work on the GPU, CPU utilization could remain low even as GPU utilization hit 100%.

This issue had a conceptually straightforward fix: monitor GPU utilization. But combining GPU and CPU utilization into a single metric, while simple in principle, proved harder than we expected, and by then we had come up with a better idea.
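One naive way to combine the two signals (our hypothetical sketch, not something Cortex shipped) is to treat a replica as bottlenecked by its busiest resource:

```python
from typing import Optional

def replica_utilization(cpu_util: float, gpu_util: Optional[float]) -> float:
    # A replica is bottlenecked by its busiest resource, so one naive
    # combined metric is simply the maximum of the two.
    if gpu_util is None:  # CPU-only replica
        return cpu_util
    return max(cpu_util, gpu_util)

print(replica_utilization(0.15, 0.98))  # 0.98: GPU-bound, should scale up
print(replica_utilization(0.60, None))  # 0.6: plain CPU utilization
```

Even this simple version hints at the practical complications — gathering GPU metrics reliably, weighting resources, handling replicas with no GPU at all — that made latency look like a more attractive signal at the time.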

Idea #2: Autoscaling based on inference latency

The impact of exceeding a replica’s resources, regardless of whether it is deployed on a GPU or CPU instance, is a spike in latency. As such, our next idea was to build an autoscaler that used a user-defined latency target as its trigger.

The idea was attractive for a couple of reasons:

  • It had been suggested by Cortex users as well as our internal team.
  • It could be implemented using metrics Cortex already tracks.

As we dug deeper, however, we found a number of issues with this idea.

First, assuming that users know their target latency before launching their API is risky. If a user’s update to their model made inference take longer than their target latency, that would trigger infinite scaling. Probably not ideal.

Second, inference latency is a poor signal for scaling down clusters. The latency of a request only changes when a replica is oversubscribed. Whether a replica is barely being used, or is at 90% utilization, latency will be the same so long as there are no queued requests. If latency does not decline as an instance becomes more idle, it cannot be trusted as a good signal for scaling down.
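The scale-down blind spot is easy to see in a toy single-server model (ours, purely for illustration):

```python
def observed_latency(service_time_s: float, queued_requests: int) -> float:
    # Toy single-server model: a new request waits for everything queued
    # ahead of it, then takes service_time_s to process itself.
    return service_time_s * (queued_requests + 1)

# With an empty queue, latency is identical whether the replica is
# nearly idle or at 90% utilization:
print(observed_latency(0.2, 0))  # 0.2s -- tells us nothing about slack
# Latency only moves once requests start queuing:
print(observed_latency(0.2, 4))  # 1.0s -- replica is oversubscribed
```

Latency is a step function of oversubscription, not a continuous measure of utilization — which is why it can trigger scale-ups but gives no signal for scale-downs.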

Finally, if a user’s prediction API makes calls to a third-party API, any slowdowns in that third-party API will be reflected in the inference latency, potentially triggering unnecessary scaling.

Overall, we decided latency was too flawed of a signal for autoscaling.

Idea #3: Autoscaling based on queue length

Our final idea for autoscaling took a fundamentally different approach. Instead of looking at our API’s performance — either via resource utilization or latency — we thought, why can’t our autoscaler calculate how many instances are needed simply by counting how many requests are coming in?

In order to implement this request-based autoscaling—which is how Cortex autoscaling currently works—Cortex only needs to know how many concurrent requests there are (in other words, how many requests are “in flight,” or have come in and have not yet been responded to).

Writing the autoscaling logic was straightforward, even with all of the knobs we added for users to configure its behavior. What wound up taking days of extra engineering time was counting the damn requests.

Because of the way we’d built Cortex to use Flask and Gunicorn, we weren’t able to count requests as they came into the queue, but only after they were picked up off the queue for processing. Without knowing how many requests were queued, Cortex had no way of knowing whether to scale or not.

We had a few initial ideas for getting around this, including:

  • Building a request forwarder that would count requests as they came in and forward them to the APIs.
  • Switching from Istio to Gloo for our networking layer, since Gloo appeared to support a metric for in-flight requests.

The first idea seemed like it could cost a disproportionate amount of engineering time, and the second would require us to rewrite our networking code and deploy a Prometheus service.

Instead, we decided to replace Flask and Gunicorn with FastAPI and Uvicorn, which offer native support for async request processing. Thanks to Uvicorn’s async event loop, we can count requests as they come in, before they are processed in a separate thread pool.
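A minimal sketch of why an async event loop makes in-flight counting easy — written with plain asyncio for illustration, not Cortex’s actual middleware:

```python
import asyncio

class InFlightCounter:
    """Tracks requests that have arrived but not yet been responded to."""
    def __init__(self):
        self.in_flight = 0

    async def handle(self, request_coro):
        # Increment the moment the request arrives on the event loop,
        # before any processing happens -- the visibility Flask/Gunicorn
        # couldn't give us.
        self.in_flight += 1
        try:
            return await request_coro
        finally:
            self.in_flight -= 1

async def demo():
    counter = InFlightCounter()

    async def slow_predict():
        await asyncio.sleep(0.05)  # stand-in for model inference
        return "prediction"

    tasks = [asyncio.ensure_future(counter.handle(slow_predict()))
             for _ in range(3)]
    await asyncio.sleep(0.01)  # let all three requests start
    during = counter.in_flight  # all three are in flight
    await asyncio.gather(*tasks)
    after = counter.in_flight   # all responses sent
    return during, after

print(asyncio.run(demo()))  # (3, 0)
```

Because everything runs on a single event loop, the counter needs no locks, and the autoscaler can read an accurate in-flight count at any moment.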

Cortex calculates how many replicas to autoscale to by dividing the total number of concurrent requests by the concurrency capacity of each replica. This capacity defaults to 1, but users can change it by setting a target_replica_concurrency parameter in their config files.
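The calculation above can be sketched in a few lines (the target_replica_concurrency parameter is from Cortex’s config; the min/max bound names here are illustrative):

```python
import math

def desired_replicas(total_in_flight: int,
                     target_replica_concurrency: float = 1.0,
                     min_replicas: int = 1,
                     max_replicas: int = 100) -> int:
    # Request-based rule: total concurrent requests divided by each
    # replica's concurrency capacity, rounded up, then clamped to the
    # user-configured bounds.
    raw = math.ceil(total_in_flight / target_replica_concurrency)
    return max(min_replicas, min(max_replicas, raw))

print(desired_replicas(10))                                # 10
print(desired_replicas(10, target_replica_concurrency=4))  # 3
print(desired_replicas(0))                                 # 1 (never below min)
```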

As a result, Cortex clusters autoscale to exactly as many replicas as are needed, keeping costs as low as possible.

On the roadmap for request-based autoscaling

There is one nuance we haven’t accounted for yet in implementing request-based autoscaling, which we’re looking forward to working on in future versions.

Right now, Cortex’s default behavior is to autoscale to however many replicas it takes to achieve the target per-replica concurrency, within the boundaries set by the user. What Cortex doesn’t account for, however, is the rate of change in queue length.

For example, if you have a speech-to-text API and requests are queuing, it stands to reason that your cluster needs to scale up. But if that queue is shrinking, your current replicas are actually capable of handling the incoming requests.

In this situation, it’s possible that by the time your new replicas are running, your queue will have already been depleted. In some cases, it would actually be best to let already-active replicas clear the queue on their own.
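The refinement we have in mind could look something like this — a hypothetical sketch of a trend-aware check, not anything Cortex implements today:

```python
def should_add_replicas(queue_len: int, prev_queue_len: int) -> bool:
    # Hypothetical refinement (not yet in Cortex): only scale up when the
    # queue is growing. A nonempty but shrinking queue means the current
    # replicas are already draining it, and new replicas might only come
    # online after the backlog is gone.
    return queue_len > 0 and queue_len > prev_queue_len

print(should_add_replicas(queue_len=8, prev_queue_len=5))  # True: backlog growing
print(should_add_replicas(queue_len=3, prev_queue_len=5))  # False: already draining
```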

But… that’s a problem for the next sprint.