Source: Deep Learning on Medium
How to run machine learning at scale — without going broke
Machine learning is computationally expensive — and because serving real-time predictions means running your ML models in the cloud, that computational expense translates into real dollars.
For example, if you wanted to add a translation feature to your app that automatically translated text into each user’s preferred language, you would deploy an NLP model as a web API for your app to consume. To host this API, you would need to deploy it through a cloud provider like AWS, put it behind a load balancer, and implement some kind of autoscaling (probably involving Docker and Kubernetes).
None of the above is free, and if you’re dealing with a large amount of traffic, the total cost can get out of hand. This is especially true if you aren’t optimizing your spend.
In this guide, we’re going to look at why serving inference from the cloud is so expensive, and we’ll dig into the different things you can do to lower your spend (most of which you can do automatically simply by using Cortex).
Why are inference workloads expensive?
The reason inference workloads are so expensive is simple: machine learning requires a lot of resources, and resources cost money.
Unlike a typical database request you might see in a web app, a single inference can require significant compute and memory. That’s why GPUs are so useful for ML — they provide more raw horsepower for running inference. That extra power, however, carries a higher price tag.
Inference workloads do have one advantage: they are read-only. Once a model is loaded into a serving process, it doesn’t write anything back to a database, so losing a server is cheap, as there is no state to recover. This creates an opportunity to save money, which we’ll dig into later.
How much does it cost to run inference at scale?
I’ll be using AWS jargon here, but the general concepts translate to other cloud providers.
The cost of running inference at scale is primarily driven by your EC2 bill. The formula looks something like:
(price per instance-hour) × (number of instances) × (number of hours)
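As a quick sketch, here is that formula in Python with illustrative numbers (the hourly price below is an assumption for a GPU-class instance; check current AWS pricing for real figures):

```python
# Rough EC2 cost estimate for an inference cluster.
# The hourly price is an illustrative assumption, not a current AWS quote.
hourly_price = 0.526   # e.g. a g4dn.xlarge-class GPU instance, in $/hour
num_instances = 10     # cluster size
hours = 24 * 30        # one month of continuous serving

monthly_cost = hourly_price * num_instances * hours
print(f"${monthly_cost:,.2f} per month")  # 0.526 * 10 * 720 = $3,787.20
```

Even a modest 10-instance GPU cluster runs into the thousands of dollars per month before any managed-service premiums.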
The second driver of cost is the premium you may pay on top of raw compute for managed services. For example, SageMaker adds a ~40% premium to your EC2 bill, and Fargate adds $0.04048 per vCPU per hour (in AWS’s US West region). For small workloads, this may not be a huge concern, but as you run more inference, these mark-ups become significant.
Finally, you need to factor in the cost of services like Elastic Load Balancing (ELB) for load balancing.
Cumulatively, these expenses can add up very quickly, especially if you’re paying for more resources than you actually need.
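To see how the mark-ups compound, here is a rough comparison. The ~40% SageMaker premium and the Fargate per-vCPU rate come from the figures above; the base cluster size, instance price, and vCPU count are illustrative assumptions:

```python
# Compare raw EC2 cost to managed-service pricing for the same workload.
# Base numbers (price, cluster size, vCPUs) are illustrative assumptions.
hours = 24 * 30                              # one month
ec2_cost = 0.526 * 10 * hours                # 10 instances at $0.526/hr

sagemaker_cost = ec2_cost * 1.40             # ~40% premium on the EC2 bill
fargate_premium = 0.04048 * 4 * 10 * hours   # $0.04048 per vCPU-hour, 4 vCPUs x 10 tasks

print(f"EC2 alone:  ${ec2_cost:,.2f}")
print(f"SageMaker:  ${sagemaker_cost:,.2f}")
print(f"Fargate adds ${fargate_premium:,.2f} on top of compute")
```

At this scale, the SageMaker premium alone is roughly $1,500/month — money that buys no additional compute.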
How to minimize the cost of your inference workload
At a high level, the goal is to build your inference cluster from the cheapest building blocks and use them as efficiently as possible.
1. Pick the right instances
GPU infrastructure tends to speed up complex deep learning model inference, but CPU infrastructure may be more cost-effective for simpler models. While some state-of-the-art models like OpenAI’s GPT-2 demand a lot of compute and memory for a single inference, others do not.
Pick the cheapest instance that meets your latency requirements.
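A useful way to frame the comparison is cost per inference rather than cost per hour. The sketch below uses hypothetical prices and throughputs (benchmark your own model to get real numbers); depending on the model, either CPU or GPU can come out ahead:

```python
# Cost per 1M inferences = hourly price / inferences per hour.
# All prices and throughputs below are hypothetical examples.
def cost_per_million(hourly_price, inferences_per_sec):
    inferences_per_hour = inferences_per_sec * 3600
    return hourly_price / inferences_per_hour * 1_000_000

cpu = cost_per_million(hourly_price=0.17, inferences_per_sec=20)    # CPU instance
gpu = cost_per_million(hourly_price=0.526, inferences_per_sec=200)  # GPU instance

print(f"CPU: ${cpu:.2f} per 1M inferences")  # ~$2.36
print(f"GPU: ${gpu:.2f} per 1M inferences")  # ~$0.73
```

In this hypothetical, the GPU is cheaper per inference for a heavy model despite its higher hourly price — but for a simple model whose CPU throughput is already high, the cheaper CPU instance would win. Measure, then pick.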
2. Spin down instances when possible
In addition to choosing the appropriate instance types for your inference workload, the simplest way to optimize spend is to automatically adjust the size of your cluster based on traffic. In other words, if 5 instances can handle all the traffic, there is no reason to have 8 instances in your cluster. For large clusters handling fluctuating traffic, this is often the biggest opportunity for cost savings.
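The core autoscaling calculation is simple: size the cluster from current traffic instead of provisioning for peak. A minimal sketch, assuming a hypothetical per-instance throughput you would measure by benchmarking:

```python
import math

# Size the cluster from observed traffic instead of provisioning for peak.
# per_instance_rps is a hypothetical benchmark of one instance's throughput.
def instances_needed(requests_per_sec, per_instance_rps, min_instances=1):
    return max(min_instances, math.ceil(requests_per_sec / per_instance_rps))

per_instance_rps = 50
for traffic in [40, 240, 1000]:
    print(f"{traffic} req/s -> {instances_needed(traffic, per_instance_rps)} instances")
```

Real autoscalers (including Kubernetes-based ones) layer smoothing, cooldowns, and headroom on top of this, but the savings come from the same idea: don’t pay for instance-hours your traffic doesn’t need.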
3. Use spot instances
AWS sells its unused capacity at a steep discount in the form of spot instances. The catch is that AWS may reclaim a spot instance at any time with little warning. That makes spot instances risky for stateful web services, but it isn’t a big deal for read-only services — like your inference workload.
If you set up your infrastructure to handle failovers gracefully, you can use spot instances to significantly lower your AWS bill at little to no risk to your service.
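Because inference is read-only, a failover can be as simple as retrying the request elsewhere. The sketch below simulates that pattern; the `predict_*` functions are hypothetical stand-ins for real model servers, not any particular library’s API:

```python
import random

# Sketch: serve from cheap spot-backed replicas first, and fall back to
# on-demand capacity if a spot node has been interrupted. The predict_*
# functions are hypothetical stand-ins for real model-serving replicas.
class SpotInterrupted(Exception):
    pass

def predict_on_spot(x):
    if random.random() < 0.1:          # simulate an occasional spot interruption
        raise SpotInterrupted
    return f"spot:{x}"

def predict_on_demand(x):
    return f"on-demand:{x}"

def predict(x):
    try:
        return predict_on_spot(x)
    except SpotInterrupted:
        return predict_on_demand(x)    # graceful failover keeps the API up

print(predict("hello"))
```

Since no request writes state, the interrupted node loses nothing; the caller just gets served by a different replica.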
4. Share your instances
Think ridesharing, but for your instances.
If you are deploying multiple models, it’s better to run them on one shared pool of instances than to spin up separate infrastructure for each service. When one service’s resource utilization is low, the others can use its spare capacity, and vice versa.
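The savings come from offsetting traffic patterns. A toy illustration with two services whose (hypothetical) hourly demand curves peak at different times:

```python
import math

# Hourly demand for two services, in fractions of an instance.
# The numbers are hypothetical, chosen to show offsetting peaks.
service_a = [3.5, 3.0, 1.0, 0.5]   # busy early in the day
service_b = [0.5, 1.0, 3.0, 3.5]   # busy late in the day

def peak_instances(demands):
    # A pool must be provisioned for its peak demand.
    return math.ceil(max(demands))

separate = peak_instances(service_a) + peak_instances(service_b)        # dedicated pools
shared = peak_instances([a + b for a, b in zip(service_a, service_b)])  # one shared pool

print(f"separate pools: {separate} instances, shared pool: {shared}")
```

Two dedicated pools must each be sized for their own peak (8 instances total here), while the shared pool only needs to cover the combined peak (4 instances).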
5. Share load balancers
Instead of paying for a dedicated load balancer to sit in front of every service you launch, register as many endpoints as you can to a single ELB.
6. Separate your inference cluster from other workloads
This may sound like it conflicts with point #4, and in some cases it does. However, if you’re using GPUs for your inference cluster, you don’t want simple cron jobs taking up GPU capacity or preventing the cluster from scaling down when it otherwise could.
7. Use open source software
This final point is probably the most important on the list. Free and open source tools for managing your cluster, like Cortex, will do all of the above for you automatically (support CPU and GPU instances, handle autoscaling, manage the networking, and run on spot instances), so you get these optimizations without paying the premium of SageMaker or Fargate.