Why are data scientists doing DevOps?

Original article was published on Artificial Intelligence on Medium

Data scientists are doing two different jobs

If you were going to diagram a production machine learning pipeline, the beginning—designing and training models, etc.—would obviously belong to the data science function.

At some point, typically when it’s time to take models to production, a normal pipeline will transition from data science to infrastructure tasks. Intuitively, this is where the data science team hands things over to someone else, like DevOps.

But this is not always the case. More and more, data scientists are being asked to handle deploying models to production as well.

According to Algorithmia, a majority of data scientists report spending over 25% of their time on model deployment alone. Anecdotally, you can see the same trend in how many data scientist job postings list Kubernetes, Docker, and EC2 under “necessary experience.”

Why data scientists shouldn’t have to handle model serving

The simplest answer here is that model serving is an infrastructure problem, not a data science problem. You can see this by just comparing the stacks used for each:

Model development vs. model deployment

There are of course some data scientists who like DevOps and can work cross-functionally, but they are rare. In fact, I would say the overlap between data science and DevOps is frequently overestimated.

To flip things around, would you expect a DevOps engineer to be able to design a new model architecture, or to have a ton of experience with hyperparameter tuning? There probably are DevOps engineers who have those data science skills, and everything is learnable, but it would be odd to consider those responsibilities the domain of your DevOps team.

Data scientists, in all likelihood, didn’t get into the field to worry about autoscaling or to write Kubernetes manifests. So why do companies make them do it?

Companies are neglecting their infrastructure

At many organizations, there’s a fundamental misunderstanding of how complex model serving is. The attitude is often “Just wrapping a model in Flask is good enough for now.”

The reality is, serving models at any scale involves solving some infrastructure challenges. For example:

  • How do you update models in production automatically—without any downtime?
  • How do you efficiently autoscale a 5 GB model that runs on GPUs?
  • How do you monitor and debug production deployments?
  • How do you do all of this without running up a massive cloud bill?
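To make the point concrete, here is roughly what the “just wrap it in Flask” approach looks like, sketched with only the standard library (a Flask version is equally short). The model and payload shape are hypothetical stand-ins. Notice what’s missing: there is one process, no health checks, no autoscaling, and no way to swap in a new model without downtime, which is exactly where the infrastructure questions above come in.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    # Stand-in for a real trained model's inference call.
    return {"label": "positive" if sum(features) > 0 else "negative"}


class PredictHandler(BaseHTTPRequestHandler):
    """Bare-bones 'model as a web service' wrapper."""

    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        result = predict(json.loads(body)["features"])
        payload = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # silence per-request logging


# To serve (blocks forever, one process, no scaling):
# HTTPServer(("0.0.0.0", 8080), PredictHandler).serve_forever()
```

This is fine for a demo. It is not fine once you need rolling updates, GPU autoscaling, or observability, which is the gap between “it runs” and “it’s in production.”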

Now, to be fair, ML infrastructure is a fairly new concept. Uber only revealed Michelangelo, their cutting-edge internal ML infrastructure, two years ago. The playbook for ML infrastructure is still being written in a lot of ways.

However, there are still plenty of examples of how an organization can separate the concerns of data science and DevOps, without the engineering resources of an Uber.

How to separate data science and DevOps

My opinions on these topics are mostly informed by my work on Cortex, our open source model serving platform. We designed Cortex to delineate data science from DevOps, and to automate all the infrastructure code we were writing. Since open sourcing, we’ve worked with data science teams who’ve adopted it, and their experiences have also informed our approach.

We conceptualize the handoffs between data science, DevOps, and product engineering with a simple, abstract architecture we refer to as Model-API-Client:

  • Model. A trained model, with some kind of predict() function that engineers can use without needing data science expertise.
  • API. The infrastructure layer that takes a trained model and deploys it as a web service. We built Cortex to automate this layer.
  • Client. The actual application that interacts with the web service deployed in the API layer.

In the model phase, data scientists train and export a model. They may also write a predict() function for generating and filtering predictions from the model.
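A minimal sketch of what that handoff artifact might look like. The model class, labels, and threshold here are hypothetical; in practice the model would be loaded from an exported artifact (a pickle, SavedModel, ONNX file, etc.). The point is the interface: everything data-science-specific is hidden behind one predict() function.

```python
class SentimentModel:
    """Stand-in for a trained model loaded from an exported artifact."""

    def __call__(self, text):
        # Real inference would run the network; we fake a score here.
        return 0.92 if "great" in text.lower() else 0.31


_model = SentimentModel()


def predict(payload):
    """The only function the DevOps side needs to know about."""
    score = _model(payload["text"])
    # Post-filtering the data scientist owns: thresholds, label names.
    return {
        "label": "positive" if score >= 0.5 else "negative",
        "confidence": round(score, 2),
    }
```

From here on, nobody downstream needs to know how the score is produced, only that predict() takes a payload and returns a JSON-serializable result.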

They then hand this model off to the API phase, at which point it is entirely the DevOps function’s responsibility. To the DevOps function, the model is just a Python function that needs to be turned into a microservice, containerized, and deployed.

Once the model-microservice is live, product engineers query it like any other API. To them, the model is just another web service.
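From the client side, that might look like the sketch below. The endpoint URL and payload shape are hypothetical, whatever the API layer exposes, but the shape of the code is the point: it is indistinguishable from calling any other internal service.

```python
import json
import urllib.request


def query_model(url, features):
    """Call the deployed model exactly like any other web service."""
    req = urllib.request.Request(
        url,
        data=json.dumps({"features": features}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```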

The Model-API-Client architecture is not the only way to separate the concerns of data science and engineering, but it serves to illustrate that you can draw a line between data science and DevOps without introducing extravagant overhead or building expensive end-to-end platforms.

By just establishing clear handoff points between functions in your ML pipeline, you can free data scientists up to do what they’re best at—data science.