Source: Deep Learning on Medium
What is MLOps? Why is it so important? How to do it right!
The company Tom works at wants to get rich, so his boss asks him to use his newly acquired knowledge of Machine Learning (ML), picked up in a 3-day workshop, to predict Apple’s stock price over the next 2 years. (This is not possible, so don’t try it! Give me a call if you succeed, though…)
He starts his ambitious project and does the following:
- He builds a web scraper to get all the data he needs and saves it locally on his machine.
- He has heard some good things about XGBoost, so he decides to use it.
- He starts some experiments to find the best parameters for his model.
- Tom and his boss are happy with the results and they decide to put the trading bot (with a budget of 200k) into production.
After a few weeks, the New York Times releases a report showing that Apple spied on all of its users, and Apple’s stock price plummets. The trading bot loses all of its money and Tom’s boss is outraged. “How could you not predict that?”, he asks Tom. Because Tom cannot answer this question, he gets replaced with David. David is an experienced ML Engineer and is shocked by the state of the project.
Tom did not use Git to version his code. All the data is on his local hard disk and he had never heard of CI/CD, so David needs to start from scratch. How can we make this a better experience for everyone this time?
The main goal of MLOps is to make the life of all ML Engineers in a project as pleasant as possible. Communication needs to be clear (Who did what and when?) and trying out new things and releasing them should be as easy as possible. In the next part, we will talk about the steps to ensure this.
1. Keeping track of everything
The main difference between traditional software and ML is that you don’t only have the code. You also have data, models, and experiments.
Writing traditional software is relatively straightforward, but in ML you need to try out a lot of different things to find the best and fastest model for your use case. You have a lot of different model types to choose from, and every single one of them has its own hyperparameters. Even if you work alone, this can get out of hand pretty quickly.
The data you train, evaluate and test on changes all the time. You need to make sure to keep track of which data you used for which experiment. This is a relatively new problem.
Versioning code, on the other hand, is a rather old problem, and we have developed good practices for doing it well.
Git + DVC
Every developer knows Git. You may not have heard of DVC (Data Version Control), though. It builds on top of Git, and the two together are a great combination for versioning your code and data. You can store your data in many places in the cloud: Amazon S3, Microsoft Azure Blob Storage, Google Drive and more are all supported. I would highly recommend their “Get Started” page if you want to dive straight into the code.
Git + DVC is all you need for versioning your code and data!
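To make the idea more concrete, here is a minimal sketch of how data can be versioned next to code: the large file stays outside Git, and a tiny pointer file with a content hash is what gets committed. As far as I understand it, DVC works along similar lines with its small metafiles, but this is not DVC's code or file format. The function names and the `.pointer.json` suffix are made up for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not have to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small pointer file next to the data (hypothetical format).

    The pointer is what you would commit to Git; the data itself can live
    in cloud storage.
    """
    pointer = {
        "path": data_path.name,
        "md5": file_md5(data_path),
        "size": data_path.stat().st_size,
    }
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path


def data_matches_pointer(data_path: Path, pointer_path: Path) -> bool:
    """Check whether the data on disk is the version the pointer describes."""
    pointer = json.loads(pointer_path.read_text())
    return file_md5(data_path) == pointer["md5"]
```

Because the pointer file is tiny and changes whenever the data changes, every Git commit implicitly records which version of the data it belongs to.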
After writing the code, getting the data and versioning them both, we can start with the fun ML stuff. Now we need to choose an ML algorithm and start the training. To avoid running the same experiment twice and to make the experiments transparent for everyone in the team, we need a good way to document them. This is where MLflow Tracking comes in.
You can use MLflow for other great things too, but I want to focus on its experiment tracking feature here. Basically, it is a web app with an API that you can deploy anywhere to make it your central place for recording and visualizing your experiments. Its user interface looks very intuitive as well.
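To show what experiment tracking boils down to, here is a minimal sketch that uses only the Python standard library. It is not the MLflow API. It records the parameters, the data version and the metrics of each run in a shared file, and it lets you check whether an experiment was already run. All function names and the JSON Lines layout are assumptions for illustration.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Optional


def experiment_key(params: dict, data_version: str) -> str:
    """Same parameters on the same data version give the same key."""
    payload = json.dumps({"params": params, "data": data_version}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def find_run(log_path: Path, key: str) -> Optional[dict]:
    """Return the earlier run with this key, or None if it was never run."""
    if not log_path.exists():
        return None
    with log_path.open() as handle:
        for line in handle:
            run = json.loads(line)
            if run["key"] == key:
                return run
    return None


def log_run(log_path: Path, params: dict, data_version: str, metrics: dict) -> dict:
    """Append one run to a JSON Lines file that the whole team can read."""
    run = {
        "key": experiment_key(params, data_version),
        "params": params,
        "data_version": data_version,
        "metrics": metrics,
        "logged_at": time.time(),
    }
    with log_path.open("a") as handle:
        handle.write(json.dumps(run) + "\n")
    return run
```

Before starting a training run you would call `find_run` with the key of the planned experiment and skip the run if it returns a result. A tracking server gives you the same record keeping plus a user interface on top.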
After doing the experiments and training the model, we have a ready-for-production ML model. This is great but not the end of the story. In the next part, I want to talk about the automatic build and deployment process for ML applications.
2. Build and Deployment Pipeline
This is also called CI/CD, which stands for Continuous Integration / Continuous Delivery.
The goal of CI/CD is to automatically build, test and safely deploy your application, so you can iterate quickly when developing new software.
Here I want to focus on CI/CD in the ML context. If you want to read about it in a more general sense, I can highly recommend this article.
ML is a microservice and Docker is awesome
A microservice is a software component that has the following properties:
- It does exactly one thing and does it well.
- It is stateless.
- It has a REST API for communication.
If you want an easy way to create a microservice, you should use Docker. It lets you containerize your application. This means that you can be sure that it runs exactly the same in every environment (there are some exceptions). You can think of it as a little VM for your application.
Your trained model will be at the heart of the container. You will get the model input through a REST API and the output would be the model prediction.
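Here is a minimal sketch of such a service, written with only the Python standard library so that it stays self-contained. The `predict` function is a stand-in for a trained model, and the `/predict` path, the `features` field and port 8000 are assumptions chosen for illustration. In a real project you would probably reach for a web framework instead.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features: list) -> float:
    """Stand-in for a trained model: returns the mean of the features."""
    return sum(features) / len(features)


def handle_predict(body: bytes) -> tuple:
    """Turn a JSON request body into (status code, JSON-serializable reply)."""
    try:
        features = json.loads(body)["features"]
        if not features or not all(isinstance(x, (int, float)) for x in features):
            raise ValueError("features must be a non-empty list of numbers")
        return 200, {"prediction": predict(features)}
    except (KeyError, TypeError, ValueError) as error:
        # Covers invalid JSON, a missing "features" field and bad values.
        return 400, {"error": str(error)}


class PredictionHandler(BaseHTTPRequestHandler):
    """Stateless handler: every request is answered from its own body only."""

    def do_POST(self) -> None:
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, reply = handle_predict(self.rfile.read(length))
        payload = json.dumps(reply).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Block forever, answering prediction requests on the given port."""
    HTTPServer(("0.0.0.0", port), PredictionHandler).serve_forever()
```

Calling `serve()` starts the service. Keeping the request logic in `handle_predict`, separate from the HTTP handler, makes it easy to unit test without opening a socket.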
When you are happy with your microservice, you can build it and push it to a Docker Image Registry. This is basically like GitHub for your microservices.
When building your microservice, you create a Docker image from a Dockerfile. The Dockerfile is a recipe that tells Docker exactly how to build your microservice from your code, and the resulting image is the template from which containers are started. This build step is where you can run the CI (continuous integration) part of the CI/CD paradigm.
Here you can check your code quality, run unit tests and measure your model’s prediction time. Only after these tests have passed do you build and push your Docker image. This gives you reasonable confidence that your microservice works correctly.
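The prediction time check can be sketched as a small function that a CI job calls before the image is built. This is only one way to do it: the function names, the use of the median and the time budget are my assumptions, and `predict` stands for whatever callable wraps your model.

```python
import statistics
import time
from typing import Callable, Sequence


def median_latency(predict: Callable, inputs: Sequence, repeats: int = 20) -> float:
    """Median wall-clock seconds per prediction over all inputs and repeats."""
    timings = []
    for _ in range(repeats):
        for item in inputs:
            start = time.perf_counter()
            predict(item)
            timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def check_latency(predict: Callable, inputs: Sequence, budget_seconds: float) -> float:
    """Raise AssertionError so the CI job fails when the model is too slow."""
    latency = median_latency(predict, inputs)
    if latency > budget_seconds:
        raise AssertionError(
            f"median latency {latency:.4f}s exceeds budget {budget_seconds:.4f}s"
        )
    return latency
```

The median is used here because a single slow call, for example the very first one, would otherwise dominate the result. Keep in mind that timings on a shared CI machine are noisy, so the budget should leave some headroom.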
One big problem that you will face here is that many ML models (especially Deep Learning models) need a GPU to run at a reasonable speed. This means that you will have to run your CI pipeline on a GPU-enabled VM in the cloud, which can get expensive. The only way around that is to make sure that your model also runs on the CPU (in many cases this is possible). You will need to evaluate this for your use case.
After you have made your changes and successfully built and uploaded your Docker image, it’s time to run your awesome new ML microservice. It does not matter where your application runs. It could be in the cloud, on a local machine or on the edge: the deployment process is always the same.
Running multiple microservices
Often in production, you have multiple microservices (containers) running together that also have to talk to each other. This is where you need a container orchestrator. Kubernetes is a great tool for doing this. In a few minutes, you can create a running Kubernetes cluster on the Google Cloud Platform or Azure.
To keep an eye on your application, you need a good monitoring system. In production, you want to make sure that your model predicts things that make sense. Your logging system should include information about the model input and the predicted output. There are common solutions that you can use for monitoring. One tool that I particularly liked is the ELK stack by Elastic. It has some very nice visualization tools and scales nicely. (This is probably overkill when you have a small microservices system.)
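Logging the model input and output can be as simple as writing one JSON line per prediction, a format that log collectors can usually parse. The sketch below assumes a hypothetical logger name, a model version string and an expected output range (`low`, `high`) that you would choose for your own use case. It is not tied to any particular monitoring stack.

```python
import json
import logging
import time

logger = logging.getLogger("prediction_service")  # hypothetical logger name


def log_prediction(
    model_version: str, features: list, prediction: float, low: float, high: float
) -> dict:
    """Log one prediction as a JSON line and flag values outside [low, high]."""
    entry = {
        "timestamp": time.time(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
        "in_expected_range": low <= prediction <= high,
    }
    # Out-of-range predictions are logged as warnings so they are easy to find.
    level = logging.INFO if entry["in_expected_range"] else logging.WARNING
    logger.log(level, json.dumps(entry))
    return entry
```

The range check is a very rough stand-in for "predicts things that make sense", but it is often enough to notice when a model starts to produce nonsense.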
Retraining our model
This is the last and the most interesting step when doing ML. With time, the dataset that we can train on grows. The model we created in this process could be improved by retraining it on this newly available data. This is also why it is so important to version our data. Without DVC, we would quickly lose track of which data each model was trained on.
MLOps is a very young field and established best practices do not exist yet. The approaches and tools I talked about here are only one way of doing things. I think this is a very interesting field right now, and I hope I could clearly show you my approach. If you have ideas on how to improve it, it would make me very happy if you left them in the comments below. We will only figure these things out together, as a community.
Thank you so much for reading!