Watch Me Scale: A Story of Productivity in Deep Learning with Metaflow and Kubernetes

The original article was published on Deep Learning on Medium.

❓What is Metaflow?

The Hierarchy of Needs in Data Science/Machine Learning/Artificial Intelligence.

During the last 6 months, I investigated the options available for versioning and scaling ML workflows. A common pattern emerged: solutions in the ML productivity space leverage abstractions around the layers of the hierarchy of needs shown in the figure above.

One of the most common approaches to solving such ML pipelining problems is to transform the workflow into a Directed Acyclic Graph (DAG) following the dataflow paradigm. Many of the workflow engines currently available, such as Airflow, Flyte, etc., are solving productivity problems for a lot of businesses. One of the biggest drawbacks of most of these technologies is their tight coupling with their computational backends, which makes them less useful for prototyping in research.
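The dataflow idea is simple to sketch in plain Python. The toy executor below is a conceptual illustration, not any particular engine's API; the step names and functions are invented for the example:

```python
# A minimal sketch of the dataflow paradigm: a workflow expressed as a
# Directed Acyclic Graph (DAG) whose nodes are steps and whose edges are
# data dependencies. Step names and logic here are hypothetical.

# Each step maps its name to (dependencies, function over upstream results).
dag = {
    "load":     ([], lambda deps: list(range(10))),
    "features": (["load"], lambda deps: [x * 2 for x in deps["load"]]),
    "train":    (["features"], lambda deps: sum(deps["features"])),
    "report":   (["train"], lambda deps: f"score={deps['train']}"),
}

def run(dag):
    """Execute steps in topological order, passing upstream outputs along."""
    results, done = {}, set()
    while len(done) < len(dag):
        for name, (deps, fn) in dag.items():
            if name not in done and all(d in done for d in deps):
                results[name] = fn({d: results[d] for d in deps})
                done.add(name)
    return results

print(run(dag)["report"])  # → score=90
```

A real engine adds exactly what this sketch lacks: scheduling steps on remote workers, persisting intermediate results, and retrying failures, which is where the coupling to a computational backend creeps in.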

In December of 2019, Netflix open-sourced its internal Data Science Python library, Metaflow. The library was built to reduce the time-to-market of Data Science initiatives at Netflix by tackling common pain points related to infrastructure and data engineering.

Building a data product requires a broad set of skills, and Data Scientists generally do not possess all of them. Metaflow provides a human-friendly way of bridging most of these gaps.

An example of how Metaflow represents a given data flow graph as code (Metaflow Website)

Metaflow offers a simple, Pythonic way of defining distributed computation steps as a DAG, with common data shared across steps. This enables the parallelization of experiments pertaining to the model. Distributed parallelization works by running the individual steps in Dockerized environments with an easy specification of resources. Metaflow persists data across steps, which reduces recomputation cost and enables quick prototyping and analytics. It also provides a way to quickly access results on completion of every workflow. The most powerful feature of Metaflow is that the library is self-contained: the same source code runs on a local machine or on a Dockerized distributed computation setup. The library’s design choices around packaging isolated computation make it extremely adaptable to new compute and data-warehousing abstractions.

After reading the library’s core and understanding its flexibility, I WAS SOLD. Even though Metaflow was open-sourced with AWS support, I realized that the library’s design made it extremely simple to integrate into other computing environments for quick scale-out. The flexible code structure of this library got me thinking about exploring two of my favorite areas of computer science at the same time: large-scale cluster computing and artificial intelligence. Over the last 4 months, I prototyped a plugin for Metaflow to support Kubernetes as a compute engine and, with the help of this plugin, built a robotics project using a well-known simulator. The robotics project leveraged Metaflow as a means to log metrics, craft different experiments, and scale out computation.

If you are unaware of what Kubernetes is, please don’t worry. Kubernetes, in the most simplistic sense, is a platform that helps you deploy, configure, and scale different applications across a cluster of servers by leveraging Docker containers. It grew out of Google’s internal cluster manager, Borg, which dates back to 2003, and was finally open-sourced in 2014.

Two major reasons inspired me to build a Kubernetes plugin for Metaflow.

Containerization for Reproducibility

The Ultimate Developer Excuse

Metaflow leverages the power of Docker containers to run isolated execution steps in a distributed fashion. Before this project, the library only supported Docker containers when running on AWS Batch. Docker support for local runs is really important because it avoids the problem highlighted in the figure shown to the left. The current Kubernetes support in Metaflow makes production workflows reproducible in local tests.

With the help of Minikube (a local, single-node Kubernetes cluster), one can prototype workflows on a local machine that mimic production architectures. Local prototyping can be done even for GPU workloads. Efficiently prototyping flows through Docker containers on local machines makes a huge productivity difference when building large flows with different dependencies.

Horrors of Vendor Lock-In

One major drawback of Metaflow was its AWS coupling: anyone using the library’s capabilities is locked into AWS’s solution. For a business, this is termed vendor lock-in — a scenario where a business is restricted from making changes to its system due to tight coupling with certain vendors (e.g., cloud services like AWS, Digital Ocean, etc.) — and it is dangerous. Four years of working for a growing startup in India teaches you how nasty the costs can get when you are vendor locked in (THANKS, AWS!). With open source, moving systems around becomes much simpler and, in turn, hurts developer productivity less. By integrating with an open-source container orchestration framework, this plugin enables Metaflow to operate independently of any cloud platform.

The next section explains how Metaflow works and how I integrated Kubernetes.