Jupyter Is Ready for Production, As Is



Introducing Kubeflow

Kubeflow is an open-source project dedicated to making deployments of ML workflows simple, portable, and scalable. From the documentation:

The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.

But how do we get started? Do we need a Kubernetes cluster? Should we deploy the whole thing ourselves? I mean, have you looked at the Kubeflow manifests repo?

All we need to run a notebook on Kubeflow is a Google Cloud Platform (GCP) account and our good old ipynb notebook file!

Moreover, say we have Kubeflow up and running; how do we transform our Notebook into Kubeflow Pipelines (KFP)? Do we have to build Docker images? Have you seen the KFP DSL? I thought the whole point was to eliminate the boilerplate code.

Well, I have good news; the answer is still the same: a GCP account and our good old ipynb notebook file!

Deploying Kubeflow

I am going to keep this simple without dumbing it down. The truth is that it’s effortless to get a single-node instance of Kubeflow running in minutes. All we need is a GCP account and the ability to deploy applications from the Marketplace. We’re going to use MiniKF!

MiniKF deployment — Image by Author
  1. Go to your GCP console
  2. Search for Marketplace and then locate MiniKF
  3. Click on it and choose Launch
  4. Set the VM configuration (I usually change the data disk to a Standard Persistent Disk because of my quota) and click Deploy

That’s it! The deployment takes up to ten minutes, and you can watch the progress by following the on-screen instructions: SSH into the machine, run minikf in the terminal, and wait until your endpoint and credentials are ready.

Provision of MiniKF completed — Image by Author

Now we can visit the Kubeflow dashboard. Click on the URL, enter your credentials, and you’re ready to go!

Kubeflow dashboard — Image by Author

Running a Jupyter Server

To run our experiment, we need a Jupyter Notebook instance. Creating one is relatively easy in Kubeflow: we first create a Jupyter server and then connect to it. Let’s do that:

  1. Choose Notebooks from the left panel
  2. Click the New Server button
  3. Fill in a name for the server and request the amount of CPU and RAM you need
  4. Leave the Jupyter Notebook image as is; this is crucial for this tutorial (jupyter-kale:v0.5.0-47-g2427cc9; note that the image tag may differ)
Creating a Jupyter Server — Image by Author

After completing these four steps, wait for the Notebook Server to become ready, then connect to it. You’ll be transferred to your familiar JupyterLab workspace.

Jupyter to ML Pipelines

So, why did we do all this setup? The goal was to transform our Notebook into a production-ready ML pipeline. How can we do that?

In this example, we will use the well-known Titanic dataset to demonstrate a simple workflow we can follow. First, create a new terminal in the JupyterLab environment and clone the example.

git clone https://github.com/dpoulopoulos/medium.git

If you haven’t created a terminal in JupyterLab before, note that JupyterLab terminals provide full support for system shells (bash, tcsh, etc.) on Mac/Linux and PowerShell on Windows. You can run anything in your system shell from a terminal, including programs such as vim or emacs, so you can also use it to clone any repo from GitHub.

Cloning the Titanic example in JupyterLab — Image by Author

After cloning the repo, you can find the Titanic example in medium > minikf > titanic.ipynb. You can spend some time going over the Notebook, but there is a crucial step you need to run first: uncomment the first code cell and run it to install the necessary dependencies into your environment.

!pip install --user -r requirements.txt

After running this cell, restart the kernel, and you’re ready to go. If you check the left panel of the Notebook, you’ll see a purple icon. This is where the fun begins… Press it to enable the Kale extension. You will see that every cell has automatically been annotated.

Enable the Kale extension — Image by Author
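
If you are curious where these annotations actually live, they are stored in the notebook file itself as cell metadata. The snippet below is one way to peek at them with the nbformat package; the exact shape of the metadata (often tags like block:<step-name>) depends on the Kale version, so treat the specifics as an assumption rather than a guarantee.

import nbformat

# Read the annotated notebook; version 4 is the current nbformat schema.
nb = nbformat.read("titanic.ipynb", as_version=4)

# Print whatever metadata each cell carries. You should recognize Kale's
# step annotations in here (often as tags such as block:<step-name>, but
# the exact format is an assumption and may differ between Kale versions).
for index, cell in enumerate(nb.cells):
    metadata = dict(cell.get("metadata", {}))
    if metadata:
        print(index, metadata)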

You can see that the Notebook comes in sections: the imports, the data loading part, data processing, model training and evaluation, and so on. These sections are precisely what we have annotated with Kale. Now, this Notebook comes pre-annotated, but you can play around; you can create new pipeline steps, just don’t forget to add their dependencies.
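
To make that concrete, here is a rough sketch of the kind of cells you will find in such a notebook. The file path, column choices, and classifier are illustrative only, not a copy of the example’s code, and they assume pandas and scikit-learn are available in the image.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Data loading step: the path is illustrative only.
data = pd.read_csv("train.csv")

# Data processing step: keep a few numeric columns and drop missing values.
features = data[["Pclass", "Age", "Fare", "Survived"]].dropna()
X_train, X_test, y_train, y_test = train_test_split(
    features[["Pclass", "Age", "Fare"]], features["Survived"], random_state=42
)

# Model training step.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluation step.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

Each of these blocks maps naturally onto one Kale annotation, and thus onto one pipeline step.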

In any case, you can just hit the COMPILE AND RUN button located at the bottom of the Kale Deployment Panel. Without writing a single line of code, your Notebook will be transformed into a Kubeflow Pipeline, which will be executed as part of a new experiment.
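
For contrast, here is roughly what you would otherwise write by hand with the KFP SDK (v1-style) for even a toy two-step pipeline; the step functions, names, and base image below are made up for illustration.

import kfp
from kfp import dsl
from kfp.components import create_component_from_func

# Each step has to become a self-contained function packaged into a container.
def load_data() -> str:
    return "loaded"

def train_model(message: str) -> None:
    print("training after:", message)

load_data_op = create_component_from_func(load_data, base_image="python:3.8")
train_model_op = create_component_from_func(train_model, base_image="python:3.8")

# A pipeline definition wiring the two steps together.
@dsl.pipeline(name="titanic-by-hand", description="Hand-written toy pipeline")
def titanic_pipeline():
    load_task = load_data_op()
    train_model_op(load_task.output)

# Submit a run, e.g. from inside the notebook server, where the client
# picks up the in-cluster endpoint automatically.
kfp.Client().create_run_from_pipeline_func(titanic_pipeline, arguments={})

Kale generates the equivalent of this, plus the code that passes data between steps, directly from the cell annotations.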

From Jupyter Notebooks to Kubeflow Pipelines — Image by Author

Follow the link provided by Kale to watch the running experiment. After a few minutes, the pipeline will complete its task successfully. Here is the final view of the graph (don’t forget to toggle the Simplify Graph option on the top left):

The Titanic Pipeline — Image by Author
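
If you prefer to check on the run from code rather than the UI, the KFP SDK client that ships in the notebook image can list recent runs. A minimal sketch, assuming an in-cluster, v1-style kfp client:

import kfp

# Inside the notebook server the client picks up the in-cluster KFP endpoint.
client = kfp.Client()

# Print the name and status of the most recent runs.
response = client.list_runs(page_size=5, sort_by="created_at desc")
for run in response.runs or []:
    print(run.name, run.status)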

Congratulations! You have just turned your Notebook into a Pipeline without writing a single line of code, and most importantly, without deviating from your routine procedures.

Conclusion

Of course, there is more work you can do on this dataset: you can analyze it further, add cross features, or train different classifiers. As a matter of fact, in a future story, we will see how to run hyperparameter tuning without adding any extra lines of code. But achieving the best accuracy on the dataset is not the point of this article.

We saw how we can launch a single-node Kubeflow instance, create a notebook server, and convert a simple Jupyter Notebook into a Kubeflow pipeline, all without writing any boilerplate code. So, go ahead: implement your own ideas and turn them into ML pipelines with one click! And don’t forget to stop your instance when you’re done, to avoid accumulating costs!

About the Author

My name is Dimitris Poulopoulos and I’m a machine learning engineer working for Arrikto. I have worked on designing and implementing AI and software solutions for major clients such as the European Commission, Eurostat, IMF, the European Central Bank, OECD, and IKEA.

If you are interested in reading more posts about Machine Learning, Deep Learning, Data Science, and DataOps, follow me on Medium, LinkedIn, or @james2pl on Twitter.

Opinions expressed are solely my own and do not express the views or opinions of my employer.