Polyaxon v0.4.2: Kubeflow, Horovod, and MPI integrations

Source: Deep Learning on Medium


By Mourad

Polyaxon is a platform for managing the whole life cycle of machine learning (ML) and deep learning (DL).

Kubeflow Integration

Today, we are pleased to announce the 0.4.2 release. This release not only brings new functionality to the platform, it also answers a question that we get asked often: what is the difference between Polyaxon and Kubeflow?

Kubeflow is an open, community-driven project to make it easy to deploy and manage an ML stack on Kubernetes.

Polyaxon now supports Kubeflow’s operators. We believe that the two platforms complement each other, and we are working to make integration with this community effort as painless as possible. With this integration, users can schedule distributed experiments either with our native behaviour or with Kubeflow’s operators, i.e. TFJob, PyTorchJob, and MPIJob.

At the moment, Polyaxon supports and simplifies distributed training in native mode with the following frameworks: TensorFlow, MXNet, PyTorch, and Horovod. With this new integration, users will also be able to run distributed experiments using Kubeflow as a backend, with very minimal changes.

To start a distributed experiment on Kubeflow, users only need to set one extra parameter, backend, in their polyaxonfiles. For example:

version: 1
kind: experiment
backend: kubeflow
framework: tensorflow
...

This change tells Polyaxon to schedule a TFJob instead of a native Polyaxon experiment. Once the experiment is started by the TFJob operator, the user will be able to track, monitor, run TensorBoard, compare experiments, search, and visualize the experiment on Polyaxon’s dashboard and see it in the Polyaxon CLI.
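To make this routing concrete, here is a minimal Python sketch of how the backend and framework fields could determine which kind of job gets scheduled. The function name `resolve_job_kind` and the mapping table are hypothetical illustrations built from the operators named in this post; this is not Polyaxon's actual scheduler code.

```python
# Hypothetical sketch: not Polyaxon's real scheduler logic.
# Maps the `backend` and `framework` fields of a parsed polyaxonfile
# to the kind of job that would be scheduled.

# Assumed mapping from framework to Kubeflow operator job kind.
KUBEFLOW_JOB_KINDS = {
    "tensorflow": "TFJob",
    "pytorch": "PyTorchJob",
}


def resolve_job_kind(spec: dict) -> str:
    """Return the job kind for a polyaxonfile that was loaded into a dict."""
    # Assumption: a missing `backend` field means the native behaviour.
    backend = spec.get("backend", "native")
    framework = spec.get("framework")

    if backend == "mpi":
        return "MPIJob"
    if backend == "kubeflow":
        if framework not in KUBEFLOW_JOB_KINDS:
            raise ValueError(f"No Kubeflow operator assumed for framework: {framework!r}")
        return KUBEFLOW_JOB_KINDS[framework]
    if backend == "native":
        return "native Polyaxon experiment"
    raise ValueError(f"Unknown backend: {backend!r}")


example = {"version": 1, "kind": "experiment", "backend": "kubeflow", "framework": "tensorflow"}
print(resolve_job_kind(example))  # TFJob
```

Keeping the choice in a single field means the rest of the polyaxonfile can stay the same when switching between the native behaviour and an operator.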

Helm charts for TFJob, PyTorchJob, and MPIJob

In addition to this new integration, and after checking with the Kubeflow community, we are releasing a set of Helm charts for these operators, which we will maintain. We are also distributing them on https://charts.polyaxon.com for ease of deployment.

Teams already using Kubeflow will not need to deploy these charts (unless the version they are using is not supported yet or is no longer supported), since Polyaxon can schedule experiments on a Kubeflow instance that is already deployed. These charts are meant to be the minimum requirement for users who want to try the integration and want a deployment process similar to deploying Polyaxon with Helm.

MPI as a backend for distributed experiments

Many ML/DL practitioners use NVIDIA GPUs to run experiments on Polyaxon. The MPI operator integration gives them an additional option: all-reduce-style distributed training using MPIJob.

To use MPI as a backend, users need to install the MPIJob operator. They can use the Helm chart produced by Polyaxon:

helm install polyaxon/mpijob --name=plxmpi --namespace=polyaxon

To delete it:

helm del plxmpi --purge

Users then need to specify mpi as the backend in their polyaxonfiles, e.g.

version: 1
kind: experiment
backend: mpi
....

Horovod native distributed experiments

In addition to the MPI backend, which depends on the MPIJob integration, we are also releasing a native Horovod behaviour for starting distributed experiments:

version: 1
kind: experiment
framework: horovod

Better schema validation for polyaxonfiles

This version also brings an enhancement that several users were asking for: better validation of nested fields in polyaxonfiles, i.e. detecting invalid fields or issues in subsections.
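As an illustration of what nested validation means in practice, the following Python sketch walks a config recursively and reports unknown fields and wrong types with their full path. The schema, the section names, and the `validate` function are assumptions made for this example; they do not reproduce Polyaxon's real schema or validation code.

```python
# Hypothetical sketch of nested-field validation. The schema below is an
# illustrative assumption, not Polyaxon's real polyaxonfile schema.

SCHEMA = {
    "version": int,
    "kind": str,
    "backend": str,
    "framework": str,
    "environment": {  # nested section, assumed for illustration
        "resources": {
            "gpu": int,
        },
    },
}


def validate(config: dict, schema: dict, path: str = "") -> list:
    """Collect errors for unknown fields and wrong types, at any depth."""
    errors = []
    for key, value in config.items():
        location = f"{path}.{key}" if path else key
        if key not in schema:
            errors.append(f"{location}: unknown field")
            continue
        expected = schema[key]
        if isinstance(expected, dict):
            # A dict in the schema marks a subsection: recurse into it.
            if isinstance(value, dict):
                errors.extend(validate(value, expected, location))
            else:
                errors.append(f"{location}: expected a section")
        elif not isinstance(value, expected):
            errors.append(f"{location}: expected {expected.__name__}")
    return errors


config = {
    "version": 1,
    "kind": "experiment",
    "environment": {"resources": {"gpus": 2}},  # typo: "gpus" instead of "gpu"
}
print(validate(config, SCHEMA))  # ['environment.resources.gpus: unknown field']
```

Reporting the dotted path to the offending field is what makes a mistake in a deep subsection easy to locate.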

Polyaxon deploy is now in public beta

The Polyaxon CLI can be used to check, deploy, upgrade, and tear down a Polyaxon deployment, and can be used with any underlying infrastructure: Kubernetes, Docker, docker-compose, …

Often, when our Helm chart gets updated, some fields disappear or change type. When these changes go undetected, users start noticing undesired behaviour after upgrading, and it is very hard to find the cause and fix it.

Using the CLI, users can now validate their deployment config file before installing or upgrading Polyaxon:

polyaxon deploy -f config.yaml --check

To deploy

polyaxon deploy -f config.yaml

To upgrade

polyaxon deploy -f config.yaml --upgrade

To tear down

polyaxon teardown

The teardown will also prompt users to decide whether to execute the post-delete cleanup hooks. In general, if the deployment was invalid from the start, users can opt out of these hooks to avoid failing jobs.

This feature is in beta, and users can always deploy Polyaxon using Helm if they prefer. We will start encouraging users to use the CLI, because we will be rolling out several enhancements to it: alerts about upgrades that require migrations, data migrations, the possibility to back up the database, …
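For teams that want to script the check-then-deploy flow described in this section, here is a minimal Python sketch. It uses only the commands and flags shown in this post (`-f`, `--check`, `--upgrade`). The helper names `deploy_commands` and `run_deploy` are hypothetical, and the sketch assumes that a failed `--check` exits with a non-zero status, which this post does not confirm.

```python
# Hypothetical wrapper around the CLI commands shown in this post.
# Assumption: `polyaxon deploy --check` exits non-zero on an invalid config.
import subprocess


def deploy_commands(config_path: str, upgrade: bool = False) -> list:
    """Build the argv lists: validate first, then deploy or upgrade."""
    check = ["polyaxon", "deploy", "-f", config_path, "--check"]
    apply = ["polyaxon", "deploy", "-f", config_path]
    if upgrade:
        apply.append("--upgrade")
    return [check, apply]


def run_deploy(config_path: str, upgrade: bool = False) -> None:
    """Run each command in order and stop at the first failure."""
    for command in deploy_commands(config_path, upgrade):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a failed validation prevents the deploy step from running.
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands rather than executing them, so the sketch is safe to run.
    for command in deploy_commands("config.yaml", upgrade=True):
        print(" ".join(command))
```

Building the commands separately from running them keeps the sketch easy to inspect before anything touches a cluster.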

Other improvements and bug fixes

We fixed several issues and regressions. One notable issue was related to downloading outputs/artifacts when using GCS as a storage backend. We also added the possibility to download individual files from the UI.

Another issue was related to pod initialization and other pending states that were not correctly detected in some instances. This was particularly confusing when running hyperparameter experiment groups: users would see fewer running experiments than the specified concurrency. The number of scheduled experiments was correct, but the dashboard did not reflect it because the statuses were reported as an unknown condition.

Experiment Jobs information

Experiment jobs in Polyaxon represent the underlying pods running on Kubernetes. They are now clickable, and users can view not only the aggregated experiment statuses but also the statuses of each individual pod, as well as the logs of those pods (other information will surface in upcoming releases). This type of information is useful when running distributed experiments.

Conclusion

This release also brings several other UI and internal fixes and improvements.

In the next release, we will be pushing several other UI enhancements, notably the possibility to start, restart, and resume experiments/jobs/builds/notebooks/tensorboards directly from the dashboard. We also made several fixes to our Machine Learning CI tool, and we will be releasing a public beta with documentation in the next release or the one after.

Polyaxon will keep improving and providing the simplest machine learning layer on Kubernetes. We hope that these updates will improve your workflows and increase your productivity, and again, thank you for your continued feedback and support.