H2O AutoML + Big Data Processing with Apache Spark

Original article was published by Jamshed Khan on Deep Learning on Medium

Big data and machine learning, while two separate concepts, remain interwoven in many aspects. The ability to process vast piles of data for machine learning tasks is a requirement of the field.

Apache Spark is a well-established framework for large-scale data processing, enabling you to work with a range of big data problems. Apart from supporting distributed cluster computing from languages such as Java, Scala, and Python, Spark offers a variety of ML capabilities via its native libraries. However, its selling point remains its potential for ETL processing with large-scale datasets.

H2O, on the other hand, is an open-source machine learning platform centered on scalability. Designed to work with distributed data, it integrates with big data frameworks such as Hadoop and Spark to build more efficient ML models.

H2O provides a range of supervised and unsupervised algorithms and an easy-to-use, browser-based notebook interface called Flow. H2O.ai was one of the first to introduce automatic model selection and was named among the top 3 AI and ML solution providers in 2018.

Best of Both Worlds: Sparkling Water

Sparkling Water combines the vast H2O machine learning toolkit with the data processing capabilities of Spark.

It’s an ideal solution for users who need to manage large data clusters and want to transfer data between Spark and H2O. By consolidating these two open-source frameworks, users can query big datasets using Spark SQL, feed the results into an H2O cluster to build a model and make predictions, and then reuse the results in Spark.
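The round trip described above can be sketched with PySparkling, the Python interface to Sparkling Water. This is a minimal sketch, not a definitive implementation: the function name `score_with_h2o` and its parameters are hypothetical, the choice of a GBM is arbitrary, and the conversion methods shown (`asH2OFrame`, `asSparkFrame`) are those of recent Sparkling Water releases; older releases use different names and expect the Spark session to be passed to `H2OContext.getOrCreate`.

```python
def score_with_h2o(spark, sql_query, label_col):
    """Query with Spark SQL, train in H2O, return predictions as a Spark DataFrame.

    `spark` is an existing SparkSession; `sql_query` and `label_col` are
    placeholders supplied by the caller (hypothetical names for illustration).
    """
    # Imports are local so the function can be defined without Spark or H2O installed.
    from pysparkling import H2OContext
    from h2o.estimators import H2OGradientBoostingEstimator

    # Start (or attach to) an H2O cluster running alongside Spark.
    hc = H2OContext.getOrCreate()

    # 1. Query the big dataset with Spark SQL.
    spark_df = spark.sql(sql_query)

    # 2. Feed the result into H2O and build a model there.
    h2o_frame = hc.asH2OFrame(spark_df)
    features = [c for c in h2o_frame.columns if c != label_col]
    model = H2OGradientBoostingEstimator(ntrees=50, seed=1)
    model.train(x=features, y=label_col, training_frame=h2o_frame)

    # 3. Make predictions in H2O and hand them back to Spark for reuse.
    predictions = model.predict(h2o_frame)
    return hc.asSparkFrame(predictions)
```

A call might look like `score_with_h2o(spark, "SELECT * FROM flights", "delayed")`, where the table and column names are made up for the example.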

The end game here is deploying much more advanced machine learning algorithms with the existing Spark implementation. Results from the H2O pipeline can easily be deployed independently or within Spark, thus offering even more flexibility.

Credit: H2O.ai

Automating ML with H2O

The process of automating machine learning, referred to as AutoML, is now a standard feature across various platforms such as Azure and Google Cloud. With AutoML, several steps in an end-to-end ML pipeline can be handled with minimal human intervention, ideally without hurting the model's performance.

Some of the steps where AutoML proves useful are data preprocessing tasks (augmentation, standardization, feature selection, etc.), automatic generation of various models (random forests, GBMs, etc.), and deployment of the best of the generated models.

AutoML is a function of H2O that automates the process of building a large number of models, with the goal of finding the “best” model without any prior knowledge or effort by the data scientist.
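As a rough illustration of how little code an AutoML run needs, here is a minimal sketch using the H2O Python client. The wrapper `run_automl`, its parameters, and the default limits are assumptions made for the example, and the arguments `H2OAutoML` accepts vary somewhat between H2O versions.

```python
def run_automl(csv_path, target, max_models=10, seed=1):
    """Train a batch of models with H2O AutoML and return (leaderboard, best model).

    `csv_path` and `target` are placeholders for your own data file and
    response column (hypothetical names for illustration).
    """
    # Imports are local so the function can be defined without H2O installed.
    import h2o
    from h2o.automl import H2OAutoML

    # Start a local H2O cluster, or connect to one that is already running.
    h2o.init()

    # Load the data into an H2OFrame; every column except the target is a feature.
    frame = h2o.import_file(csv_path)
    features = [c for c in frame.columns if c != target]

    # Let AutoML train and cross-validate up to `max_models` base models.
    aml = H2OAutoML(max_models=max_models, seed=seed)
    aml.train(x=features, y=target, training_frame=frame)

    # The leaderboard ranks every trained model; `leader` is the top entry.
    return aml.leaderboard, aml.leader
```

For a classification task, the target column usually needs to be converted to a categorical first (for example with `frame[target].asfactor()`); otherwise H2O treats a numeric target as a regression problem.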

The current version of AutoML (in H2O 3.16) trains and cross-validates a default random forest, an extremely randomized forest, a random grid of gradient boosting machines (GBMs), a random grid of deep neural nets, and a fixed grid of GLMs, and then trains two stacked ensemble models at the end. One ensemble contains all the models (optimized for model performance), and the second contains just the best-performing model from each algorithm class/family (optimized for production use).
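The difference between the two ensembles comes down to which base models each one is allowed to stack. The sketch below shows only that membership selection in plain Python; it does not train or stack anything and is not H2O's own code. The model IDs and AUC scores are invented, and treating the random forest (DRF) and the extremely randomized forest (XRT) as separate families is an assumption made for the illustration.

```python
from typing import NamedTuple


class BaseModel(NamedTuple):
    model_id: str
    family: str
    auc: float  # cross-validated AUC; higher is better


def all_models_members(models):
    """Every base model goes into the 'all models' ensemble."""
    return [m.model_id for m in models]


def best_of_family_members(models):
    """Keep only the highest-scoring model from each algorithm family."""
    best = {}
    for m in models:
        if m.family not in best or m.auc > best[m.family].auc:
            best[m.family] = m
    return [m.model_id for m in best.values()]


# Made-up leaderboard, for illustration only.
leaderboard = [
    BaseModel("GBM_1", "GBM", 0.91),
    BaseModel("GBM_2", "GBM", 0.93),
    BaseModel("DRF_1", "DRF", 0.89),
    BaseModel("XRT_1", "XRT", 0.88),
    BaseModel("DL_1", "DeepLearning", 0.85),
    BaseModel("DL_2", "DeepLearning", 0.84),
    BaseModel("GLM_1", "GLM", 0.80),
]

print(all_models_members(leaderboard))      # all seven model IDs
print(best_of_family_members(leaderboard))  # one model ID per family
```

With fewer base models to score at prediction time, the best-of-family ensemble is lighter to serve, which is why it is described as the production-oriented option.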