Automation ML Engine By H2O.ai

Original article was published by Shubham Nagalwade on Deep Learning on Medium



Source: H2O.ai

Introduction

H2O is an open-source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data and provides easy productionalization of those models in an enterprise environment. H2O’s core code is written in Java. It provides several statistical and ML algorithms, including deep learning. The algorithms are implemented on top of H2O’s distributed Map/Reduce framework and utilize the Java Fork/Join framework for multi-threading.

H2O is used worldwide by more than 18000 organizations and interfaces well with R and Python for your ease of development.

H2O’s REST API allows access to all the capabilities of H2O from an external program or script via JSON over HTTP. The REST API is used by H2O’s web interface (Flow UI), R binding (H2O-R), and Python binding (H2O-Python).

H2O Architecture

This is where AutoML comes to your rescue. You just have to pick up the algorithm as per the application from its huge repository and apply it to your dataset. It contains the most widely used statistical and ML algorithms.

If you are a Python lover, you may use Jupyter or any other IDE of your choice, such as Google Colab or PyCharm, for developing H2O applications. We will also learn how to change the algorithm in your program code and compare its performance with the earlier one. H2O also provides a web-based tool, called Flow, to test different algorithms on your dataset.

Installation

H2O offers an R package that can be installed from CRAN and a Python package that can be installed from PyPI. In this article, I shall be working with only the Python implementation. You may also want to look at the documentation for complete details.

Some Dependencies:

pip install requests

pip install tabulate

pip install "colorama>=0.3.8"

pip install future

To check if everything is in place, open your Jupyter Notebooks and type in the following:

import h2o

h2o.init()

On executing the cell, some information will be printed on the screen in a tabular format, displaying, among other things, the number of nodes, total memory, Python version, cluster status, cluster version, allowed cluster cores, etc.

# Allocate resources
h2o.init(nthreads=4, max_mem_size="8G")

Requirements

At a minimum, we recommend the following for compatibility with H2O:

Windows 7 or later, OS X 10.9 or later, Ubuntu 12.04, RHEL/CentOS 6 or later.

Scala, R, and Python are not required to use H2O unless you want to use H2O in those environments, but Java is always required. Supported Java versions are 8 through 13. Building H2O or running H2O tests requires a 64-bit JDK; running the H2O binary from the command line, R, or Python requires only a 64-bit JRE. Supported language versions are Scala 2.10 or later, R version 3 or later, and Python 2.7.x, 3.5.x, or 3.6.x.

An internet browser is required to use H2O’s web UI, Flow. Supported versions include the latest version of Chrome, Firefox, Safari, or Internet Explorer.

Read Dataset in H2O:

H2O supports many common data importing formats like Local File System, Remote File, SQL, S3, HDFS, JDBC, and Hive.

Supported Algorithms:

Distributed Random Forest, Linear Regression, Logistic Regression, XGBoost, Gradient Boosting Machine, Deep Learning (single- or multi-layered perceptron with back-propagation and stochastic gradient descent), K-means Clustering, Principal Component Analysis (PCA), Naïve Bayes, Support Vector Machine (SVM), and word2vec. It also supports stacking and ensemble methods, so most widely used ML algorithms are covered.

Data Manipulation in H2O Frames (like pandas and R):

Supported data manipulations in H2O Frames are Combining Columns from Two Datasets, Combining Rows from Two Datasets, Fill NaN (Null Values), Group By, Imputing Data, Merging Two Datasets, Pivoting Tables, Replacing Values in a Frame, Slicing Columns, Slicing Rows, Sorting Columns, Splitting Datasets into Training/Testing/Validation, and Target Encoding.
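To make the group-by operation concrete, here is a rough plain-Python sketch of what a grouped aggregation computes (toy data and values, not H2O’s actual API):

```python
from collections import defaultdict

# Toy rows of (species, petal_length) -- hypothetical values for illustration.
rows = [
    ("setosa", 1.0), ("setosa", 2.0),
    ("virginica", 5.0), ("virginica", 6.0),
]

# Collect values by key, then aggregate -- the same idea a group-by expresses.
groups = defaultdict(list)
for species, petal_length in rows:
    groups[species].append(petal_length)

means = {k: sum(v) / len(v) for k, v in groups.items()}
print(means)  # {'setosa': 1.5, 'virginica': 5.5}
```

In H2O the equivalent aggregation runs distributed across the cluster, but the grouping semantics are the same.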

Supported metrics in H2O:

Metrics are auto-detected based on the type of machine learning problem we are dealing with (Regression or classification).

For Regression-based problems:

R2 Score, Adjusted R2 Score, RMSE (Root Mean Square Error), MSE (Mean Square Error), MAE (Mean Absolute Error), RSE (Residual Standard Error).
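These metrics can be spot-checked by hand. The following plain-Python sketch (toy numbers, not H2O code) shows how MSE, RMSE, MAE, and the R2 score are computed:

```python
import math

# Toy predictions versus ground truth -- illustrative values only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mse = sum(e * e for e in errors) / n   # Mean Square Error
rmse = math.sqrt(mse)                  # Root Mean Square Error
mae = sum(abs(e) for e in errors) / n  # Mean Absolute Error

# R2: 1 minus the ratio of residual to total sum of squares.
mean_y = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, mae, r2)
```

In H2O these values come back automatically from a trained model’s performance report.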

For Classification Problems:

AUC and ROC, Accuracy, F1 Score, Confusion Matrix.
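As a quick illustration (plain Python with toy labels, not H2O’s API), accuracy and the F1 score follow directly from the confusion-matrix counts:

```python
# Toy binary labels -- illustrative values only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Confusion-matrix cells: true/false positives and negatives.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, f1)
```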

H2O’s AutoML feature:

H2O has an automated option for finding the best model for any given dataset. The AutoML feature has a dependency on the pandas module, which must be installed. It can be invoked by simply calling H2OAutoML(). The results can be viewed in a TensorBoard-like dashboard.

import h2o
from h2o.automl import H2OAutoML

Initialize H2O using the following command:

h2o.init()
Initializing H2O.

The initialization output shows details of the H2O cluster, including the available memory.

Productionizing:

At the end of a modeling phase, H2O provides functionality to save the models as POJOs or MOJOs.

POJO: Plain Old Java Object

MOJO: Model Object, Optimized

These objects can be used to make predictions in any production environment with Java installed, by writing wrapper classes over them.

Tutorial

import pandas as pd
import collections
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
import h2o
from h2o.automl import H2OAutoML
from sklearn.model_selection import train_test_split

Read the Data

data_path = "iris.csv"
iris_df = h2o.import_file(path=data_path)

Preparing Dataset

features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
output = 'species'
# splitting: train_size=80%, test_size=20%
train, test = iris_df.split_frame(ratios=[0.8])
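Note that split_frame assigns rows at random, so the 80/20 ratio is approximate rather than exact. A minimal plain-Python sketch of the same idea (illustrative only, not H2O’s implementation):

```python
import random

random.seed(1)  # fixed seed for reproducibility
rows = list(range(150))  # stand-in for the 150 iris rows

# Assign each row to train with probability 0.8, akin to split_frame(ratios=[0.8]).
train_rows = [r for r in rows if random.random() < 0.8]
test_rows = [r for r in rows if r not in train_rows]

print(len(train_rows), len(test_rows))  # roughly 120 / 30; exact counts vary
```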

Applying AutoML

AutoML will run for a fixed amount of time set by us and give us the optimized models. We set up the AutoML object with the following command:

from h2o.automl import H2OAutoML
aml = H2OAutoML(max_models=30, max_runtime_secs=300, seed=1)

Where,

The max_models parameter specifies the maximum number of models that we want to evaluate and compare.

The max_runtime_secs parameter specifies the time for which the algorithm runs, and seed makes the run reproducible.

We now call the training method on the AutoML object with the following command:

aml.train(x = features, y = output, training_frame = train)

We pass x as the features array, y as the output (target) variable to be predicted, and training_frame as the training dataset.

Run the code; you will have to wait up to 5 minutes (we set max_runtime_secs to 300) until you get the following output:

Apply AutoML

Printing A Leaderboard

When the AutoML processing completes, it creates a leaderboard ranking all the models it has evaluated (up to the 30 we requested). To see the first 10 records of the leaderboard, use the following code:

lb = aml.leaderboard
lb.head()
Leaderboard of top-10 models.