Grid Search in H2o

Original article was published on Artificial Intelligence on Medium

Implementation of Cartesian and Random Grid search in H2o.ai


What’s the need?

Every ML model has some set of parameters (for example, coefficients in logistic regression) that it learns from the data and some hyperparameters that we must pass to the model before training. Some examples of hyperparameters include:

  • Random Forest : no_of_trees, max_depth
  • Support Vector Machines: gamma, rank_ratio
  • K-means: k, max_iterations

While parameters are something whose values we derive out of the given dataset automatically, model hyperparameters are set manually (and often tuned). It is important that we select good hyperparameter values for each model in order to effectively use them.

An optimal combination of hyperparameters maximizes a model’s performance without leading to a high variance problem (overfitting).

Grid search is a hyperparameter optimization technique, i.e. a technique for finding the set of hyperparameters that results in the ‘best’ model. ‘Best’ can mean different things depending on the context: for instance, the most accurate model for a classification problem, or the model with the lowest RMSE for a regression problem.

Grid search and its types

A grid search uses brute force to train a model for every combination of hyperparameter values. For instance, if you have three hyperparameters H1, H2, and H3, and they can take on 20, 30, and 15 values, respectively, your grid will contain a total of 20 * 30 * 15 = 9000 models.
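To make the combinatorics concrete, here is a minimal sketch that counts the grid both by multiplying the list sizes and by actually enumerating every combination. The names H1, H2, and H3 come from the example above; the integer placeholder values are made up purely for illustration.

```python
from itertools import product
from math import prod

# Number of candidate values per hyperparameter (from the example above).
sizes = {"H1": 20, "H2": 30, "H3": 15}

# Size of the full grid: the product of the individual list lengths.
grid_size = prod(sizes.values())

# The same count obtained by enumerating every combination explicitly.
# range(n) stands in for n placeholder values (illustrative only).
enumerated = sum(1 for _ in product(*(range(n) for n in sizes.values())))

print(grid_size, enumerated)
```

The count grows multiplicatively, so adding even one more hyperparameter with a handful of values multiplies the number of models to train.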

In this article, we will be discussing two main kinds of Grid search that H2o supports: Cartesian Grid Search and Random Grid Search.

  • Cartesian Grid Search: The default grid search method in H2o. It exhaustively searches over all possible combinations of the hyperparameters. If the search space is small, this should be your method of choice.
  • Random Grid Search: As the name suggests, random combinations of hyperparameters (sampled uniformly from the set of all possible hyperparameter value combinations) are tested instead of exhaustively testing every combination. A stopping criterion specifies when the random search must stop. If your search space is large, this should be your method of choice.
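The difference between the two strategies can be sketched in plain Python, without H2O. This is only an illustration of the idea, not H2O's implementation: the search space values are hypothetical, the helper names `cartesian` and `random_discrete` are made up, and the sketch assumes random search draws combinations without repeats.

```python
import random
from itertools import product

# Hypothetical search space, for illustration only.
space = {
    "alpha": [0.0, 0.25, 0.5, 0.75, 1.0],
    "max_iterations": [25, 50, 100],
}

def cartesian(space):
    """Return every combination of hyperparameter values."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]

def random_discrete(space, max_models, seed=None):
    """Return up to max_models combinations, drawn uniformly without repeats."""
    combos = cartesian(space)
    rng = random.Random(seed)
    return rng.sample(combos, min(max_models, len(combos)))

all_combos = cartesian(space)                                # exhaustive
some_combos = random_discrete(space, max_models=5, seed=1)   # budgeted
print(len(all_combos), len(some_combos))
```

Here `max_models` plays the role of the stopping criterion: the random strategy trains a fixed budget of models regardless of how large the full grid is.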

While the Cartesian grid search is guaranteed to find the best combination of hyperparameters within the grid you define, it often comes with a high computational cost. Random grid search, on the other hand, reduces computation time but might miss the best combination.

Much like the bias-variance trade-off, there is a classic speed-accuracy trade-off involved when it comes to choosing the best grid search method.

Let’s dive into the data

The data comes from a fintech company wherein the goal is to build a model that can predict the loan application outcome (approved vs rejected) based on certain predictors. Hence, we are dealing with a binary classification problem. This is what the data looks like:

Dataset

The data is already cleaned (I prefer to do the cleaning in R and model building in Python) and ready to be used. In addition to the outcome variable i.e. application_outcome, we are concerned with the following predictors:

  • age: age of the applicant
  • car_type: type of car for which the loan is requested
  • Loanamount: loan amount applied for
  • Deposit: deposit that the applicant is willing to pay
  • area: based on the applicant’s postcode

Since we will be tuning the hyperparameters, it is wise to keep the train, validate, and test frames separate from each other so as to avoid accidental data leakage. We use the three frames as follows:

  • train several rival models on the train frame, each with a different hyperparameter combination.
  • select the ‘best’ model by scoring each one on the validate frame.
  • obtain an unbiased evaluation of the ‘best’ model on the unseen test frame.
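One way to obtain the three frames is H2OFrame's `split_frame` method. The sketch below is an assumption about how the split could be done, not the article's actual code: the 70/15/15 proportions, the seed, and the helper name `split_three_way` are all hypothetical.

```python
def split_three_way(frame, train_ratio=0.70, valid_ratio=0.15, seed=42):
    """Split an H2OFrame into train, validate, and test frames.

    The two ratios cover the first two splits; whatever remains
    (here roughly 15%) becomes the test frame. The proportions and
    the seed are illustrative assumptions.
    """
    train, validate, test = frame.split_frame(
        ratios=[train_ratio, valid_ratio], seed=seed
    )
    return train, validate, test

# Intended usage (requires a running H2O cluster and a loaded H2OFrame):
#   train, validate, test = split_three_way(data)
```

Note that `split_frame` assigns rows probabilistically, so the resulting frame sizes are approximate rather than exact.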

Let’s get to coding

Setting up a benchmark model

We begin by creating a simple Logistic Regression model without any hyperparameter tuning so we can compare its results with the final model where hyperparameter tuning has been achieved using grid search methods.
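A benchmark of this kind could look roughly like the sketch below. It is not the article's original code: the helper name `benchmark_auc` is made up, and `x`, `y`, `train`, `validate`, and `test` are assumed to be the predictor names, the outcome column, and the three frames described earlier.

```python
def benchmark_auc(model, x, y, train, validate, test):
    """Train an untuned model and return its AUC on the held-out test frame."""
    # Fit with default hyperparameters; no grid search involved.
    model.train(x=x, y=y, training_frame=train, validation_frame=validate)
    # Score on data the model has never seen.
    return model.model_performance(test).auc()

# Intended usage with H2O (requires a running H2O cluster):
#   from h2o.estimators.glm import H2OGeneralizedLinearEstimator
#   glm = H2OGeneralizedLinearEstimator(family='binomial')
#   print(benchmark_auc(glm, x, y, train, validate, test))
```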

Upon evaluating the model on the test set, we obtain an AUC = 0.76

Model metrics

While this is a decent AUC to begin with, let us see if we can improve on it. To get the best possible logistic regression model, we need to find good values for two hyperparameters: alpha (the elastic-net mixing parameter, which balances the L1 and L2 penalties) and lambda (the regularization strength).
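As a reference point, H2O's GLM describes alpha and lambda through the elastic-net penalty: lambda scales the overall penalty, and alpha decides how it is shared between the L1 and L2 terms. The small function below is only a sketch of that formula for intuition; the name `elastic_net_penalty` and the example coefficients are made up.

```python
def elastic_net_penalty(beta, alpha, lam):
    """Elastic-net penalty: lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2)."""
    l1 = sum(abs(b) for b in beta)       # lasso term
    l2_squared = sum(b * b for b in beta)  # ridge term
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2_squared)

beta = [1.0, -2.0]  # hypothetical coefficients
lasso_only = elastic_net_penalty(beta, alpha=1.0, lam=0.5)  # pure L1 penalty
ridge_only = elastic_net_penalty(beta, alpha=0.0, lam=0.5)  # pure L2 penalty
print(lasso_only, ridge_only)
```

With alpha = 1 only the L1 term remains (lasso), and with alpha = 0 only the L2 term remains (ridge); values in between blend the two.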

Cartesian Grid Search

1. Defining the search space

param = {
    'alpha': [x * 0.01 for x in range(0, 11)]
}

You may have noticed that we haven’t provided lambda in the grid parameters. That is because H2o has built-in automatic tuning for lambda, which we enable by setting lambda_search = True. Since this is a non-default model parameter that is not part of our grid, we pass it along to the grid via the H2OGridSearch.train() method (see below).

2. Initializing the grid-search instance

# Import the GLM estimator and H2O Grid Search:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.grid.grid_search import H2OGridSearch

h20_grid = H2OGridSearch(
    model = H2OGeneralizedLinearEstimator(family = 'binomial'),
    hyper_params = param,
    search_criteria = {'strategy': "Cartesian"},
    grid_id = 'glm_grid1'
)

3. Training the grid

h20_grid.train(
    x = x,
    y = y,
    training_frame = train,
    validation_frame = validate,
    lambda_search = True  # model parameter that we want to fix for every model in the grid
)

4. Finally getting the best hyperparameters for our model

h20_grid.get_grid(sort_by='auc', decreasing=True)
Model metrics

Thus, the best model has AUC = 0.796 when alpha = 0.1, a modest improvement over our previous AUC score. But remember, these results are based on the validation set. The sanity check using the test set follows shortly.

Random Grid Search

While Cartesian grid search suffices for our case, since we have only one hyperparameter to tune, we will still show how random grid search can be carried out. To do so, we enlarge the search space: alpha can now take any value between 0 and 0.98 (in steps of 0.01), whereas earlier it could only take values between 0 and 0.1.

params = {
    'alpha': [x * 0.01 for x in range(0, 99)]
}

The remaining steps are the same as before, with one small variation. When defining the grid search instance, we must now explicitly set the strategy to “RandomDiscrete”. We also specify a stopping criterion: the grid will generate at most 30 models.

search_criteria = {
    'strategy': 'RandomDiscrete',
    'max_models': 30  # max of 30 models assessed
}

# creating the grid of GLM
h2o_grid2 = H2OGridSearch(
    model = H2OGeneralizedLinearEstimator(family = 'binomial'),
    hyper_params = params,
    search_criteria = search_criteria,
    grid_id = 'glm_grid2'
)

Training and assessing the performance of this grid reveals that the best model outputs an AUC = 0.797 with alpha = 0.73.

Model metrics

P.S. There will be 30 models in the output. We present here only a subset of our results.

Comparison of the two grid search techniques

We save the top model from each grid and evaluate them using the held-out test frame.
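Saving the top model from a trained grid could be done along these lines. This is a sketch rather than the article's original code: the helper name `top_model` is hypothetical, and it assumes the grid has already been trained and that sorting by AUC in decreasing order puts the best model first.

```python
def top_model(grid, metric="auc", decreasing=True):
    """Return the best model of a trained grid according to `metric`."""
    # get_grid returns the grid with its models sorted by the chosen metric.
    ranked = grid.get_grid(sort_by=metric, decreasing=decreasing)
    # After sorting, the first model is the best one.
    return ranked.models[0]

# Intended usage with the two grids trained above (requires H2O):
#   best_cart_model = top_model(h20_grid)
#   best_random_model = top_model(h2o_grid2)
```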

# Assessing the best model obtained from cartesian grid search
best_cart_model.model_performance(test).auc()

We obtained AUC = 0.8030682697458484

# Assessing the best model obtained from random grid search
best_random_model.model_performance(test).auc()

We obtained AUC = 0.8034591194968553

We have improved upon our initial (untuned) model by using grid search techniques and increased AUC from 0.76 to 0.80.

In practice, there is no difference between the AUC values from Cartesian and random grid search. However, random search selects a much higher alpha value without any loss in AUC. Note that in H2o's GLM, alpha is the elastic-net mixing parameter rather than a learning rate: alpha = 0 gives a pure ridge (L2) penalty and alpha = 1 a pure lasso (L1) penalty. Since both low and high alpha values reach essentially the same AUC here, the choice between them can rest on other considerations: a higher alpha puts more weight on the L1 penalty and tends to yield a sparser model with fewer non-zero coefficients.

Bonus Section

There are smarter ways in which we can define the search criteria which can be useful when we are searching over many different hyperparameters and not just one like in our case.

Previously, we were simply using the max_models parameter to stop the random search process as soon as 30 models were generated. We can also stop the search once the AUC fails to improve by at least 1e-3 (i.e. 0.001) over three consecutive scoring rounds.

search_criteria = {
'strategy': 'RandomDiscrete',
'stopping_metric': 'AUC',
'stopping_tolerance': 1e-3,
'stopping_rounds': 3
}
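To build intuition for how a tolerance and a number of rounds interact, here is a deliberately simplified stopping rule in plain Python. It is not H2O's actual implementation (which works on moving averages of the metric); the function name `should_stop` and the score history are made up for illustration.

```python
def should_stop(scores, rounds=3, tolerance=1e-3):
    """Simplified early-stopping check for a metric where higher is better.

    Stop when the best of the last `rounds` scores does not beat the best
    earlier score by more than a relative `tolerance`.
    """
    if len(scores) <= rounds:
        return False  # not enough history to judge yet
    best_before = max(scores[:-rounds])
    best_recent = max(scores[-rounds:])
    return best_recent < best_before * (1 + tolerance)

# Hypothetical AUC history of successive models in a random search.
plateaued = [0.780, 0.795, 0.7951, 0.7949, 0.7952]
improving = [0.70, 0.75, 0.80, 0.85]
print(should_stop(plateaued), should_stop(improving))
```

A larger tolerance or fewer rounds makes the search stop sooner, trading some chance of finding a better model for less compute.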

Resources

Hopefully, this tutorial is a good starting point for you to play around with the hyperparameters in your models and improve their accuracy. You can find the full code here on Github.

Here is a link for all the hyperparameters you can grid search over in H2o, depending on your ML model.

Happy learning 🙂