Hyperparameter Tuning to Reduce Overfitting — LightGBM

Originally published by Soner Yıldırım in Artificial Intelligence on Medium



Demonstrated with examples


Easy access to enormous amounts of data and high computing power have made it possible to design complex machine learning algorithms. As model complexity increases, the amount of data required to train it also increases.

Data is not the only factor in the performance of a model. Complex models have many hyperparameters that need to be correctly adjusted or tuned in order to make the most out of them.

For instance, the performance of XGBoost and LightGBM depends heavily on hyperparameter tuning. Using these algorithms without carefully adjusting the hyperparameters would be like driving a Ferrari at 50 mph.

In this post, we will experiment with how the performance of LightGBM changes based on hyperparameter values. The focus is on the parameters that help to generalize the models and thus reduce the risk of overfitting.

Let’s start with importing the libraries.

import pandas as pd
from sklearn.model_selection import train_test_split
import lightgbm as lgb

The dataset contains 60k observations, 99 numerical features, and a target variable.

(image by author)

The target variable takes 9 distinct values, which makes this a multi-class classification task.
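The data loading step is not shown in this post; here is a minimal sketch, assuming the data sits in a CSV file (the filename below is hypothetical):

# hypothetical filename; replace with the actual location of the dataset
df = pd.read_csv('dataset.csv')

# expected shape: roughly 60k rows and 100 columns (99 features + the target)
print(df.shape)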

Our focus is hyperparameter tuning, so we will skip the data wrangling part. The following code block splits the dataset into train and test subsets and converts them to a format suitable for LightGBM.

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

lgb_train = lgb.Dataset(X_train, y_train)
lgb_test = lgb.Dataset(X_test, y_test)

We will start with a basic set of hyperparameters and introduce new ones step by step.

params = {
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'metric': 'multi_logloss',
    'num_class': 9
}

We can now train the model and see the results based on the specified evaluation metric.

gbm = lgb.train(
    params,
    lgb_train,
    num_boost_round=500,
    valid_sets=[lgb_train, lgb_test],
    early_stopping_rounds=10
)

The evaluation metric is multi-class log loss. Here are the results on both the training and validation sets.

(image by author)

The number of boosting rounds is set to 500, but early stopping kicked in before that. The early_stopping_rounds parameter stops training if the validation metric does not improve within the specified number of rounds.
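A side note: in recent LightGBM releases (4.0 and later), early_stopping_rounds is no longer an argument of lgb.train and early stopping is configured through a callback instead. A minimal sketch of the equivalent call:

# equivalent call for LightGBM >= 4.0, where early stopping is a callback
gbm = lgb.train(
    params,
    lgb_train,
    num_boost_round=500,
    valid_sets=[lgb_train, lgb_test],
    callbacks=[lgb.early_stopping(stopping_rounds=10)]
)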

The model seems to be overfitting heavily to the training set, because there is a significant gap between the training and validation losses.

The min_data_in_leaf parameter is one way to reduce overfitting. It requires each leaf to contain at least the specified number of observations, so that individual leaves do not become too specific to the training data.

'min_data_in_leaf':300 #added to params dict

(image by author)

The validation loss is almost the same, but the gap between the training and validation losses got smaller, which means the degree of overfitting is reduced.
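For reference, the one-line snippets in this post translate into the following update-and-retrain step (the same training call as before, with the new key added to the existing params dict):

# add the new parameter to the existing dict and retrain
params['min_data_in_leaf'] = 300

gbm = lgb.train(
    params,
    lgb_train,
    num_boost_round=500,
    valid_sets=[lgb_train, lgb_test],
    early_stopping_rounds=10
)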

Another parameter that prevents the model from becoming too specific is feature_fraction, which is the fraction of features randomly selected at each iteration.

'feature_fraction':0.8 #added to params dict

Now the model uses 80% of the features at each iteration. Here is the result.

(image by author)

The overfitting is further reduced.

The bagging_fraction parameter makes each iteration use a randomly selected fraction of the rows. It is similar to feature_fraction but for rows. The bagging_freq parameter specifies how often, in iterations, the selected rows are resampled.

#added to params dict
'bagging_fraction':0.8,
'bagging_freq':10

(image by author)

The difference between the training and validation losses is decreasing, which indicates we are on the right track.

LightGBM is an ensemble method that uses boosting to combine decision trees. The complexity of the individual trees is also a determining factor in overfitting. It can be controlled with the max_depth and num_leaves parameters. The max_depth sets the maximum depth of a tree, while num_leaves limits the maximum number of leaves a tree can have. Since LightGBM grows trees leaf-wise, it is important to adjust these two parameters together: a tree of depth d can have at most 2^d leaves, so num_leaves is usually kept well below 2^max_depth (here 70 < 2^8 = 256).

Another important parameter is the learning_rate. Smaller learning rates usually generalize better, but they make the model learn more slowly.

We can also add a regularization term as a hyperparameter. LightGBM supports both L1 and L2 regularization.
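The regularization terms are not tuned in this post, but for reference, their strength is set through the lambda_l1 and lambda_l2 parameters; the values below are only illustrative, not tuned for this dataset:

# illustrative values only; these would need tuning for a real dataset
params['lambda_l1'] = 0.1  # L1 regularization
params['lambda_l2'] = 0.1  # L2 regularization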

#added to params dict
'max_depth':8,
'num_leaves':70,
'learning_rate':0.04

(image by author)

We’ve further decreased the difference between the training and validation losses, which means less overfitting.

The number of iterations is also an important factor in model training. More iterations allow the model to learn more from the training data, so after a certain number of iterations it starts to overfit.
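With early stopping enabled, the booster records the best round on the validation set, which is a convenient way to choose the number of iterations. A minimal sketch, using the gbm booster trained above:

# round with the best validation score, recorded by early stopping
print(gbm.best_iteration)

# limit predictions to that round
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)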

You may need to spend a good amount of time tuning the hyperparameters. Eventually, you will develop your own strategy that expedites the tuning process.

There are lots of hyperparameters. Some matter more for accuracy and speed, while others are mainly used to prevent overfitting.

Cross-validation can be used to reduce overfitting as well. It allows each data point to be used in both the training and validation sets, which gives a more reliable picture of how the model generalizes.
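As a rough sketch (reusing the params and lgb_train objects from above, and with the same early_stopping_rounds caveat for LightGBM 4.0+ noted earlier), LightGBM's built-in lgb.cv function runs k-fold cross-validation and returns the per-round mean and standard deviation of the evaluation metric:

# 5-fold cross-validation with the same parameters and early stopping
cv_results = lgb.cv(
    params,
    lgb_train,
    num_boost_round=500,
    nfold=5,
    stratified=True,
    early_stopping_rounds=10
)

# a dict of per-round metric histories (mean and standard deviation)
print(list(cv_results.keys()))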