Get Better fastai Tabular Model with Optuna

Source: Deep Learning on Medium

Note: this post uses fastai v1.0.58 (PyTorch v1.3.0) and Optuna v0.17.1.

Introduction

Optuna is a hyperparameter optimization framework applicable to machine learning frameworks and black-box optimization solvers. We can use Optuna with ease in our code by defining an objective function to be optimized. See the examples in the repository.

The fastai library makes it easy to try deep learning and bundles a number of recent techniques (best practices) that help us obtain competitive models: learn.lr_find for finding good learning rates, learn.fit_one_cycle for super-convergence, and MixUpCallback for mixup, a de facto standard data augmentation. It also supports data validation (Look at data | fastai) and inspection of trained computer vision models (Computer Vision Interpret [fastai]).

fastai has three applications: vision, text, and tabular. It focuses on fine-tuning in vision and text because there are plenty of neural networks pretrained on massive datasets, e.g., ImageNet for vision models and text collected from the web for language models. Those models can be said to have common sense (I mean they have enough basic knowledge to adapt quickly to new tasks). However, for the last application, tabular, there is no appropriate dataset for pretraining, because it seems almost impossible to define what a general feature of tabular tasks would be. If you look for tabular datasets on Kaggle, there are plenty of competitions, for example, Instacart, Rossmann, and Titanic.

Optimize TabularModel for Rossmann data

So, I’ll try to get a TabularModel trained on the Rossmann dataset that is better than the one obtained in fastai’s lecture, by letting Optuna find the optimal number of layers, the number of units in each layer, and the dropout ratios.

The task is https://www.kaggle.com/c/rossmann-store-sales#. In this competition, the goal is to create a model that predicts store sales over the coming six weeks. As you can see in the Data fields, there are both categorical and numerical features. In TabularModel, numerical features are handled as one vector, and each categorical feature is embedded into a vector. This technique is called Entity Embedding. Intuitively, entity embedding enables models to learn useful relationships between the values of a categorical feature from the training dataset. So TabularModel consists of embedding layers followed by groups of linear (a.k.a. dense), batchnorm, and dropout layers. The activation function is ReLU.
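As a rough illustration of the idea (a minimal pure-Python sketch, not fastai’s actual implementation; the feature names, cardinalities, and embedding sizes here are made up), each categorical value indexes into a table of learnable vectors, and the looked-up vectors are concatenated with the continuous features to form the input of the first linear layer:

```python
import random

random.seed(0)

def make_embedding(n_categories, emb_dim):
    # One learnable vector per category (here just small random floats).
    return [[random.uniform(-0.1, 0.1) for _ in range(emb_dim)]
            for _ in range(n_categories)]

# Toy embedding tables for two categorical features.
store_emb = make_embedding(n_categories=1115, emb_dim=4)  # e.g. Store id
dow_emb = make_embedding(n_categories=7, emb_dim=3)       # e.g. DayOfWeek

def embed_row(store_idx, dow_idx, continuous):
    # Look up one vector per categorical feature and concatenate
    # everything; this combined vector feeds the first linear layer.
    return store_emb[store_idx] + dow_emb[dow_idx] + continuous

x = embed_row(store_idx=10, dow_idx=2, continuous=[0.5, -1.2])
print(len(x))  # 4 + 3 + 2 = 9
```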

For those interested in the details of data processing, please see the lesson video. Here, I just use the preprocessing Jupyter notebook to get the same data used in the lecture.

In the original notebook, a model is defined as below.

learn = tabular_learner(data, layers=[1000,500], ps=[0.001,0.01], emb_drop=0.04, 
y_range=y_range, metrics=exp_rmspe)

This means that the model has two hidden layers, which apply dropout with ratios of 0.001 and 0.01, respectively. It also applies dropout with a ratio of 0.04 to the concatenated vector of categorical-feature embeddings. See the docs for the details. Therefore, I’ll use Optuna to find better values for

  • the number of layers
  • the number of units each layer has
  • the dropout ratio of each layer
  • the dropout ratio of a concatenated vector
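This search space can be expressed with trial.suggest_* calls inside the objective function. Below is a minimal sketch of how sampled values could map onto tabular_learner’s layers, ps, and emb_drop arguments; the ranges are made up, and the Trial class here is a tiny random stand-in so the sketch runs without Optuna installed, though its suggest_int and suggest_discrete_uniform methods mirror Optuna’s API.

```python
import random

class Trial:
    """Tiny stand-in for optuna.Trial (random sampling only)."""
    def suggest_int(self, name, low, high):
        return random.randint(low, high)

    def suggest_discrete_uniform(self, name, low, high, q):
        # Sample from {low, low + q, ..., high}.
        n_steps = int(round((high - low) / q))
        return low + q * random.randint(0, n_steps)

def sample_hyperparams(trial):
    # Number of hidden layers, units and dropout per layer, and
    # dropout on the concatenated embedding vector.
    n_layers = trial.suggest_int("n_layers", 2, 4)
    layers, ps = [], []
    for i in range(n_layers):
        layers.append(trial.suggest_int(f"n_units_layer_{i}", 50, 1200))
        ps.append(trial.suggest_discrete_uniform(f"dropout_p_layer_{i}", 0.0, 0.3, 0.05))
    emb_drop = trial.suggest_discrete_uniform("emb_drop", 0.0, 0.2, 0.05)
    return layers, ps, emb_drop

layers, ps, emb_drop = sample_hyperparams(Trial())
# These would then be passed on, e.g.:
# tabular_learner(data, layers=layers, ps=ps, emb_drop=emb_drop, ...)
```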

To use Optuna in your training scripts, the only thing to do is to define an objective function that takes optuna.Trial as its input and returns the value to optimize, for instance, the accuracy or loss on the validation dataset, as in the gist below.

https://gist.github.com/crcrpar/e5fa77c859fb4bd6ddac867c81c50bdf

Model Definition with Optuna

As a reference, in the original notebook, fit_one_cycle was run three times for five epochs each, and the validation `exp_rmspe` values were 0.105433, 0.116344, and 0.126323. With 100 Optuna trials, I got 0.102015. The details are below.

Best trial:
Value: 0.10201516002416611
Params:
n_layers: 3
n_units_layer_0: 800
dropout_p_layer_0: 0.1
n_units_layer_1: 900
dropout_p_layer_1: 0.2
emb_drop: 0.1

Faster Optimization with Pruning

While I did get a better model, Optuna ran every trial to completion (100 trials), roughly `100 trials x 5 epochs/trial = 500 epochs` in total. However, not all trials use reasonable hyperparameters, due to the randomness of hyperparameter sampling.

So, intuitively, we can apply early stopping to trials with bad hyperparameters to reduce the total time. This kind of early stopping in hyperparameter optimization is called pruning, and Optuna supports several pruning strategies, such as Successive Halving, as well as callbacks for popular machine learning frameworks such as Keras, MXNet, Chainer, and PyTorch Lightning. See the documentation for the list.
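To make the idea concrete, here is a minimal pure-Python sketch of a median-style pruning rule (one of the strategies Optuna offers, alongside Successive Halving): a running trial is stopped if its intermediate validation score at a given epoch is worse than the median score of previously completed trials at that epoch. The numbers are made up.

```python
import statistics

# Intermediate validation losses (lower is better) of completed trials,
# indexed as completed[trial][epoch].
completed = [
    [0.30, 0.20, 0.15],
    [0.35, 0.25, 0.18],
    [0.28, 0.19, 0.14],
]

def should_prune(epoch, value):
    """Prune if `value` is worse than the median of completed trials at `epoch`."""
    median = statistics.median(t[epoch] for t in completed)
    return value > median

print(should_prune(1, 0.40))  # clearly worse than the median -> True
print(should_prune(1, 0.10))  # clearly better -> False
```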

Implement FastAIPruningCallback(TrackerCallback)

In fastai, the training and validation loops are abstracted in learn.fit or learn.fit_one_cycle. Pruning is a variant of EarlyStopping; the only difference is that the pruning decision is made by optuna.trial.Trial, not the Learner. So I implemented the callback as in this PR for Optuna.
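The core of such a callback is small: at the end of each epoch it reports the monitored metric to the trial and raises an exception if the trial should be pruned. Below is a simplified stand-alone sketch of that pattern, not the actual PR code; Trial and TrialPruned are minimal stubs (the stub prunes as soon as the metric stops improving) so the sketch runs without fastai or Optuna installed.

```python
class TrialPruned(Exception):
    """Stub for optuna.exceptions.TrialPruned."""

class Trial:
    """Stub trial that prunes once the reported metric stops improving."""
    def __init__(self):
        self.values = []
    def report(self, value, step):
        self.values.append(value)
    def should_prune(self):
        return len(self.values) >= 2 and self.values[-1] >= self.values[-2]

class PruningCallback:
    """Mirrors the shape of a fastai TrackerCallback hook, called each epoch."""
    def __init__(self, trial):
        self.trial = trial
    def on_epoch_end(self, epoch, monitored_value):
        # Report the validation metric to the trial, then stop training
        # early if the trial is judged unpromising.
        self.trial.report(monitored_value, step=epoch)
        if self.trial.should_prune():
            raise TrialPruned(f"Trial pruned at epoch {epoch}")

# Simulated training loop: the validation metric worsens at epoch 2.
cb = PruningCallback(Trial())
pruned_at = None
for epoch, val in enumerate([0.30, 0.20, 0.25, 0.15]):
    try:
        cb.on_epoch_end(epoch, val)
    except TrialPruned:
        pruned_at = epoch
        break
print(pruned_at)  # 2
```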

By incorporating pruning, the final result might be less competitive than that of a study without pruning, because it is almost impossible to predict learning curves precisely. However, the optimization time should be reduced a lot. The result is as follows:

Study statistics:
Number of finished trials: 100
Number of pruned trials: 63
Number of complete trials: 37
Best trial:
Value: 0.10323499143123627
Params:
n_layers: 3
n_units_layer_0: 900
dropout_p_layer_0: 0.1
n_units_layer_1: 1100
dropout_p_layer_1: 0.15000000000000002
emb_drop: 0.05

The table below summarizes this post. Training was done on a GTX 1080 Ti.

https://gist.github.com/crcrpar/cfe439702e73088c841d2e875222c8a0

Also, as the table shows, the total time needed by Optuna is reduced from 852 to 555 minutes, a reduction of about 35%.

How Pruning Affects Study Time

The script used in this blog post is available at https://github.com/crcrpar/fastai-optuna-rossman.