Source: Deep Learning on Medium
Get Better fastai Tabular Model with Optuna
Note: this post uses fastai v1.0.58 (PyTorch v1.3.0) and Optuna.
Optuna is a hyperparameter optimization framework applicable to machine learning frameworks and black-box optimization solvers. We can use Optuna in our code with ease by defining an
objective function to be optimized. See the examples in the repository.
The fastai library makes it easy to try deep learning and provides a bunch of the latest techniques (best practices) that enable us to obtain competitive models:
- learn.lr_find for finding a good learning rate,
- learn.fit_one_cycle for super-convergence, and
- MixUpCallback for the de facto standard data augmentation.

Also, it supports features for data validation (Look at data | fastai) and investigation of trained computer vision models (Computer Vision Interpret | fastai).
fastai has three applications: vision, text, and tabular. It focuses on fine-tuning in vision & text because there are a ton of neural network models pretrained on massive datasets, e.g., ImageNet for vision models and text collected from the web for language models. Those models are said to have common sense (I mean they have enough basic knowledge to adapt quickly to new tasks). However, as to the last application, tabular, there are no appropriate datasets for pretraining because it seems almost impossible to define what a general feature of tabular tasks would be. If you look for a tabular dataset on Kaggle, there are a bunch of competitions, for example, Instacart, Rossmann, and Titanic.
TabularModel for Rossmann data
So, I’ll try to get a better
TabularModel trained on the Rossmann dataset than the one obtained in fastai’s lecture, by letting Optuna find the optimal number of layers, the number of units in each layer, and the dropout ratios.
The task is https://www.kaggle.com/c/rossmann-store-sales#. In this competition, participants are expected to create a model that predicts store sales for the coming six weeks. As you can see in the Data fields, there are both categorical and numerical features. In
TabularModel, the numerical features are handled as one vector, while each categorical feature is embedded into a vector. This technique is called Entity Embedding. Intuitively, Entity Embedding enables models to learn useful relationships between the values of categorical features from the training dataset. So,
TabularModel has some embedding layers followed by groups of linear (a.k.a. dense), batchnorm, and dropout layers. The activation function is ReLU.
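To make the idea concrete, here is a toy sketch of Entity Embedding in plain PyTorch (this is not fastai’s actual code; the feature name and sizes are made up for illustration):

```python
import torch
import torch.nn as nn

# Toy illustration of Entity Embedding: a categorical feature such as
# "StoreType" with 4 distinct values is mapped to a learned 3-d vector.
store_type_emb = nn.Embedding(num_embeddings=4, embedding_dim=3)

# A mini-batch of two rows: categorical values are integer indices, and
# the 5 numerical features stay together as one vector per row.
store_type = torch.tensor([0, 2])
cont_feats = torch.randn(2, 5)

# As in TabularModel, the embedded categoricals are concatenated with the
# numerical vector before being fed to the first linear layer.
x = torch.cat([store_type_emb(store_type), cont_feats], dim=1)
print(x.shape)  # torch.Size([2, 8])
```

During training, the embedding weights are updated by backpropagation just like any other layer, which is how the model learns relationships between category values.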
In the original notebook, a model is defined as below.
learn = tabular_learner(data, layers=[1000,500], ps=[0.001,0.01], emb_drop=0.04,
                        y_range=y_range, metrics=exp_rmspe)
This means that the model has two hidden layers (1000 and 500 units), which apply dropout with ratios of
0.001 and 0.01, respectively. Also, it applies dropout with a ratio of
0.04 to the concatenated vector of the categorical features’ embeddings. See the docs for the details. Therefore, I’ll try Optuna to find better values for
- the number of layers
- the number of units each layer has
- the dropout ratio of each layer
- the dropout ratio of the concatenated embedding vector
To use Optuna in your training script, the only thing to do is define an
objective function which takes an
optuna.Trial as its input and returns the value to optimize, for instance, the accuracy or loss on the validation dataset, as in the gist below.
Model Definition with Optuna
As a reference, in the original notebook,
fit_one_cycle was called three times, each fitting running for five epochs. The validation `exp_rmspe` values were 0.105433, 0.116344, and 0.126323. With Optuna optimization over 100 trials, I got 0.102660. The details are below.
Faster Optimization with Pruning
While I did get a better model, Optuna ran every one of the 100 trials to completion: approximately `100 trials x 5 epochs/trial = 500 epochs`. However, not all trials use reasonable hyperparameters, due to the randomness of hyperparameter sampling.
So, intuitively, we can apply early stopping to trials with bad hyperparameters to reduce the total time. This kind of early stopping in hyperparameter optimization is called Pruning, and Optuna supports several pruning strategies, such as Successive Halving, along with callbacks for popular machine learning frameworks such as Keras, MXNet, Chainer, and PyTorch Lightning. See the documentation for the list.
In fastai, the training and validation loops are abstracted away in
learn.fit_one_cycle. Pruning is a variant of
EarlyStopping; the main difference is that the decision to stop is made by Optuna’s pruner rather than by the
Learner itself. So I implemented the callback as in this PR for Optuna.
By incorporating pruning, the final result might be less competitive than that of a study without pruning because it is almost impossible to predict learning curves precisely. However, the time of optimization should be reduced a lot. And the result is as follows:
Number of finished trials: 100
Number of pruned trials: 63
Number of complete trials: 37
The table below summarizes this post. Training was done on a GTX 1080 Ti.
Also, as the table shows, the total time Optuna needs is reduced from
555 minutes by about 35%.
The script used in this blog post is available at https://github.com/crcrpar/fastai-optuna-rossman.