AutoGluon: Deep Learning AutoML

Source: Deep Learning on Medium

Authors: Nick Erickson, Jonas Mueller, Hang Zhang, Balaji Kamakoti

Thanks to Aaron Markham, Mu Li, Matthias Seeger, Talia Chopra, and Sheng Zha for their early feedback and edits.

Introducing AutoGluon

AutoGluon is a new open source AutoML library that automates deep learning (DL) and machine learning (ML) for real-world applications involving image, text, and tabular datasets. Whether you are new to ML or an experienced practitioner, AutoGluon will simplify your workflow. With AutoGluon, you can develop and refine state-of-the-art DL models using just a few lines of Python code. In this post, we explain the benefits of AutoGluon, demonstrate how to install it on Linux, and show how to get started using AutoGluon to solve real-world problems with state-of-the-art performance within minutes.

Motivation and Key Features

Historically, achieving state-of-the-art ML performance required extensive background knowledge, experience, and human effort. Data preparation, feature engineering, validation splitting, missing value handling, and model selection are just a few of the many tasks that must be addressed in ML applications. One particularly difficult task is the selection of hyperparameters.

Hyperparameters represent the many choices that must be made by the user when constructing a model, such as the data processing steps, neural network architecture, and the optimizer used during training. Each hyperparameter affects the predictive performance of the resulting model in an opaque fashion, and more powerful models (like deep neural networks) tend to have more hyperparameters to tune. Slight hyperparameter modifications may significantly alter the model quality. As it’s usually unclear how to make these decisions, developers typically tweak various aspects of their ML pipeline by hand in order to achieve strong predictive performance in practice, which can take many iterations and painstaking human effort.

AutoGluon automates all of the previously mentioned tasks, creating a truly hands-off-the-wheel experience. Rather than spending your own valuable time managing these experiments or learning how to do them in the first place, you can simply specify when you’d like to have your trained model ready, and AutoGluon will leverage the available compute resources to find the strongest ML methods within its allotted run-time.

AutoGluon enables you to automatically achieve state-of-the-art performance on tasks such as image classification, object detection, text classification, and supervised learning with tabular datasets. The hyperparameters of each task are automatically selected using advanced tuning algorithms such as Bayesian Optimization, Hyperband, and Reinforcement Learning. With AutoGluon, you don’t have to have any familiarity with the underlying models, as all hyperparameters will be automatically tuned within default ranges that are known to perform well for the particular task and model.
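
To give a feel for how such tuning algorithms spend a compute budget, here is a minimal sketch of successive halving, the subroutine that Hyperband builds on. It is not AutoGluon's implementation: the search ranges, the `toy_score` function, and all names below are hypothetical stand-ins, and a real tuner would train and validate an actual model at each step.

```python
import math
import random


def sample_config(rng):
    """Draw one hypothetical configuration from made-up default ranges."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform in [1e-4, 1e-1]
        "num_layers": rng.choice([2, 3, 4]),
    }


def toy_score(config, budget):
    """Synthetic stand-in for 'train for `budget` epochs, return validation accuracy'.

    The score peaks near learning_rate = 1e-2 and grows with the budget.
    """
    distance = abs(math.log10(config["learning_rate"]) + 2)
    return (1 - 0.2 * distance) * (1 - 1 / (budget + 1))


def successive_halving(num_configs=8, min_budget=1, seed=0):
    """Repeatedly keep the better half of the candidates and double their budget."""
    rng = random.Random(seed)
    survivors = [sample_config(rng) for _ in range(num_configs)]
    budget = min_budget
    rounds = 0
    while len(survivors) > 1:
        # Evaluate every surviving configuration at the current budget.
        ranked = sorted(survivors, key=lambda c: toy_score(c, budget), reverse=True)
        # Keep the better half; the survivors get twice the budget next round.
        survivors = ranked[: len(ranked) // 2]
        budget *= 2
        rounds += 1
    return survivors[0], rounds


best_config, num_rounds = successive_halving()
print(best_config, num_rounds)
```

The design idea is that poor configurations are discarded after only a small budget, so most of the compute goes to the candidates that look promising early on.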

For expert ML practitioners, AutoGluon allows this process to be easily customized. For example, you can specify ranges of values to consider for certain hyperparameters, and also use AutoGluon to automatically tune various aspects of your own custom models. If you have access to multiple machines, AutoGluon can easily distribute its computation across them in order to return trained models more quickly.

AutoGluon by Example

Installation

Before training models, you will have to install AutoGluon. AutoGluon is supported on Linux, with macOS and Windows support coming soon. AutoGluon can be installed by following the official installation instructions.

For this demonstration, a Linux machine with a GPU was used. To install AutoGluon for Linux with GPU support, run the following commands in a terminal (for a CPU-only installation, refer to the installation wiki):

# CUDA 10.0 and a GPU for object detection is recommended
# We install MXNet to utilize deep learning models
pip install --upgrade mxnet-cu100
pip install autogluon

Object Detection Example

We adopt the task of object detection as an example to demonstrate AutoGluon’s simple interface. In object detection, one aims to not only identify objects in an image, but also localize them with a bounding box.
We will use AutoGluon to train an object detector on a small toy dataset created for demo purposes (to ensure quick runtimes). The dataset was generated using the motorbike category of the VOC dataset[1]. In the Python code below, we first import AutoGluon, specify object detection as the task of interest (ObjectDetection as task), download the data onto our machine, and finally load it into Python:

import autogluon as ag
from autogluon import ObjectDetection as task
url = 'https://autogluon.s3.amazonaws.com/datasets/tiny_motorbike.zip'
data_dir = ag.unzip(ag.download(url))
dataset = task.Dataset(data_dir, classes=('motorbike',))

Next, we can train a detector model using AutoGluon by simply calling the fit() function:

detector = task.fit(dataset)

In this single call to fit(), AutoGluon trains many models under different network configurations and optimization hyperparameters, selecting the best of them as the final detector to return. Without any user input, the call to fit() also automatically utilized state-of-the-art deep learning techniques such as transfer learning of a pre-trained YOLOv3 network. We can test the trained detector on a new image using the predict() method:

url = 'https://autogluon.s3.amazonaws.com/images/object_detection_example.png'
filename = ag.download(url)
index, probabilities, locations = detector.predict(filename)

AutoGluon’s predict function automatically loads the test image and outputs the predicted object category, class probability, and bounding box location for each detected object. A visualization image is also generated automatically; in it, the motorbikes are detected and localized with reasonable accuracy, despite the detector having been trained on only a very small dataset. For a full tutorial on using AutoGluon for object detection, please visit the AutoGluon website.

Tabular Data Example

The most commonly encountered form of data is tabular datasets. These consist of structured data usually found in a comma-separated values (CSV) file or a database. In tabular datasets, each column represents the measurements of some variable (a.k.a. feature), and the rows represent individual data points. AutoGluon can be used to train models that predict a particular column’s value based on the other columns in the same row, and that generalize to previously unseen examples.

The dataset we will be training on is the Adult Income Classification dataset[2]. This dataset contains information about ~48,000 individuals, including numeric features such as age and categorical features such as occupation. The dataset is often used to predict an individual’s income. In this example, we will predict whether an individual earns more than $50,000 per year. We will use 80% of the data for training AutoGluon and 20% of the data to test the resulting AutoGluon predictor. With AutoGluon, there is no need to specify validation data. AutoGluon will automatically allocate a validation set using the training data provided.
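
The splits described above can be sketched in plain Python. This is only an illustration of the idea, not how AutoGluon allocates its validation set internally: the `holdout_split` helper, the 10% validation fraction, and the random seeds are all assumptions made for the example.

```python
import random


def holdout_split(rows, holdout_frac=0.2, seed=0):
    """Shuffle row indices and hold out a fraction of the rows."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    num_holdout = int(round(len(rows) * holdout_frac))
    holdout_idx = set(indices[:num_holdout])
    kept = [row for i, row in enumerate(rows) if i not in holdout_idx]
    held_out = [row for i, row in enumerate(rows) if i in holdout_idx]
    return kept, held_out


rows = list(range(100))  # stand-in for 100 data rows

# 80% for training, 20% for testing, as in this post.
train_rows, test_rows = holdout_split(rows, holdout_frac=0.2)

# Hypothetical 10% validation set carved out of the training rows.
fit_rows, val_rows = holdout_split(train_rows, holdout_frac=0.1, seed=1)

print(len(fit_rows), len(val_rows), len(test_rows))  # 72 8 20
```

The test rows are set aside first and never touched during training, so the validation set is always drawn from the training portion only.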

As an example, we provide Python code that first imports AutoGluon and specifies a task where we will work with tabular data using TabularPrediction. Then we load the Dataset from a CSV file hosted on S3. With just a single call to fit(), AutoGluon processes the data and trains a diverse ensemble of ML models called a “predictor”, which is able to predict the “class” variable in this data. It uses the other columns as predictive features, such as the individual’s age, occupation, and education. This ensemble includes tried and tested algorithms known within the ML competition community for their quality, robustness, and speed, such as LightGBM, CatBoost, and deep neural networks, which often outperform more traditional ML models such as logistic regression.

Note that we don’t need to do any data processing, feature engineering, or even declare the type of prediction problem. AutoGluon automatically prepares the data and infers whether our problem is regression or classification (including whether it is binary or multiclass). The trained predictor model will be saved to the location specified in the task.fit() call.

from autogluon import TabularPrediction as task
train_path = 'https://autogluon.s3.amazonaws.com/datasets/AdultIncomeBinaryClassification/train_data.csv'
train_data = task.Dataset(file_path=train_path)
predictor = task.fit(train_data, label='class', output_directory='ag-example-out/')
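
To make the idea of inferring the prediction problem concrete, here is a deliberately simple heuristic that looks only at the label column. It is a sketch under stated assumptions, not AutoGluon's actual rule: the function name and the `max_classes_for_numeric` threshold are hypothetical.

```python
def infer_problem_type(labels, max_classes_for_numeric=10):
    """Guess the prediction problem from the label column alone."""
    unique_values = set(labels)
    if len(unique_values) == 2:
        return "binary"
    # Booleans are excluded because bool is a subclass of int in Python.
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in unique_values
    )
    # Many distinct numeric values suggest a continuous target (assumed cutoff).
    if all_numeric and len(unique_values) > max_classes_for_numeric:
        return "regression"
    return "multiclass"


print(infer_problem_type([" <=50K", " >50K", " <=50K"]))  # binary
print(infer_problem_type(["cat", "dog", "bird", "cat"]))  # multiclass
print(infer_problem_type([0.5 * i for i in range(50)]))   # regression
```

For the “class” column used in this post, which takes the two values shown in the first call, such a heuristic would report a binary classification problem.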

Now that our predictor model is trained, we will make predictions on previously unseen test data. We can either directly use the returned predictor or load it from the output directory we specified.

predictor = task.load('ag-example-out/')
test_path = 'https://autogluon.s3.amazonaws.com/datasets/AdultIncomeBinaryClassification/test_data.csv'
test_data = task.Dataset(file_path=test_path)
y_test = test_data['class']
test_data_nolabel = test_data.drop(labels=['class'], axis=1)
y_pred = predictor.predict(test_data_nolabel)
y_pred_proba = predictor.predict_proba(test_data_nolabel)
print(list(y_pred[:5]))
print(list(y_pred_proba[:5]))

[' <=50K', ' <=50K', ' >50K', ' <=50K', ' <=50K']
[0.077471, 0.0093894, 0.973065, 0.0021249, 0.001387]
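
The two printed lists are related: the sketch below reproduces the predicted labels from the probabilities. It assumes that each probability refers to the ' >50K' class and that a 0.5 cutoff is used; both are assumptions that happen to be consistent with the five values shown, and the helper name is hypothetical.

```python
# Values copied from the printed output above.
y_pred = [' <=50K', ' <=50K', ' >50K', ' <=50K', ' <=50K']
y_pred_proba = [0.077471, 0.0093894, 0.973065, 0.0021249, 0.001387]


def labels_from_proba(probabilities, positive=' >50K', negative=' <=50K', threshold=0.5):
    """Map positive-class probabilities to labels using an assumed cutoff."""
    return [positive if p > threshold else negative for p in probabilities]


thresholded = labels_from_proba(y_pred_proba)
print(thresholded == y_pred)  # True
```

Raising or lowering the threshold trades off how readily an individual is assigned to the ' >50K' class.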

Now we will take a look at the model leaderboard:

leaderboard = predictor.leaderboard(test_data)
[Figure: AutoGluon’s model leaderboard]

This leaderboard shows each of the models trained by AutoGluon, their scores on the test and validation data, and training time in seconds. As can be seen, the weighted_ensemble performed the best on both validation and test sets, achieving an accuracy of 87.76%, a very strong result[3] for this problem.
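
As a rough intuition for what a weighted ensemble does, the sketch below blends the predicted probabilities of two models with fixed weights. The model names, weights, and probabilities are made up for illustration, and this post does not describe how AutoGluon chooses its ensemble weights, so treat it as a generic weighted average rather than AutoGluon's procedure.

```python
def weighted_ensemble(model_probas, weights):
    """Blend per-model probabilities row by row using normalized weights."""
    total_weight = sum(weights.values())
    names = list(weights)
    num_rows = len(model_probas[names[0]])
    return [
        sum(weights[name] * model_probas[name][i] for name in names) / total_weight
        for i in range(num_rows)
    ]


# Hypothetical positive-class probabilities from two models on three rows.
model_probas = {
    "model_a": [0.9, 0.2, 0.6],
    "model_b": [0.7, 0.4, 0.2],
}
weights = {"model_a": 3.0, "model_b": 1.0}  # made-up weights favoring model_a

blended = weighted_ensemble(model_probas, weights)
print([round(p, 2) for p in blended])
```

Because the weights are normalized, each blended value stays between the lowest and highest probability that the individual models assign to that row.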

For a full tutorial on using AutoGluon for supervised learning with tabular data, please see the AutoGluon Tabular Prediction tutorials.

Learn more and contribute

In this post, we introduced AutoGluon, our humble effort to offer the best ML and deep learning experience for both ML experts and newcomers. This library is intended not only to be trivial to use, but also to enable high-quality models that outperform other ML methods across diverse applications. While this post focused on object detection and prediction with tabular data, AutoGluon can be applied just as easily to other tasks, including text and image classification. AutoGluon can even be used to refine arbitrary ML tasks involving custom-built models (in both MXNet and PyTorch).

We welcome the community’s participation in our journey. Head over to the AutoGluon GitHub repository to get started, and check out the tutorials on the AutoGluon website to quickly try out sophisticated AutoML solutions in your applications. We are eager to hear your results and feedback!

Citations

[1] Everingham, Mark, et al. “The pascal visual object classes challenge: A retrospective.” International journal of computer vision 111.1 (2015): 98–136.
[2] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[3] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: networked science in machine learning. SIGKDD Explorations 15(2), pp 49–60, 2013.

Resources

AutoGluon Website
AutoGluon GitHub Repository
Dive into Deep Learning
MXNet Gluon