Cage Match: XGBoost vs. Keras Deep Learning

Original article was published on Deep Learning on Medium

Cage Match (illustration by author)

Ever since I had my first taste of deep learning, I have been interested in applying it to structured, tabular data. I have written several articles on the subject and I am writing a book, Deep Learning with Structured Data, for Manning Publications. It would be great to tackle problems with structured tabular data by harnessing deep learning’s flexibility and potential for reduced feature engineering.

The idea of using deep learning on tabular data is not without its critics. A consistent objection I have heard is that non-deep learning approaches, XGBoost in particular, are simpler to code, easier to interpret, and have better performance. I decided that I needed to put this assertion to the test with the major example from my book: predicting delays on the Toronto streetcar network. The city of Toronto publishes a dataset that describes every streetcar delay since January 2014. The challenge is to use this dataset to train a machine learning model that can predict whether a given streetcar trip will be delayed.

To illustrate the key points of the book, I created a deep learning approach to the streetcar delay prediction problem using a Keras functional model. This solution includes a set of modules to clean up the data, build and train the model, and deploy the trained model. To make a fair comparison between the two machine learning approaches, my goal was to replace the Keras deep learning model with XGBoost with minimal changes to the rest of the code. Imagine that the whole solution, from ingestion of the raw data to deployment of the trained model, is a car. I wanted to replace the car’s engine (the machine learning model) without altering the bodywork, electrical system, interior, or any other aspects of the car.

Swapping out the engine and leaving the rest of the car unchanged (illustration by author)

I was pleasantly surprised by how easy it was to replace the Keras deep learning model with XGBoost. The following sections describe the steps I took to convert the notebook that contains the code to train the Keras model into a notebook that trains an XGBoost model.

Refactor the data used to train and test the model

The deep learning model is a multi-input Keras functional model that expects to be trained on a list of numpy arrays, as shown in the following snippet:
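The snippet itself is an image in the original post and does not survive in this text version. A minimal sketch of a multi-input functional model and its fit call on a list of numpy arrays might look like the following (the layer sizes, input names, and column count here are illustrative assumptions, not the book's actual model):

```python
import numpy as np
from tensorflow.keras.layers import Input, Dense, Concatenate
from tensorflow.keras.models import Model

# Two illustrative inputs: one continuous column and one integer-encoded
# categorical column. The real model in the book has many more inputs.
cont_in = Input(shape=(1,), name="continuous_input")
cat_in = Input(shape=(1,), name="categorical_input")

merged = Concatenate()([cont_in, cat_in])
hidden = Dense(8, activation="relu")(merged)
output = Dense(1, activation="sigmoid")(hidden)

model = Model(inputs=[cont_in, cat_in], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# A multi-input Keras model is trained on a *list of numpy arrays*,
# one array per input:
X_train_list = [np.random.rand(32, 1),
                np.random.randint(0, 5, size=(32, 1))]
y_train = np.random.randint(0, 2, size=(32, 1))
history = model.fit(X_train_list, y_train, epochs=1, verbose=0)
```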

In contrast, the XGBoost model expects to be trained on a numpy array of lists. I needed to convert the training and test data from the format expected by Keras into the format expected by XGBoost. First, I converted the test and train datasets from lists of numpy arrays into lists of lists:
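The conversion cell is not reproduced in this version of the article; the for loop could look something like the sketch below (the variable names are my assumptions):

```python
import numpy as np

# The format Keras expects: a list of numpy arrays, one per input column.
X_train_list = [np.array([1.0, 2.0, 3.0]),   # e.g. a continuous column
                np.array([0, 1, 0])]          # e.g. a categorical column

# Step 1: convert the list of numpy arrays into a list of plain lists.
X_train_lists = []
for col in X_train_list:
    X_train_lists.append(col.tolist())
```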

I cringed a bit at using a for loop to do this — I am sure there is a more Pythonic way — but this cell ran quickly enough and I wanted to have code that was easily readable.

Next, I converted each of the lists of lists from the previous step into a numpy array of lists, transposed to get the correct organization of the data:
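Again the original cell is an image; a sketch of this second conversion step (using the same hypothetical variable names as above) could be:

```python
import numpy as np

# Lists of lists, one inner list per *column* (the output of the previous step).
X_train_lists = [[1.0, 2.0, 3.0],   # column 1
                 [0, 1, 0]]          # column 2

# Convert to a numpy array and transpose so each *row* holds one sample's
# values across all columns -- the layout XGBoost expects.
X_train_xgb = np.array(X_train_lists).transpose()

print(X_train_xgb.shape)   # (3, 2): 3 samples, 2 features
print(X_train_xgb)
```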

The output of these transformations is the data in the form we want for XGBoost — a numpy array of lists:

The following diagram shows how the values from the original form of the data (a list of numpy arrays) end up in the target form of the data (a numpy array of lists):

Translation of data from the format required by Keras to the format required by XGBoost

Train and apply the XGBoost model

Now that I had the data in the format required by XGBoost, I was ready to train the XGBoost model. The following snippet shows the code to train and save the model:

I used a single non-default parameter for the XGBoost model: scale_pos_weight, set to one_weight. (Strictly speaking, scale_pos_weight is set when the XGBClassifier is created rather than in the fit statement itself.) This parameter accounts for the imbalance in the dataset between the negative case (no streetcar delay) and the positive case (streetcar delay): only about 2% of the records in the dataset represent streetcar delays. The value of scale_pos_weight is identical to the value used in the fit statement for the Keras model, where the “1” entry of the class_weight parameter is set to one_weight, as shown in the following snippet:
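The fit statements are not reproduced in this version of the article. A common way to compute such a weight, and how it would appear in both APIs, is sketched below (the name one_weight comes from the article; the ratio-based computation is my assumption):

```python
import numpy as np

# Labels with a rare positive class (~2% positives, like the streetcar data).
y_train = np.zeros(100, dtype=int)
y_train[:2] = 1

# One common choice: weight positives by the negative/positive ratio.
one_weight = (y_train == 0).sum() / (y_train == 1).sum()
print(one_weight)   # 49.0

# XGBoost: XGBClassifier(scale_pos_weight=one_weight)
# Keras:   model.fit(..., class_weight={0: 1.0, 1: one_weight})
```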

Next, I applied the trained model to the test set and got predictions from the model for the test set.

And finally I assessed the accuracy of the XGBoost model:
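The evaluation cell is likewise an image in the original. A sketch of this kind of assessment using scikit-learn metrics on illustrative predictions (the values below are synthetic, not the article's actual results):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

# Illustrative ground truth and model predictions for a rare-positive problem.
y_test = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

print("accuracy:", accuracy_score(y_test, y_pred))   # 0.8
print("recall:  ", recall_score(y_test, y_pred))     # 0.5
print(confusion_matrix(y_test, y_pred))
```

Accuracy alone is misleading on a dataset that is ~98% negative, which is why recall features so prominently in the comparison below.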

Comparing the XGBoost and Keras Results

Now that we have results for the trained XGBoost model, we can compare the overall characteristics of the solution using Keras deep learning with the solution using XGBoost. The following table summarizes the results:

XGBoost vs. Keras result summary

Let’s look at each comparison category in a bit more detail:

  • XGBoost is the winner for performance, especially recall. Recall is critical for this use case because we want to minimize false negatives: cases where the model predicts no delay but a delay occurs. A false positive (a predicted delay that never happens) is not that costly, because a user who walks or takes a taxi instead still stands a decent chance of arriving on time. A false negative is worse: the user will likely board the streetcar and risk arriving late. Since recall is the metric that captures false negatives, and XGBoost has clearly better recall results, XGBoost wins this category.
  • Training time is a draw. On a local system with no GPUs and a limited number of iterations, XGBoost trains faster. However, the Keras training time varies widely from run to run and depends on the patience parameter of the EarlyStopping callback, which controls how many epochs training continues once the monitored performance measurement, such as validation accuracy, is no longer improving. Because the Keras training time varies so much, I am calling this category a draw.
  • Code complexity is a draw. The Keras model has more complex code to build the layers of the functional model. However, as shown in the section above on refactoring the data used to train and test the model, XGBoost requires additional code to transform the data into the form that it expects. Because Keras has more complex code to build the model and XGBoost requires additional code to prepare the data, I am also calling this category a draw.
  • Keras is the winner for flexibility. The streetcar delay prediction problem is the subject of the extended example in the book Deep Learning with Structured Data, but the intention is that the code for the streetcar delay prediction problem could be applied to a broad variety of structured tabular datasets. In particular, if a column of the tabular dataset is identified as a free-form text column (for example, a description of an item in a retail site), then the Keras model will be automatically generated with layers to handle such a column. XGBoost does not have this ability to handle tabular datasets with continuous, categorical, and free-form text columns. I am asserting that the Keras approach has superior flexibility because it can handle a wider variety of tabular datasets.
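For reference on the training-time point above, the patience behavior comes from Keras's EarlyStopping callback. A minimal sketch of how it is typically configured (the monitored metric and patience value here are illustrative):

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training once val_accuracy has failed to improve for `patience`
# consecutive epochs, and roll back to the best weights seen so far.
early_stop = EarlyStopping(monitor="val_accuracy",
                           patience=10,
                           restore_best_weights=True)

# Passed to training as: model.fit(..., callbacks=[early_stop])
```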


In this article, I have described a comparison of two solutions to the streetcar delay prediction problem: one using XGBoost as the model, and the other using a Keras deep learning model. In this comparison I have kept the code for the two solutions as close as possible; I have only changed the parts of the code specifically related to the training and testing of the model. The results of the comparison show that XGBoost is better “out of the box” on raw performance, especially recall, and that Keras deep learning is more flexible.

Following are links to the code and initial dataset described in this article: