How data prep packages can introduce risk to your ML modeling

A case study on “vtreat”, an automated variable treatment package (Python)

Real-world data is often “dirty” or “messy.” Data scientists building machine learning (ML) models will frequently encounter data quality problems such as “bad” numeric values (e.g. None, NaN, or NA), missing values, and high-cardinality categorical features (e.g. ZIP code).

A data scientist handed such a dirty, messy dataset who wants to leverage their arsenal of ML packages, like scikit-learn or xgboost, to build the most performant model must first tediously clean the dataset and impute missing values to get it into the format those powerful packages require.

Fortunately, there are convenient data preparation (a.k.a. data prep) packages that automate and simplify data cleansing and data quality handling, with pleasant APIs that fit smoothly into data science workflows. However, while some packages seem to offer “one-liner” solutions to the problem of data prep, as the saying goes, there’s no such thing as a free lunch: in certain edge cases, data prep can introduce new risk to your model, such as overfitting or target leakage (a.k.a. data leakage).

The following is a brief case study on one such data prep package known as vtreat [1], created by Dr. John Mount and Dr. Nina Zumel from Win Vector LLC [2]. vtreat is available in R and Python in addition to having a research publication [3]. This article will focus on the Python package [4].

What does vtreat do and how should I use it?

As the authors describe, vtreat is a “dataframe processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.” The high-level goal of vtreat is to “faithfully and reliably convert data into a ready-for-ML dataframe that is entirely numeric and without missing values.” The specific data quality handling methods that vtreat includes are:

  • Fix bad/missing numerical values
  • Encode categorical levels as indicators
  • Manage high-cardinality categorical variables (a.k.a. “impact coding”)

The last method, “impact coding” (Kagglers may know it as “target coding” or “target encoding”), can be particularly helpful since it converts a categorical variable with many levels into an informative numeric feature. In short, first the global average of the target variable is computed across all rows (hence “target coding”); then, for each level of the categorical variable, the average target value within that level is compared against the global average, and the difference becomes the “numerical impact value” for that level. This preserves a good amount of the predictive information in the original variable while remaining computationally efficient. So let’s walk through how to properly use “impact coding”.
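
To make this concrete, here is a minimal pandas sketch of the idea. This is an illustration of the calculation rather than vtreat’s implementation, and the zip_code and price columns are hypothetical:

import pandas as pd
# Hypothetical toy data: predict price from a high-cardinality zip_code
df = pd.DataFrame({
    "zip_code": ["94105", "94105", "10001", "10001", "60601"],
    "price": [500_000, 520_000, 700_000, 680_000, 300_000],
})
grand_mean = df["price"].mean()                       # global average of the target
level_means = df.groupby("zip_code")["price"].mean()  # per-level average of the target
# Impact value = per-level target mean minus the global target mean
df["zip_code_impact"] = df["zip_code"].map(level_means) - grand_mean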

A data scientist who is familiar with .fit() and .transform() methods from related ML packages may use vtreat in the following manner to create a prepared dataframe to then be used in modeling:

import vtreat
# Naive pattern: fit the treatment and transform on the same rows
transform = vtreat.BinomialOutcomeTreatment(outcome_name='y', outcome_target=True)
df_prepared = transform.fit(df_train, df_train['y']).transform(df_train)

Careful! This is a naive use of vtreat: impact coding requires knowing the outcome in order to compute the “impact value”, so fitting the coder and transforming on the very same rows lets each row’s own target leak into its features. A model fitted to the resulting prepared dataframe can pick up undesirable bias, essentially overfitting to the impact-coded variable.
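
To see the leak concretely, consider a level that occurs only once: its in-sample impact value is derived from its own label, so the feature effectively memorizes the target. A hand-rolled illustration (not vtreat itself):

import pandas as pd
df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "rare"],  # "rare" occurs only once
    "y": [0, 1, 0, 1, 1],
})
grand_mean = df["y"].mean()
level_means = df.groupby("category")["y"].mean()
df["impact"] = df["category"].map(level_means) - grand_mean
# For the singleton level "rare", the impact value is exactly its own
# label minus the grand mean (1 - 0.6 = 0.4): the feature has
# memorized the target for that row.
print(df[df["category"] == "rare"])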

To circumvent this, as the authors of vtreat carefully describe in their documentation, it is important to set aside a partition of the full dataset (i.e. a holdout/calibration set) to fit the impact coder, then run the “calibrated impact coder” on the training partition to prepare the data, and finally fit the model to the resulting prepared dataframe. The following illustrates how to correctly use vtreat with a holdout/calibration set:

import vtreat
from sklearn.model_selection import train_test_split
# Calibration partition fits the impact coder; modeling partition gets transformed
df_train_holdout, df_train_model = train_test_split(df_train, test_size=0.7, random_state=0)
transform = vtreat.BinomialOutcomeTreatment(outcome_name='y', outcome_target=True)
df_prepared = transform.fit(df_train_holdout, df_train_holdout['y']) \
                       .transform(df_train_model)

Furthermore, the authors have ensured that correct cross-validation methods are used by leveraging a novel technique they call a “cross-frame” [5] (i.e. a cross-validated training frame). In the end, the authors recommend the following calling pattern, which uses the built-in guardrail method .fit_transform() when preparing data for ML modeling:

import vtreat
# Recommended pattern: .fit_transform() builds a cross-frame internally
transform = vtreat.BinomialOutcomeTreatment(outcome_name='y', outcome_target=True)
df_prepared = transform.fit_transform(df, df['y'])
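
For intuition, the effect of a cross-frame can be approximated by hand with out-of-fold impact coding, where each row is encoded using statistics computed only from the other folds. The sketch below is my own simplification for illustration, not vtreat’s actual implementation (see [5] for the real method):

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
def out_of_fold_impact(df, cat_col, y_col, n_splits=5):
    """Encode each row using target statistics from the other folds,
    so no row's impact value depends on its own outcome."""
    impact = pd.Series(np.nan, index=df.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fit_idx, code_idx in kf.split(df):
        fit_fold = df.iloc[fit_idx]
        grand_mean = fit_fold[y_col].mean()
        level_means = fit_fold.groupby(cat_col)[y_col].mean()
        coded = df.iloc[code_idx][cat_col].map(level_means) - grand_mean
        impact.iloc[code_idx] = coded.to_numpy()
    return impact.fillna(0.0)  # levels unseen in the fitting folds get zero impact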

The resulting dataframe is completely numeric, while preserving as much relevant modeling information from the training variables as possible. This allows a data scientist to use any off-the-shelf ML package and rapidly build potentially more accurate models than if using only the raw dataset.
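
For example, the prepared dataframe can be fed straight to a scikit-learn estimator. A minimal sketch, assuming the df_prepared and binary outcome column 'y' from the code above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# df_prepared is entirely numeric with no missing values, so it drops
# straight into scikit-learn with no further cleaning
model = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(model, df_prepared, df["y"], cv=5, scoring="roc_auc")
print(scores.mean())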

Do you trust your data prep?

After all is said and done, vtreat helps save a lot of time and effort for data scientists by automatically preparing a structured numeric-only dataframe for modeling. Let’s put this to the test on an example dataset where the target variable to predict is the sale price of a manufacturer’s machine.

The original dataset of already-curated, informative features has 57 columns. After running it through vtreat and specifying the NumericOutcomeTreatment() method, the prepared dataframe ended up with over 230 columns. I built two AutoML projects, one on the original dataset and one on the vtreat-prepped dataset: the former achieved an R² holdout score of ~0.91, while the latter achieved ~0.86. That is, for this particular example dataset, the prepared data gave a poorer model fit score.

Acknowledgement: the AutoML tool used was the DataRobot Platform, which comes with built-in preprocessing and feature engineering.

There are a few plausible reasons why the prepared dataset scored worse: residual overfitting to the impact-coded variables (despite using the built-in .fit_transform() method), target leakage amplified from leaks already present in the raw data, or redundancy among the many generated features.

In summary, both the authors and I strongly advise data scientists to still get their hands dirty and investigate not just the underlying raw data but also the prepared data that comes out of data prep packages. Always pay attention to the context of the data, both its provenance and how the predictions will be applied, especially if the models will inform real-world decisions.

References

  1. https://github.com/WinVector/vtreat
  2. https://win-vector.com/
  3. Zumel, N.B., & Mount, J. (2016). vtreat: a data.frame Processor for Predictive Modeling. arXiv: Applications. (https://arxiv.org/abs/1611.09477)
  4. https://pypi.org/project/vtreat/
  5. https://cran.r-project.org/web/packages/vtreat/vignettes/vtreatCrossFrames.html