How time can ruin your most precious machine learning model

Original article can be found here (source): Artificial Intelligence on Medium

You learn a lot in a typical machine learning 101 course: different algorithms, feature engineering, methods like cross-validation to obtain reliable performance measures, and how to tune the hyperparameters of your algorithm. However, most of these introductory courses will not tell you about all the things that can go wrong when your data depends on time.

Now, you might think this does not apply to your problem and only concerns the typical forecast of an economic time series. You might be wrong. A large part of the real-world tasks the average data scientist solves today does depend on time.

Whenever you come across machine learning models that underperform once they are deployed, you should ask if and how time-dependent effects were taken into account when the model was developed.

Basic example: predicting fuel consumption

Let us start with a simple example. Our task is to predict the fuel consumption of cars during design and development for the R&D department. The customer provided us with data for a lot of different cars, and we discovered that engine capacity, transmission type and fuel type are important features for predicting a car’s consumption.

engine capacity, fuel type and transmission type are the most important features
fuel consumption increases with engine capacity

We validated our XGBoost model using k-fold cross-validation and got a superb mean percentage error (MPE) of about -0.95%. Everyone is excited, and the model is deployed to the users of the R&D department.
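The sign convention of the MPE matters here: unlike the MAPE, it keeps the sign of each error and therefore reveals systematic bias. A minimal sketch with made-up numbers:

```python
import numpy as np

def mean_percentage_error(y_true, y_pred):
    """Signed error metric: negative values mean systematic over-prediction."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) / y_true) * 100

# Predictions that are consistently 10% too high give an MPE of -10%:
print(mean_percentage_error([5.0, 6.0, 7.0], [5.5, 6.6, 7.7]))  # -10.0
```

Because positive and negative errors cancel, an MPE near zero can hide large individual errors; it is useful precisely for detecting bias, not overall accuracy.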

However, soon users start to complain that the predicted consumption values cannot be true and are unrealistically high.

The fuel consumption data depends on time

It turns out that our data contains training examples spanning more than 10 years. Once we add the time component of this data generating process, it becomes obvious that substantial progress in building fuel-efficient cars has been made over these years. However, our first model could not learn this progress because the corresponding features are missing.

average fuel consumption has decreased over the years

We correct the validation scheme to take time into account and replace the k-fold cross-validation with rolling origin forecast resampling. Using the adjusted validation scheme, we find that the users were right: our model’s mean percentage error (MPE) is actually 8 times worse than we thought. The model makes heavily biased predictions.

the rolling origin validation yields an MPE 8 times worse than the k-fold validation (-8.18% vs. -0.95%)
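To illustrate why the two validation schemes disagree, here is a hedged sketch on synthetic data (not the article's dataset), with scikit-learn's GradientBoostingRegressor standing in for XGBoost and TimeSeriesSplit as a simple rolling-origin splitter. The 2%-per-year efficiency trend and all magnitudes are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, TimeSeriesSplit

rng = np.random.default_rng(0)
n = 600
year = np.sort(rng.integers(2008, 2019, n))      # model years, oldest first
capacity = rng.uniform(1.0, 4.0, n)              # engine capacity in litres
# Efficiency improves roughly 2% per year, but 'year' is NOT a feature.
y = (3.0 + 1.8 * capacity) * 0.98 ** (year - 2008) + rng.normal(0, 0.2, n)
X = capacity.reshape(-1, 1)

def mpe(y_true, y_pred):
    return np.mean((y_true - y_pred) / y_true) * 100

def cv_mpe(splitter):
    scores = []
    for train_idx, test_idx in splitter.split(X):
        model = GradientBoostingRegressor().fit(X[train_idx], y[train_idx])
        scores.append(mpe(y[test_idx], model.predict(X[test_idx])))
    return np.mean(scores)

print(f"shuffled k-fold MPE: {cv_mpe(KFold(5, shuffle=True, random_state=0)):+.2f}%")
print(f"rolling-origin MPE:  {cv_mpe(TimeSeriesSplit(5)):+.2f}%")
```

The shuffled k-fold score stays close to zero because every fold mixes old and new cars, while the rolling-origin score is clearly negative: trained only on older, thirstier cars, the model systematically overpredicts the consumption of newer ones, just as the users observed.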

Now that we know our data depends on time, we can create features that enable the machine learning algorithm to make less biased predictions.
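As a small, purely illustrative sketch of such features (the `build_date` column and the cutoff date are assumptions, not part of the article's dataset), calendar and age features can be derived with pandas:

```python
import pandas as pd

cars = pd.DataFrame({
    "build_date": pd.to_datetime(["2010-03-01", "2015-07-15", "2019-11-30"]),
    "capacity": [1.6, 2.0, 1.4],
})
cars["model_year"] = cars["build_date"].dt.year
# An "age at prediction time" feature often keeps its meaning when new
# data arrives, unlike the raw calendar year.
cars["age_years"] = (pd.Timestamp("2020-01-01") - cars["build_date"]).dt.days / 365.25
print(cars[["model_year", "age_years"]])
```

Note that tree-based models such as XGBoost cannot extrapolate a trend beyond the training range of such features, so they help mostly within the observed time span.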

Less obvious examples where time might matter

The fuel consumption example above was quite obvious and many experienced data scientists would not fall into this trap.

However, some cases are less obvious, and the time dependency is not that easy to spot. Could even image classification depend on time? Sure, here are two examples:

  • Automated quality control using a CCD sensor
    The training data consists of 45,000 images that were captured during the evenings of January, with the shop floor lit by fluorescent tubes. Once the model is deployed, it will have to operate on images taken on a June morning with the sun shining through a roof window.
  • Assisting clinicians during radiographic staging
    Different typical activities during summer and winter lead to different injuries. How does this influence the data generation and class imbalance? How do the clinician and the model take these imbalances into account, and should they?

Most data generating processes are heavily influenced by humans and their environment. Therefore, they often show trends, cycles and strong daily, weekly and annual patterns. Basically, whenever the data generating process interacts with humans and cannot be executed automatically in an isolated laboratory, chances are high that some of these time-dependent components can be found and must be modelled.

How to check if your data depends on time

Whenever you are working on a machine learning problem and your data does not already include time, ask yourself the following questions:

  • What process generated the data?
  • How much time did it take to generate this data?
  • Was the environment of the data generating process stable during that time?
  • Was the process (and business) itself stable during that time?
  • Will the environment and the process (and business) stay stable in the future?
  • Are the statistical properties of the generated data stable over time?

These abstract questions can be supported by more concrete ones that check the process, environment and data for common time-dependent effects:

  • What effect could seasons, weather conditions and holidays have? Could this influence an annual pattern?
  • What effect could light conditions, noise and working hours have? Could this influence a daily or weekly pattern?
  • What effect could maintenance have on the data? Are there typical maintenance cycles?
  • What effect could social, scientific and economic progress (or change) have on the data? Have there been any trends or changes in the past or are they expected in the future?
  • What effect could demographic or organizational change have on your data? Have there been any changes in the past or are they expected in the future?

Once potential time-dependent effects have been identified, you should check whether these effects are present in the data and whether your data actually covers the time spans necessary to detect them.
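A quick way to run such a check is to aggregate the target over the suspected time unit and inspect the result. The column names and values below are illustrative, not the article's data:

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2010, 2010, 2014, 2014, 2018, 2018],
    "consumption": [8.1, 7.9, 6.8, 7.0, 5.9, 6.1],
})
# Average the target per year to make a trend visible.
yearly = df.groupby("year")["consumption"].mean()
print(yearly)
```

A steadily falling mean like this one is a strong hint that the data generating process drifts; the same groupby idea works for months, weekdays or hours to surface seasonal and daily patterns.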

What to do if your data depends on time?

If the data depends on time, there are three basic options for handling it.

  1. Exclude time effects by adjusting the environment or process. This is only viable if the output of the model does not have to include time effects like trend and seasonality. In the quality control example above, this could mean adjusting the shop floor setup to guarantee stable lighting conditions.
  2. Exclude time effects by pre-processing the data. This is also only viable if the output of the model does not have to include time effects like trend and seasonality. In the quality control example above, this could mean modifying the images to remove the influence of the lighting conditions.
  3. Include time effects by adjusting the machine learning model and data. Various modelling and feature engineering techniques can be applied to enable the machine learning model to include the time effects. If the output of the model depends on time effects, a time-aware validation scheme should be used. In the quality control example above, this could mean including various lighting situations in the training data and perhaps even generating features that make the model aware of the lighting conditions.
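For a tabular target like the fuel consumption example, option 2 can be sketched as removing an estimated trend before training. The linear trend model and all numbers here are illustrative assumptions, not the article's method:

```python
import numpy as np

years = np.array([2010, 2012, 2014, 2016, 2018])
consumption = np.array([8.0, 7.4, 6.9, 6.4, 6.0])   # litres per 100 km

# Estimate a linear annual trend and subtract it from the target.
slope, intercept = np.polyfit(years, consumption, 1)
detrended = consumption - (slope * years + intercept)

# The model would be trained on `detrended`; a prediction for a future
# year is re-trended by adding back slope * year + intercept.
print(f"estimated trend: {slope:+.3f} per year")
```

This only removes a linear trend; seasonality or non-linear drift would need a richer decomposition, and the residual model's output must always be re-trended before it reaches users.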

What experiences with time effects have you had in your machine learning projects? Would you like to read a story about the various modelling and feature engineering techniques and validation schemes that can include these time effects?

Thanks for reading and I’m looking forward to your comments!