On the Automation of Time Series Forecasting Models: Technical and Organizational Considerations.

Source: Deep Learning on Medium

This post is an elaboration on a reply that I originally posted to a question on Cross Validated (Stack Overflow’s sister site for statistics and data science topics).

The original question was:

I would like to build an algorithm that would be able to analyze any time series and “automatically” choose the best traditional/statistical forecasting method (and its parameters) for the analyzed time series data.

Would it be possible to do something like this? If yes, can you give me some tips on how this can be approached?

TL;DR: Yes, it is.

It is true that most introductory text books and tutorials on time series forecasting dedicate a lot of space to explaining the intricacies of deciding which model best fits your data. From this, you would be forgiven for getting the impression that a day in the life of a practicing forecaster consists of staring at ACF and PACF plots, toying around with various ARIMA parameters until she finds the best model, or scratching her head trying to figure out whether to use an exponential smoothing model with trend only or an exponential smoothing model with trend + seasonality.

On behalf of the forecasting community at large, I apologize for the misdirection, because that is not the case, and hasn’t been for a while. Professional forecasters have much better things to do with their days than tearing their hair out over indecipherable PACF plots. Nowadays, the forecast modeling part is almost completely automated, and they are more likely to spend their time using domain knowledge to review the output and decide whether to go with the forecast or not, or in meetings trying to convince skeptical business stakeholders that their forecasts are reasonable. In fact, most professional forecasting tools (and even some open source tools) have had the parameter selection part of the forecasting process fully automated for quite a while.

In my experience, interpreting real world ACF and PACF plots is one of the most confusing things you have to deal with when beginning your time series forecasting journey.

In this post I will go over the technical aspects of automatic forecast generation, as well as some of the organizational considerations that will arise when deciding to go with an automatic forecast generating system.

Automated time series forecasting as an exercise in model selection:

As I said earlier, in many fields, including my field of retail demand forecasting, most commercial forecasting packages do perform automatic forecast generation. Several open source packages do so as well, most notably Rob Hyndman’s auto.arima() (ARIMA) and ets() (exponential smoothing) functions from the open source forecast package in R. There is also a Python implementation of auto.arima called Pyramid (since renamed pmdarima), and several other Python packages are in the works.

Both the commercial products and the open source packages that I mentioned work based on the idea of using information criteria to choose the best forecasting model: you fit a bunch of models, then select the one with the lowest AIC, BIC, AICc, etc. (typically this is done in lieu of out of sample validation — see this presentation for details).

An example of the AIC and BIC selecting the “true” model of our data: we generate a data set using a 6th order polynomial with Gaussian noise, and then notice that both the AIC and BIC are at their lowest for the polynomial regression of order 6, i.e. they allow us to select the true model, even though polynomials of order 8 and 9 give us lower RMSE.
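
The experiment described above can be sketched in a few lines of Python. This is a minimal sketch, assuming the standard Gaussian-likelihood AIC of the form n·ln(RSS/n) + 2k; the coefficients and noise level are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate data from a 6th order polynomial with Gaussian noise
x = np.linspace(-1, 1, 200)
true_coeffs = [0.5, -2.0, 1.5, 3.0, -1.0, 2.0, -0.5]  # order 6
y = np.polyval(true_coeffs, x) + rng.normal(scale=0.1, size=x.size)

def gaussian_aic(y, y_hat, n_params):
    """AIC for a least-squares fit with Gaussian errors."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + 2 * n_params

# Fit polynomials of order 1 through 9 and score each with the AIC
aics = {}
for order in range(1, 10):
    coeffs = np.polyfit(x, y, order)
    y_hat = np.polyval(coeffs, x)
    aics[order] = gaussian_aic(y, y_hat, n_params=order + 1)

best_order = min(aics, key=aics.get)
```

Note that higher orders always reduce the in-sample RSS; the AIC’s penalty term (2 per parameter) is what prevents it from simply picking the largest model.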

There is however a major caveat: all of these methods work within a single family of models. They choose the best possible model amongst a set of ARIMA models, or the best possible model amongst a set of exponential smoothing models, etc…

It is much more challenging to do so if you want to choose from different families of models, for example if you want to choose the best model from among ARIMA, exponential smoothing, and the Theta method. In theory, you can do so the same way you do within a single family of models, i.e. by using information criteria. In practice, however, you need to calculate the AIC or BIC in exactly the same way for all models considered, and that is a significant challenge. It might be better to use time series cross-validation, or out of sample validation, instead of information criteria, but that will be much more computationally intensive (and tedious to code), not to mention the question of which forecast horizon to cross-validate against.
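
A minimal sketch of what cross-family selection via time series cross-validation could look like. The three “families” here are deliberately trivial stand-ins (naive, mean, drift) for ARIMA, exponential smoothing, Theta, etc., and all names are made up for illustration:

```python
def naive_forecast(history):
    # Forecast the last observed value
    return history[-1]

def mean_forecast(history):
    # Forecast the historical average
    return sum(history) / len(history)

def drift_forecast(history):
    # Last value plus the average historical step
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + slope

def rolling_origin_mae(series, forecaster, min_train=10):
    """One-step-ahead MAE over an expanding training window."""
    errors = []
    for t in range(min_train, len(series)):
        pred = forecaster(series[:t])
        errors.append(abs(series[t] - pred))
    return sum(errors) / len(errors)

# Toy series: linear trend plus an alternating zigzag
series = [10 + 0.5 * t + (1 if t % 2 else -1) for t in range(40)]

candidates = {"naive": naive_forecast, "mean": mean_forecast, "drift": drift_forecast}
scores = {name: rolling_origin_mae(series, f) for name, f in candidates.items()}
best = min(scores, key=scores.get)
```

The point of the rolling origin is that every error is computed out of sample, so models from different families can be compared on the same footing, at the cost of refitting each candidate at every origin.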

The “One model to rule them all” approach:

Facebook’s Prophet package also makes it possible to automate forecast generation, based on generalized additive models (see here for details). However, Prophet fits only one single model, albeit a very flexible model with many parameters. Prophet’s implicit assumption is that a GAM is the “One model to rule them all”, which might not be theoretically justified but is very pragmatic and useful for real world scenarios. More specifically, the basic assumption underlying the Prophet model is that most useful real world time series do not contain any structure beyond trend, seasonality, and causal/holiday effects, and that we don’t gain much by trying to mine them for complex auto-correlations the way a non-trivial ARIMA model does. Besides making it easy to strike a balance between automation and flexibility, Prophet’s trend + seasonality + causal effects approach makes it very convenient for communicating results to non-technical audiences. (A brief note for students and beginners in the field, so brace yourselves: haggling with the business over the validity of your forecasts will consume a significant portion of your day. Brush up on your soft skills while you are at it.)

The Facebook Prophet API makes it very easy to communicate the results of a forecast to business stakeholders, compared to, say, explaining what an ARIMA(2,1,2) model does.

One might ask: but isn’t that what triple exponential smoothing (Holt-Winters) does? Fit a trend component and a seasonal component and leave it at that? Are we just using a Facebook tool for the hype? Not quite: Prophet is a bit more sophisticated in that regard. It can model break points in the trend, and multiple seasonalities (for example, daily and monthly peaks in a customer time series). It can also handle causal events well, which exponential smoothing can’t (you would have to add the causal effects as some sort of post-processing step). Part of Prophet’s flexibility comes from the fact that it is not “just” a GAM model: under the hood, there is a lot of Bayesian heavy lifting going on, including in the causal modeling part.
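
To illustrate just the trend-changepoint idea (not Prophet’s actual implementation, which works in a Bayesian setting with sparsity priors on the changepoint deltas): a piecewise linear trend can be fit by ordinary least squares by adding one “hinge” feature max(0, t − c) per candidate changepoint c. The data and changepoint below are made up for illustration:

```python
import numpy as np

def piecewise_trend_fit(t, y, changepoints):
    # Design matrix: intercept, slope, and one hinge feature per changepoint
    X = np.column_stack(
        [np.ones_like(t), t] + [np.maximum(0.0, t - c) for c in changepoints]
    )
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

t = np.arange(100, dtype=float)
# Toy series: slope 1.0 up to t=50, then slope -0.5 afterwards
y = np.where(t < 50, t, 50 - 0.5 * (t - 50))

fitted = piecewise_trend_fit(t, y, changepoints=[50.0])
```

Because the hinge basis can represent this piecewise linear series exactly, the least squares fit recovers it to numerical precision; on noisy data you would regularize the hinge coefficients to avoid overfitting many candidate changepoints.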

The No Free Lunch Theorem, the time series forecasting edition (sort of):

Presumably, one of the main reasons you want to do automated time series forecasting is that you have to forecast multiple time series, too many to analyze manually (the tool my team uses generates millions of forecasts daily — one for each of our product/location combinations). Therefore your automated forecasting procedure must be able to fit different types of time series with different business scenarios. In the general ML case, there is a theoretical result called the No Free Lunch Theorem: there is no such thing as a “supervised ML model to rule them all”, one that can generate the best out of sample predictions without making any assumptions about the structure of the data set. So if you’re going to throw a generic ML model, say a feed forward neural network, at your problem space, and hope that it will work for every possible data configuration that can occur, you will also have to accept that sometimes there will be a model that performs better than the one you ended up with. That is the price you pay for having one generic model for everything, hence “no free lunch”.

Something similar happens with time series models: You need to keep in mind that an automated forecasting approach is never going to find the absolute best model for each and every single time series — it is going to give a reasonably good model on average over all the time series, but it is still possible that some of those time series could have better models than the ones selected by the automated method.

See this post for an example of this. To put it simply, if you are going to go with automated forecasting — you will have to occasionally tolerate “good enough” forecasts instead of the best possible forecasts for each time series. That is the price you pay for flexibility and robustness.

In this case, manually selecting higher order ARIMA parameters — i.e. with higher AIC and BIC — gives better out of sample forecasts than the model automatically selected by auto.arima(), but the one selected using the lowest AIC and BIC is still pretty close to the ground truth, compared to, say, a non-seasonal model (seasonality and trend were not passed in any way to auto.arima(); it figured them out on its own).

ML based forecasting – things get a lot more complicated, yet paradoxically simpler:

As long as you stick to statistical methods for time series, automatic forecasting remains, if not an easy problem, at least an approachable one. The problem amounts to a statistical model selection question (or a curve fitting exercise in the case of FB Prophet), which has some solid theoretical foundations and is discussed and explored in several graduate level text books.

If you are planning on using ML based forecasting models, then the issue becomes a case of the more general problem of auto-ML (automated machine learning). Neither the academic literature nor the technology is as mature when it comes to hyper-parameter tuning and auto-ML as applied to time series forecasting (as always, NLP and computer vision get all the spotlight and attention first…). There are some interesting auto-ML and Bayesian optimization tools that can be used for the task, as well as a couple of commercial products (e.g. Google Vizier), but there are still many open questions regarding Bayesian optimization and transfer learning in the time series case.

When using a machine learning forecasting method like seq2seq models, you generally apply one ML model to an entire group of time series, as opposed to fitting one model per series.

Additionally, with ML based methods, you are most likely going to develop single large global models covering multiple time series (for example, in a retail context, one model per department), as opposed to having a specific model for each individual time series. This makes feature selection and engineering more challenging as well.
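
As a sketch, the “one global model” setup amounts to pooling supervised rows from all series into one training set, with a series identifier as a feature. The function name and feature layout below are illustrative, not a standard API:

```python
def make_global_training_set(series_by_id, n_lags=3):
    """Pool lag-feature rows from many series into one (X, y) data set."""
    X, y = [], []
    for series_id, values in series_by_id.items():
        for t in range(n_lags, len(values)):
            lags = values[t - n_lags:t]          # lag features
            X.append([series_id] + list(lags))   # series id as a categorical feature
            y.append(values[t])                  # one-step-ahead target
    return X, y

# Two toy series; a real retail data set would have thousands per department
series_by_id = {
    0: [1, 2, 3, 4, 5, 6],
    1: [10, 9, 8, 7, 6, 5],
}
X, y = make_global_training_set(series_by_id)
```

Any regressor (gradient boosted trees, a neural network, etc.) can then be trained once on the pooled (X, y), instead of fitting thousands of per-series models.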

Finally, ML models are not as easy to explain and defend in front of a business audience. Even a trained data scientist would need to resort to model interpretability tools in order to understand some of the outputs of an ensemble of tree-based regressors or a complex seq2seq network.

I also said that it will make things paradoxically simpler. How so? Well, unlike statistical methods, which need to be retrained every single time we want to generate a new forecast, in this case we can get away with retraining the model only once every few months. This is because a global model is trained on a much larger chunk of data, and in a sense “has already seen it all” during its training phase, so it can figure out how a time series will behave in the long run without the need for constant updates. (BTW, if you don’t have a large data set, stay away from ML based methods and stick to statistical methods. I know, it’s not as impressive in front of your boss and your clients, and you really, really wanted to use PyTorch for your project, but trust me, you’ll thank me in the long run.)

This difference (having to refit the model every time vs. having to do it once every few weeks or months) is very significant, and goes way beyond the simple question of how often you will have to go through the training and evaluation process:

  • From an engineering point of view, the architecture of a production statistical forecasting system will be radically different from the architecture of an ML based production forecasting system. In the first case, your model fitting compute resources will have to live within your production forecast engine, since the model fitting stage and the prediction stage are so tightly coupled. This puts all sorts of demands on the performance requirements of the environment you will use. In the second (ML) case, you can separate the training compute resources from the prediction compute resources, and the performance requirements of your system are not as stringent.
  • In the ML case, the fact that you only have to train the model every 3 to 6 months means that you have ample time to perform the training offline, and the process doesn’t need to be in perfect sync with a daily production schedule. This gives both the data scientists and the engineering team a lot of leeway in terms of SLAs: being able to start over if they don’t like the results, deciding whether they want to include new features in the model, etc.
  • In the statistical case, you have no such luxuries. Since you have to retrain so often, you are likely going to be facing strict SLA constraints (e.g. you only have 8 hours at night to generate new forecasts). Moreover, your model selection process has to be very robust, since you will not be able to rerun it if you want to meet your SLA.

This is why, even though using an ML approach to forecasting is conceptually more difficult, and has yet to reach maturity in terms of best practices and guidelines for automated parameter and feature selection, it does make things easier from an engineering point of view, given that it places fewer demands on the production forecast engine and is more flexible from an SLA point of view.
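
The train/predict decoupling described above can be sketched as follows. The artifact filename and the stand-in “model” are made up for illustration; a real system would persist a trained regressor:

```python
import pickle

def train_offline(training_rows):
    # Stand-in "model": just a mean of the training data (a real system
    # would train a gradient boosted tree or a neural net here), then
    # serialize the artifact so a separate process can load it later.
    model = {"mean": sum(training_rows) / len(training_rows)}
    with open("model_artifact.pkl", "wb") as f:
        pickle.dump(model, f)

def predict_online(horizon):
    # The production engine only loads the artifact and predicts;
    # no model fitting happens inside the daily forecast run.
    with open("model_artifact.pkl", "rb") as f:
        model = pickle.load(f)
    return [model["mean"]] * horizon

train_offline([10, 12, 11, 13])   # runs offline, every few months
forecast = predict_online(horizon=3)  # runs daily, cheap and fast
```

Because the two functions share nothing but the serialized artifact, the training cluster and the production forecast engine can be sized, scheduled, and upgraded independently.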

It’s the year 2019, and we are still channelling the utterances of an 18th century cleric… will Bayes ever become obsolete?

“Human in the loop” and the two types of forecasting organizations:

If your organization needs to perform automated time series forecasting, there will almost always be a need for monitoring as well, since no forecasting engine is so good that it can be “completely trusted” with the forecasts. Therefore some mechanism for human analyst intervention is necessary. Moreover, whatever downstream systems are consuming your forecasts, you need to have logic in place to ensure that the right forecasts are being fed to them. Various alerting mechanisms and sanity checks need to be implemented.
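
As an illustration of the kind of sanity check such a mechanism might include (the threshold and window size here are arbitrary choices for the sketch, not a standard recipe):

```python
def flag_suspicious_forecasts(history, forecasts, max_ratio=3.0):
    """Flag forecast indices that deviate from the recent average
    by more than max_ratio in either direction."""
    window = history[-28:]                      # recent history window
    baseline = sum(window) / len(window)
    flagged = []
    for i, f in enumerate(forecasts):
        if baseline > 0 and (f / baseline > max_ratio or f / baseline < 1 / max_ratio):
            flagged.append(i)                   # route to a human analyst
    return flagged

history = [100, 105, 98, 102, 99, 101, 103]
forecasts = [104, 450, 97]   # the 450 is the kind of overshoot to catch
flagged = flag_suspicious_forecasts(history, forecasts)
```

Real systems layer many such checks (negative values, missing periods, sudden level shifts), with the flagged forecasts routed to analysts instead of flowing straight into downstream systems.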

This leads to the question: who is accountable for the forecast when the model is automatically generated? If a forecast overshoots by 350% and the company incurs a significant loss as a result, which person or team should answer for it? (Mind you, this isn’t about disciplining or firing people; it is about learning from mistakes, avoiding similar ones in the future, and identifying the root causes of anomalies.)

In my experience, for companies that do a lot of forecasting at scale, there are two types of forecasting organizations:

  • In the first type of organization, the forecasting function is carried out by data scientists, i.e. people with formal statistical and technical knowledge of the algorithms they are running. In this type of org, the people who produce the models are also the people who consume the models: The data scientist that decides which model to use (ARIMA? Prophet? LSTM? etc…), is also the person ultimately responsible for the forecast numbers and communicating them to the business stakeholders and to leadership. From the software side, they might work with an ML and data engineering team on the purely technical aspects of running the model in production, but they make the model design decisions and scheduling decisions and they typically know enough Python, R, or SAS to run the models themselves. In short, the person who produces the forecasting model is also the person who is accountable for the forecast.
  • In the second type of organization, there is a separation between the person who produces the forecasting model and the person who consumes it: a data scientist or machine learning engineer will be responsible for tuning, deploying, and running the model (which might have been developed in-house or purchased from a software vendor or consulting company). A team of business forecasters, who bring a lot of domain knowledge to the table (retail, supply chain, finance, etc.) but typically have little or no technical and science skills, will then review the forecasts and decide whether to go with the automatically generated forecasts or to adjust them manually (i.e. perform a judgmental forecast based on domain knowledge). They are the ultimate owners of the forecast and the ones who will be held accountable for the quality of the predictions.

Which of these two models your organization works with will obviously be a deciding factor in how you approach the problem of automated forecast generation:

  • In the first type of organization, you have some leeway in terms of forecast automation, in the sense that data scientists will have the necessary skills to be able to rerun forecasts as needed, perform out of sample testing to validate models, make modifications to the model on the fly, work directly with the engineering team that runs the models in production, etc….the flip side of this coin is that data scientists rarely come with the necessary business and domain knowledge to reliably make judgmental forecasts and override the output of the model if needed. They are typically closer to the tech org, or are part of an independent Analytics/A.I. org, and are not close enough to the business org to work effectively on making purely domain knowledge driven judgement calls.
  • In the second type of organization, your forecasters are usually part and parcel of the business org, and hence are perfectly equipped to make domain knowledge and market driven adjustments to the forecasts that aren’t always captured by statistical or ML models. This, however, leads to several constraints from a forecast automation point of view: your forecast automation needs to be much more reliable as a software product, and you need a much more robust pipeline and visualization infrastructure built around your models that can be used by non-technical forecasters. You are now in an environment where you don’t have the luxury of just rerunning a model ad hoc if the numbers seem off. You are (usually) limited to statistical models that are easily interpretable (in terms of seasonal effects, trends, causal variables, etc.). Upgrading to newer forecasting models as appropriate, and ensuring business adoption once that is done, is a much more difficult process in this second scenario.

Each type of org has its advantages and drawbacks, but which type of org you are working with will be crucial in deciding the forecast automation framework you need to develop (or implement with an off the shelf product).

To summarize:

  • Yes, automatically selecting a forecasting model is possible. In fact, tools to do so have been around for quite a long time, at least for statistical forecasting models.
  • For statistical models, using information criteria like the AIC is one way to automate the process. It is also possible to go with a “One model to rule them all” approach like Facebook Prophet.
  • In an automated forecasting scenario, you will have to compromise and accept that some of your forecasts will only be good enough. That is the price you pay for an approach flexible enough to handle large time series data sets.
  • If you choose to go with an ML based forecasting method, you will have to use an auto-ML based approach (i.e. Bayesian Optimization + Transfer Learning) for model selection. Auto-ML approaches are not as well understood in the context of time series as traditional model selection methods are, but they have other advantages from an engineering point of view.
  • When the forecast is automatically generated, the question of forecast ownership, i.e. who is held accountable for the quality of the forecasts, arises. In some organizations, the data scientists themselves are accountable for the forecasts; in others, business forecasters (with no technical or math skills) review the forecasts and have the final say on whether they should be kept or overridden. Which of these two organizational models you are trying to serve will impact the way you design your automated forecast generation procedure.

It should be noted that a lot of the software considerations I mentioned in the case of statistical forecasting methods (performance requirements, increased robustness and reliability, etc.) should be addressed by the software vendor you selected if you end up going with a prepackaged solution. Those considerations become a much bigger issue if you choose to build your own statistical forecasting solution in-house. The trade-off is that with a prepackaged solution, you are usually tied to whatever model or models the vendor chose to include in their package. In some cases, you don’t even have the details of the model, only a high level description of what it does, and your job as an ML/DS person within your company is reduced to building pipelines and ensuring that proper monitoring and dashboarding are in place. Fortunately, this has started to change, and more and more vendors are “opening up” their forecasting solutions, so that in-house data science teams can deploy tweaked or custom built models.

Finally, although I mentioned that ML based forecasting approaches are not as mature as statistical methods, several reports in the academic literature have discussed the topic. Moreover, several tools have become available in the last year or so that can help a data science team test the benefits of ML and DL based forecasting approaches. I look forward to exciting developments on this topic in the near future.