Data Driven Decision Making? Meh

Fundamental limits of learning from data

There are at least three large problems with learning from data:

  • inferring causality
  • inferring long-term effects
  • forecasting the future

Inferring causality

The trite saying that correlation does not imply causation is especially familiar to economists, who are expected to give policy advice, i.e. to recommend actions that change outcomes for the better. This is very hard to do by just looking at the data (correlation).

Consider the task of bringing down crime rates through the optimal allocation of police stations. Across districts, you observe that the number of police stations is positively correlated with crime. A pure prediction model would imply that adding law enforcement results in higher crime rates, leading to badly erroneous decisions. What we actually need is a model of the impact of a change in police numbers on the change in crime rate: we want to know the treatment effect or uplift of a change in policy, i.e. how much better the policy is than no policy.

This requires either a randomised experiment set up by the policy maker or (more feasibly) a natural experiment found by clever researchers, one that quasi-randomly changes the number of police stations in some districts. The treatment effect can then be recovered by sampling two time points, before and after the “treatment”, and comparing the average change in crime rates between treated and untreated districts (this approach is called a difference-in-differences model).
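To make the mechanics concrete, here is a minimal sketch of the difference-in-differences idea on simulated data; the district counts, crime levels and the size of the true effect are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # districts

# Treated districts (those that get extra police stations) tend to be the
# high-crime ones, mimicking the positive correlation described above.
# All numbers are invented for illustration.
treated = rng.random(n) < 0.5
baseline = 50 + 15 * treated + rng.normal(0, 5, n)  # crime per 1,000 residents
common_trend = 2.0                                   # drift that affects everyone
true_effect = -5.0                                   # the policy lowers crime by 5

crime_before = baseline + rng.normal(0, 2, n)
crime_after = baseline + common_trend + true_effect * treated + rng.normal(0, 2, n)

# Naive comparison of treated vs untreated after the policy change: badly
# biased, because treated districts were high-crime to begin with.
naive = crime_after[treated].mean() - crime_after[~treated].mean()

# Difference in differences: compare the *change* over time in treated
# districts with the *change* in untreated districts.
change = crime_after - crime_before
did = change[treated].mean() - change[~treated].mean()

print(f"naive estimate: {naive:+.1f}   diff-in-diff estimate: {did:+.1f}   true effect: {true_effect:+.1f}")
```

The naive cross-sectional comparison points in the wrong direction precisely because high-crime districts attract more police stations, while the difference-in-differences estimate recovers the true effect (assuming treated and untreated districts share a common trend).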

Note the general implications of this: no matter how many petabytes of observational data on police stations and crime rates you gather, without either running a costly experiment or an ad hoc human-defined analysis, you can’t recover the treatment estimate (the true “knowledge” you are after). Scaling the amount of data alone does little to improve your decision making ability.

Suppose X causes Y, but W causes both X and Y. Inferring the impact of changes in X on Y from observational data will then fail due to what’s called a confounding effect, another obstacle to inferring causality. The most famous example has X as education level, W as inherent mental ability and Y as life outcomes. You will overestimate the treatment effect because, in the data, people with high W have high X (and high Y); simply assigning high X to people with low W will not work as well.
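A quick simulation makes the bias visible; the variables and coefficients below are made up purely to illustrate the education/ability/outcome story.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# W (unobserved ability) drives both X (years of education) and Y (outcome).
# Coefficients are invented purely for illustration.
W = rng.normal(0, 1, n)
X = 12 + 2.0 * W + rng.normal(0, 1, n)     # able people get more education
true_effect = 1.0                           # each extra year of X adds 1 unit of Y
Y = 5 + true_effect * X + 3.0 * W + rng.normal(0, 1, n)

# Naive slope of Y on X: what purely observational data gives you.
cov = np.cov(X, Y)
naive_slope = cov[0, 1] / cov[0, 0]

# Regression that also controls for W: only possible because W is observed here.
design = np.column_stack([np.ones(n), X, W])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)

print(f"naive slope: {naive_slope:.2f}   adjusted slope: {coef[1]:.2f}   true effect: {true_effect}")
```

The adjustment only works here because W is observed in the simulation; with real observational data the whole problem is that inherent ability is not measured.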

Knowledge is slow

How beneficial is a vegan diet to your body? This is a question I came across a while ago while trying to optimise my nutrition from a health perspective. After skimming through many articles on PubMed, I came to realise it is a very hard question. To estimate the true effect, you would have to randomly assign sufficiently large populations to a vegan diet (strictly defined) and a control diet (either what they would otherwise eat or some defined average diet) and enforce this for a long period, preferably a few years, and possibly for multiple generations (if you are afraid that the diet might have genotype-altering effects, as a vegan skeptic pointed out to me at a dinner party). In practice, such intervention studies are only run for a few weeks due to the prohibitive cost, and our limited knowledge is based on that.
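To get a feel for the cost, here is a rough back-of-the-envelope sample-size calculation using the standard normal-approximation formula; the assumed effect size is invented and deliberately small.

```python
from scipy.stats import norm

# Back-of-the-envelope sample size for a two-arm trial (normal approximation).
# Suppose the vegan diet shifts some health marker by 0.1 standard deviations
# (an invented, deliberately small effect), tested at the usual 5% significance
# level with 80% power.
effect_size = 0.1            # difference in means, in units of the outcome's SD
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
n_per_arm = 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2

print(f"participants needed per arm: {n_per_arm:.0f}")   # roughly 1,570
```

On the order of 1,500 participants per arm, kept on a controlled diet for years: that is the price of confidence about a small long-term effect.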

There is a fundamental tradeoff between attaining confidence in knowledge and the cost in time and resources. Let’s assume the best-case scenario: you are a Google search engine developer and want to test two separate ad placement options:

  • A — large and annoying
  • B — small and forgettable

You can randomly split incoming sessions between A and B (a technique called A/B testing, discussed in more detail below). By the simple clickthrough rate metric, A might be preferred to B. However, an obvious consequence of being exposed to annoying ad A is decreased retention: the user will choose an alternative search engine next time or install software that blocks ads. Clearly Google cares more about long-term retention than instantaneous clickthrough rate. To test for this properly, the developer should run an A/B test splitting on users instead of sessions over a longer period, to see in which group users start dropping off like flies. Ideally, the test would run for a very long time, up to the average lifetime of a user (prohibitively long for Google users). In practice, the developer might opt for a 2-week test, design some arbitrary compound metric based on clickthrough rate and retention (how many users keep coming back over the 2 weeks) and base her decision on that.
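For the short-term half of that decision, the clickthrough comparison itself is routine; a minimal sketch of a two-proportion z-test on invented session counts might look like this.

```python
import math
from scipy.stats import norm

# Invented session counts from a hypothetical session-level A/B split.
clicks_a, sessions_a = 5_400, 100_000    # A: large, annoying ad
clicks_b, sessions_b = 5_000, 100_000    # B: small, forgettable ad

p_a, p_b = clicks_a / sessions_a, clicks_b / sessions_b
p_pool = (clicks_a + clicks_b) / (sessions_a + sessions_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / sessions_a + 1 / sessions_b))
z = (p_a - p_b) / se
p_value = 2 * (1 - norm.cdf(abs(z)))

print(f"CTR A = {p_a:.2%}, CTR B = {p_b:.2%}, z = {z:.2f}, p = {p_value:.4f}")
# A "wins" on clickthrough rate, but this says nothing about whether the users
# shown A keep coming back next month: the long-term metric we actually care about.
```

The statistics of the session-level split are easy; the hard part is that a significant clickthrough difference says nothing about retention over months or years.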

Most metrics that we care about in real life are long term:

  • companies try to maximise lifetime profits/revenue
  • in life, you try to maximise the sum of happiness over time
  • governments (perhaps) try to maximise re-election probability

Testing the treatment effect then becomes tricky: your actions (for example, smoking) might benefit the short term while hurting the long term. Learning long-term effects is also difficult for computers: OpenAI Five, celebrated for employing strategic thinking in the game of Dota 2 (looking at the state of the game and determining which actions have the highest treatment effect towards winning it), had to train on 180 years' worth of games per day, a luxury not available when the data-generating process runs in real time.
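A toy calculation shows how the ranking of two options can flip with the evaluation horizon; the payoffs and retention probabilities below are invented.

```python
# Toy illustration: an option that looks better on a short horizon can lose
# badly on a long one. Payoff streams are invented for illustration only.

def cumulative_payoff(per_period_payoff, retention, horizon):
    """Sum payoffs over `horizon` periods, weighted by the probability
    `retention` per period that the user/customer is still around."""
    total, survival = 0.0, 1.0
    for _ in range(horizon):
        total += survival * per_period_payoff
        survival *= retention
    return total

# Option A: high immediate payoff, but annoys users away (low retention).
# Option B: lower immediate payoff, but users stick around.
for horizon in (2, 52):  # two weeks vs one year, in weekly periods
    a = cumulative_payoff(per_period_payoff=10.0, retention=0.80, horizon=horizon)
    b = cumulative_payoff(per_period_payoff=7.0, retention=0.98, horizon=horizon)
    winner = "A" if a > b else "B"
    print(f"horizon {horizon:>2} weeks: A = {a:6.1f}, B = {b:6.1f}  ->  {winner} wins")
```

On a two-week horizon the annoying option looks better; over a year the patient option wins by a wide margin.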

To further exemplify how difficult it is to incorporate long-term effects into decision making, consider the following anecdotes:

  • Smoking, today understood to be the largest contributor to premature deaths, required decades of research and a review of 7,000 articles by Surgeon General Luther L. Terry in 1964 to reach a definitive stage in US decision making. Even then, a special section of the report highlights that while most studies show an association between smoking and disease, it is much more difficult to prove a causal link (that smoking actually causes disease). Using the epidemiological method, which essentially means imposing hopeful assumptions on observational data, the report suggested a causal effect for lung cancer (in men only) and bronchitis but not for cardiovascular diseases.
Figure: per-capita cigarette consumption and male lung cancer incidence in the United States over time. From “Changing the Paradigm of Cancer Screening, Prevention, and Treatment”, scientific figure on ResearchGate: https://www.researchgate.net/figure/Per-capita-cigarette-consumption-and-male-lung-cancer-incidence-in-the-United-States-over_fig3_311239192
  • Lead has been known since antiquity to be a health hazard, yet it was introduced into petrol in the 1920s. Industry lobby groups suppressed research into its harmful effects for 50 years, until Herbert Needleman provided sufficient evidence that lead exposure lowers children’s intelligence. The industry tried to discredit these findings too, but was forced into decline after the EPA phased out leaded fuel between 1976 and 1996, reducing blood lead levels by 78% and crime rates by 34%. It took until 2011 for the UN to be able to declare leaded petrol phased out worldwide, almost 100 years after the product was introduced!
  • The long-term health impact of the Chernobyl disaster is still widely debated today. While there seems to be a strong link with increased thyroid cancer, estimates of the number of excess deaths range from 62 (UNSCEAR 2008) to 200,000 (Greenpeace 2006, not peer reviewed). The difference comes from the estimation of the long-term effect: UNSCEAR puts it at zero, while Greenpeace, which opposes nuclear energy, puts it high. The matter is entirely subjective: there is no scientific way to estimate a treatment effect that is not statistically significant.
  • 5G conspiracists undermine the technology by pointing out that no long-term studies of its health impact have been done. But there are no long-term studies for many modern changes in lifestyle either, for example the mental health effects of reading “alternative” news outlets.

Prediction is hard, especially about the future

Enough literature has been written on the subject: if you still believe that your pension fund manager can predict the market, that your local government official can foresee the direction of the economy, or that the geopolitical pundit on the evening news show can call the course of future conflicts with any degree of confidence, you are not only wrong, you should quickly revise your beliefs and stop placing trust where it is not warranted.

Any hope that big data and machine intelligence will solve forecasting and reduce the uncertainty of the future should be briskly wiped away. On the contrary, the network structure of the globalised economy and the increasing speed of innovation hint at the opposite: the future is becoming less and less foreseeable. Contrast this with the relative stability of peasant life in medieval Europe.

Nominal commodity prices reflect how generation upon generation lived at a similar level of technological innovation before the dawn of the modern era. Image: “When did globalisation start?”, The Economist

A well-known result of chaos theory is that it is quite useless to predict the weather more than a week or two ahead, due to the complex nature of the partial differential equations governing its evolution. We understand weather dynamics fairly well; it is merely the mathematical obstacle of instability that haunts our forecasts. The implications for something like the economy, a war or product-market fit should be clear: complexity-wise these systems have the same properties as weather, but unlike weather we have no physical model of their internal dynamics. To quote the later convicted investor Mark Hanna, immortalised by Matthew McConaughey in the movie The Wolf of Wall Street:

Number one rule of Wall Street: I don’t care if you’re Warren Buffett or if you’re Jimmy Buffet, nobody knows if a stock is gonna go up, down, sideways or in circles, least of all stock-brokers.
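To make the sensitivity argument above concrete, here is a tiny demonstration using the logistic map, a standard toy example of chaos rather than a weather or market model: two starting points that differ by one part in a million end up nowhere near each other within a few dozen steps.

```python
# Sensitivity to initial conditions, the instability mentioned above, in the
# simplest possible setting: the logistic map x -> r*x*(1-x) with r = 4,
# a standard chaotic regime.

def logistic_trajectory(x0, r=4.0, steps=30):
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_trajectory(0.200000)
b = logistic_trajectory(0.200001)   # "measurement error" of one part in a million

for t in (0, 5, 10, 20, 30):
    print(f"step {t:2d}: |difference| = {abs(a[t] - b[t]):.6f}")
# The gap grows from 1e-6 to order one within a few dozen steps: even a
# perfect model cannot outrun imperfect knowledge of the starting state.
```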