How can Machine Learning algorithms include better Causality?

Original article can be found here (source): Artificial Intelligence on Medium

In recent years, machine learning algorithms have enjoyed great success. Thanks to the availability of large amounts of data and faster computation, they have outperformed traditional statistical methods on many prediction tasks.

Nevertheless, as I learned more about how they work and how to apply them, I came across a surprising fact: most of these algorithms focus on making the most accurate predictions or classifications rather than on establishing cause-and-effect relationships.

And yet, this kind of relationship can be crucial in decision making, especially in the health, social and behavioral sciences.

Consider the following questions:

  • How effective are nicotine patches in reducing people’s smoking habits?
  • What is the impact of a renewal policy on the development of deprived areas?
  • By how much did the last marketing campaign contribute to sales growth?

You can see that these are causal questions rather than associative ones. Answering them requires not only establishing the cause-and-effect relationship but also quantifying it.

Most of the time, experimental interventions are used: analysts carry out surveys, gather data and analyze it with sophisticated statistical methods. However, these experiments can be costly both in terms of time and money and even raise ethical questions in some cases.

Is there an alternative?

In this article, I will share with you my key findings about some important causal modeling tools such as structural models, causal diagrams and their associated logic.

After reading this article, you will learn:

  • What are the different levels of causal inference?
  • How to learn a causal structure through graphs?
  • How to quantify causal effects?

Correlation does not imply causation

Before getting started, it is essential to revisit the well-known adage: correlation is not causation. It means that you cannot legitimately deduce a cause-and-effect relationship between two variables only because you have observed a correlation between them.

To illustrate this point, let’s consider the following graph published by Messerli in 2012 in his paper Chocolate Consumption, Cognitive Function, and Nobel Laureates (full paper here).

As pointed out by the author, there is a strong correlation between a country’s chocolate consumption per capita and its number of Nobel laureates per capita, which he uses as a proxy for the population’s cognitive function (r=0.791, p<0.0001). Does it mean that eating more chocolate would make you smarter? Unfortunately, probably not!

To put it more precisely, if two random variables X and Y are statistically dependent (X ⊥̸ Y), then either:

  • X causes Y,
  • Y causes X, or
  • there exists a third variable Z that causes both X and Y. In that case, X and Y become independent given Z, i.e. X ⊥ Y | Z

This is known as the Common Cause Principle, introduced by Hans Reichenbach in 1956.
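The third case can be made concrete with a small simulation. The sketch below is only an illustration under assumed settings (a linear model with Gaussian noise, made-up coefficients, and NumPy as the tool of choice): Z causes both X and Y, X and Y never influence each other, and yet they are clearly correlated until Z is accounted for.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical linear model: Z is the common cause of X and Y.
z = rng.normal(size=n)
x = z + rng.normal(size=n)   # X depends on Z only
y = z + rng.normal(size=n)   # Y depends on Z only


def residual(a, b):
    """Part of `a` that is left after a least-squares fit on `b`."""
    slope, intercept = np.polyfit(b, a, 1)
    return a - (slope * b + intercept)


# Plain correlation between X and Y (theoretical value here: 0.5).
marginal_corr = np.corrcoef(x, y)[0, 1]

# Correlation once the influence of Z is removed from both variables.
partial_corr = np.corrcoef(residual(x, z), residual(y, z))[0, 1]

print(f"corr(X, Y)     = {marginal_corr:.3f}")
print(f"corr(X, Y | Z) = {partial_corr:.3f}")
```

In this linear Gaussian setting, a partial correlation close to zero is the empirical counterpart of X ⊥ Y | Z.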

Therefore, in order to truly investigate the impact of chocolate consumption, it would be necessary to carry out an experiment. This would require, for instance, forcing a country to eat more chocolate and observing whether it leads to a higher number of Nobel laureates.

As you can see, experimentation can simply be impossible. It can even raise ethical questions when it comes to health issues. Consequently, other tools are required to establish causality.

The ladder of causal inference

As developed by Pearl in his work on causal reasoning, causal information can be classified into a three-level hierarchy.

  1. Association
  2. Intervention
  3. Counterfactual

This hierarchy offers useful insight into the kind of questions each class of information can answer.

Let’s further develop each level…

Association or the action of Seeing

It is the first level and thus the most basic one. It relies on purely statistical relationships in the available data.

For instance, customers who buy flour are more likely to also buy butter. This kind of association can be established directly from the observed data using conditional probabilities and expectations. If x is the quantity of flour bought and y the quantity of butter, then we are able to compute P(y|X=x) from the data.

Current machine learning methods are perfectly suited to this kind of task. One can think of the effectiveness of recommendation engines used by Amazon and similar companies. However, their results tell us little about the actual causal relationships between variables.
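As a minimal sketch of what "computing P(y|X=x) from the data" means, the conditional probability can be estimated as a simple ratio of counts. The basket data and the helper name conditional_prob below are invented for illustration only.

```python
# Hypothetical observed baskets: (units of flour, units of butter).
baskets = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (2, 2), (2, 1)]


def conditional_prob(data, y, x):
    """Estimate P(Y = y | X = x) as a ratio of observed counts."""
    butter_when_x = [butter for flour, butter in data if flour == x]
    if not butter_when_x:
        raise ValueError(f"no basket with X = {x} was observed")
    return butter_when_x.count(y) / len(butter_when_x)


# Share of one-flour baskets that also contain exactly one butter.
print(conditional_prob(baskets, y=1, x=1))
# Same question for baskets without any flour.
print(conditional_prob(baskets, y=1, x=0))
```

Nothing in this computation says why flour and butter go together; it only summarizes what was seen.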

Intervention or the action of Doing

This level ranks higher than Association because it involves not only observing data but also changing it.

For instance, in our previous example, an intervention would be needed to answer the following question: what would happen if we tripled the price of flour?

Since such a price increase would probably change customers’ behaviour, it is impossible to answer this question from the observed data alone. If x is the quantity of flour bought and y the quantity of butter, then we want to compute P(y|X=do(x)), where do denotes that we have intervened to set the value of X to x.

It is important to keep in mind that when there is a confounder, i.e. a variable that influences both the dependent variable and the independent variable, we have in general:

P(y|X=x) ≠ P(y|X=do(x))

To illustrate this point, suppose that customers buy flour and butter only to bake cakes. An increase in the price of flour could then discourage them from baking at all, so they would buy less butter as well!
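The gap between seeing and doing can be simulated. The following sketch rests on an invented toy model: a hidden binary variable Z ("this customer bakes") drives both flour (X) and butter (Y) purchases, and flour has no causal effect on butter at all. All probabilities are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000


def simulate(do_x=None):
    """Draw n customers; if do_x is given, X is forced to that value."""
    z = rng.random(n) < 0.5                         # hidden confounder
    if do_x is None:
        x = rng.random(n) < np.where(z, 0.9, 0.1)   # bakers tend to buy flour
    else:
        x = np.full(n, bool(do_x))                  # intervention ignores Z
    y = rng.random(n) < np.where(z, 0.8, 0.2)       # butter depends on Z only
    return x, y


# Seeing: butter rate among customers who happened to buy flour.
x_obs, y_obs = simulate()
p_seeing = y_obs[x_obs].mean()

# Doing: butter rate when everybody is made to buy flour.
x_do, y_do = simulate(do_x=True)
p_doing = y_do[x_do].mean()

print(f"P(Y=1 | X=1)     ~ {p_seeing:.3f}")
print(f"P(Y=1 | X=do(1)) ~ {p_doing:.3f}")
```

Under these assumed numbers the observational probability is 0.74 in theory, while the interventional one is 0.5, because forcing X breaks its link with Z.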

Counterfactuals or the action of Imagining

The last level is represented by Counterfactuals. They answer the typical question: what if I had acted differently? They thus rely on retrospective reasoning.

For instance, was the quantity of flour I bought the only reason why I bought so much butter, or was it due to the current promotion?

If X is the quantity of flour bought and Y the quantity of butter, then we want to compute P(y_x|x’,y’), i.e. the probability that Y = y would have been observed had X been x, given that we actually observed X to be x’ and Y to be y’.

Note that a model that can answer counterfactual questions can also answer questions about interventions and observations. The reverse does not hold. This is why counterfactuals are placed at the top of the hierarchy.
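To give a feel for how a counterfactual can be evaluated, here is a minimal sketch based on the usual three steps (abduction, action, prediction). It assumes a fully known and deliberately simplistic structural equation, butter = slope * flour + u, where u collects everything specific to the customer; both the equation and the slope value are hypothetical.

```python
def counterfactual_butter(x_obs, y_obs, x_new, slope=0.5):
    """Butter that would have been bought had flour been x_new.

    Assumes the hypothetical structural equation y = slope * x + u.
    """
    # 1. Abduction: recover the customer-specific term from what was observed.
    u = y_obs - slope * x_obs
    # 2. Action: replace the observed flour quantity by the imagined one.
    # 3. Prediction: recompute butter with the same customer-specific term.
    return slope * x_new + u


# Observed: 4 units of flour and 3 of butter. What if only 2 units of flour?
print(counterfactual_butter(x_obs=4, y_obs=3, x_new=2))  # prints 2.0
```

Because the noise term is recovered exactly here, the counterfactual is a single value; with unobserved noise one would obtain a distribution instead.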

Causal discovery methods

Now that we have identified and classified the different causal inferences, let’s focus on the main methods to establish them.

Modeling of causal structures through graphical model

A graphical model can be considered as a map of a dependence structure for a given probability distribution.

Causal structures can be visualized through directed acyclic graphs (DAGs). A DAG is a graph, i.e. a set of nodes connected by edges, in which every edge is directed and which does not contain any directed cycle.

However, the same dataset, with the same conditional (in)dependences between its variables, can be compatible with several DAGs.
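As a brief aside on the definition of a DAG given above, acyclicity can be checked mechanically. The sketch below assumes Python 3.9 or later (for the standard graphlib module) and represents a graph as a dictionary mapping each node to the set of its parents; the helper name is_dag is made up.

```python
from graphlib import CycleError, TopologicalSorter


def is_dag(parents):
    """Return True if the directed graph {node: set of parents} has no cycle."""
    try:
        # A topological order exists if and only if there is no directed cycle.
        list(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False


print(is_dag({"X": set(), "Y": set(), "Z": {"X", "Y"}}))  # prints True
print(is_dag({"X": {"Y"}, "Y": {"X"}}))                   # prints False
```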

Let’s consider the following example: we want to identify the main factors that influence students’ grades and measure their effects. For the sake of simplicity, we will only take three variables: X the number of class hours in mathematics (including individual lessons at home), Y the distance between students’ home and the school, Z the students’ grade in mathematics. We can assume that X ⊥ Y | Z.

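The skeleton X - Z - Y can be oriented in four ways. Three of them are compatible with this assumption (X → Z → Y, X ← Z ← Y and X ← Z → Y), whereas the fourth, the collider X → Z ← Y, would generally make X and Y dependent given Z.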

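This is why it is necessary to introduce an additional notion: the equivalence class. It is a set of DAGs that share the same skeleton and the same v-structures, and therefore encode the same conditional independences, while some of their edge marks differ. The equivalence class of a DAG is represented by a completed partially directed acyclic graph (CPDAG).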


While it is easy to read off the conditional (in)dependence of nodes in the case of three variables, larger graphs require an additional tool: d-separation.

Two nodes X and Y are d-separated by a set of nodes L if conditioning on all members of L blocks every path between the two nodes.

Thus, the notion of d-separation provides us with (in)dependence relations defined on graphs that reflect the conditional (in)dependencies between variables.
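To make d-separation tangible, here is a small pure-Python sketch. It uses one standard way of testing it (keep only the ancestors of the nodes involved, "moralize" the graph by linking parents that share a child, drop the conditioning nodes, then look for a remaining path). The function names and the dictionary representation (node mapped to its set of parents) are choices made for this example. It is applied to the four orientations of the skeleton X - Z - Y from the student example.

```python
from itertools import combinations


def ancestors(dag, nodes):
    """All given nodes plus their ancestors; dag maps node -> set of parents."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for parent in dag[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def d_separated(dag, x, y, given):
    """True if x and y are d-separated by the set `given` in the DAG."""
    keep = ancestors(dag, {x, y} | set(given))
    # Moral graph: undirected parent-child links plus links between co-parents.
    adj = {node: set() for node in keep}
    for child in keep:
        for parent in dag[child]:
            adj[child].add(parent)
            adj[parent].add(child)
        for p, q in combinations(dag[child], 2):
            adj[p].add(q)
            adj[q].add(p)
    # Search for a path from x to y that avoids the conditioning nodes.
    seen, stack = {x}, [x]
    while stack:
        for neighbour in adj[stack.pop()]:
            if neighbour not in seen and neighbour not in given:
                seen.add(neighbour)
                stack.append(neighbour)
    return y not in seen


orientations = {
    "X -> Z -> Y": {"X": set(), "Z": {"X"}, "Y": {"Z"}},
    "X <- Z <- Y": {"Y": set(), "Z": {"Y"}, "X": {"Z"}},
    "X <- Z -> Y": {"Z": set(), "X": {"Z"}, "Y": {"Z"}},
    "X -> Z <- Y": {"X": set(), "Y": set(), "Z": {"X", "Y"}},
}
for name, dag in orientations.items():
    print(name, d_separated(dag, "X", "Y", {"Z"}))
```

The two chains and the fork report True, while the collider reports False: conditioning on a common effect opens the path instead of blocking it.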

A well-known algorithm to learn a DAG (more precisely, its CPDAG) is the PC algorithm. It starts with a complete undirected graph G0 and carries out a series of conditional independence tests, deleting an edge whenever its two nodes are found to be independent. This leads to a skeleton whose edges are then oriented based on the information saved in the conditioning sets.
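The skeleton phase described above can be sketched in a few lines. This is a simplified illustration, not a reference implementation: it assumes roughly Gaussian data, uses partial correlations with a Fisher z-test as the conditional independence test, fixes the critical value at 3.29 (roughly a 0.1% two-sided level), and leaves out the orientation step entirely. The names pc_skeleton, partial_corr and independent are made up.

```python
from itertools import combinations

import numpy as np


def partial_corr(corr, i, j, cond):
    """Partial correlation of variables i and j given the variables in cond."""
    idx = [i, j, *cond]
    precision = np.linalg.inv(corr[np.ix_(idx, idx)])
    return -precision[0, 1] / np.sqrt(precision[0, 0] * precision[1, 1])


def independent(corr, n, i, j, cond, z_crit=3.29):
    """Fisher z-test: True if independence of i and j given cond is kept."""
    r = partial_corr(corr, i, j, cond)
    z = 0.5 * np.log((1 + r) / (1 - r))
    return np.sqrt(n - len(cond) - 3) * abs(z) < z_crit


def pc_skeleton(data):
    """Skeleton phase only: returns adjacency sets and separating sets."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    adj = {i: set(range(p)) - {i} for i in range(p)}  # complete graph
    sepsets = {}
    size = 0  # size of the conditioning sets tried in this round
    while any(len(adj[i]) - 1 >= size for i in adj):
        for i in range(p):
            for j in list(adj[i]):
                for cond in combinations(sorted(adj[i] - {j}), size):
                    if independent(corr, n, i, j, cond):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        sepsets[frozenset((i, j))] = set(cond)
                        break
        size += 1
    return adj, sepsets


# Hypothetical data from the chain X -> Z -> Y (columns 0, 1, 2).
rng = np.random.default_rng(4)
x = rng.normal(size=5000)
z = x + rng.normal(size=5000)
y = z + rng.normal(size=5000)
skeleton, sepsets = pc_skeleton(np.column_stack([x, z, y]))
print(skeleton)
print(sepsets)
```

On such data the edge between columns 0 and 2 (X and Y) should be removed with column 1 (Z) stored as its separating set, which is the information the orientation rules would use next.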

However, two issues can arise:

  • Hidden variables or confounders, i.e. variables that are not included in the data but influence the observed variables;
  • Selection bias due to the choice of variables and sample.

In that case, we need to find a structure that represents all conditional independence relationships among the observed variables given the selection variables. In other words, we only want to visualize the conditional independencies among the observed variables and marginalize out all latent variables.

Since DAGs are not closed under marginalization, we need to use another class of graphs: maximal ancestral graphs (MAGs). In a MAG, every missing edge corresponds to a conditional independence. Similarly to the CPDAG for DAGs, a partial ancestral graph (PAG) represents the equivalence class of a MAG, and m-separation generalizes d-separation to these graphs.

Learning a PAG can be done via the FCI algorithm (“Fast causal inference”), which uses a similar approach to PC but with more conditional independence tests and more orientation rules. It is also possible to use the RFCI algorithm (“Really fast causal inference”), which is faster but whose output is in general slightly less informative.

Estimation of causal effects

Observational data alone do not enable us to quantify the causal effect of one variable on another. To do so, we need to measure the state of Y if X is forced to take the value x and compare it to the state of Y if X is forced to take the value x + δ. For this, we rely on the interventional distribution P(y|X=do(x)).

When there are no hidden variables and no selection bias, and when the causal structure is a known DAG, information on the interventional distribution can be obtained by using a set of inference rules known as “do-calculus” developed by Pearl.
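One well-known result that can be derived with the do-calculus is the back-door adjustment: when an observed variable Z blocks every back-door path between X and Y, P(y|X=do(x)) equals the sum over z of P(y|x,z) P(z). The sketch below applies it to purely observational data from an assumed toy DAG (Z → X and Z → Y, with no effect of X on Y); the variables and probabilities are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Observational data from the assumed DAG Z -> X, Z -> Y (all binary).
z = rng.random(n) < 0.5
x = rng.random(n) < np.where(z, 0.9, 0.1)
y = rng.random(n) < np.where(z, 0.8, 0.2)


def backdoor_adjust(x_value):
    """Estimate P(Y=1 | X=do(x_value)) as sum_z P(Y=1 | x, z) * P(z)."""
    total = 0.0
    for z_value in (False, True):
        in_stratum = z == z_value
        p_z = in_stratum.mean()
        p_y_given_xz = y[in_stratum & (x == x_value)].mean()
        total += p_y_given_xz * p_z
    return total


naive = y[x].mean()               # P(Y=1 | X=1), biased by Z
adjusted = backdoor_adjust(True)  # estimate of P(Y=1 | X=do(1))

print(f"naive estimate    = {naive:.3f}")
print(f"adjusted estimate = {adjusted:.3f}")
```

With these assumed numbers the naive estimate is close to 0.74, while the adjusted one is close to 0.5, the value an actual intervention would produce in this model.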

In practice, the causal structure is rarely known. Nevertheless, it is still possible to obtain an estimate by considering the equivalence class of the true causal DAG and applying the do-calculus to each DAG within it. This yields a set of possible causal effects, which can still be a useful approximation.

These ideas are incorporated in the IDA method (Intervention calculus when the DAG is absent).
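The idea can be sketched on the three-variable example. This is a simplified illustration in the spirit of IDA rather than the method itself: it assumes a linear Gaussian model, in which the effect of X on Y under a given DAG is the coefficient of X when Y is regressed on X and the parents of X in that DAG. The data are simulated from the chain X → Z → Y with made-up coefficients (true effect 0.8 × 0.5 = 0.4), and the equivalence class is written out by hand.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Hypothetical data generated from the chain X -> Z -> Y.
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)
y = 0.5 * z + rng.normal(size=n)
data = {"X": x, "Z": z, "Y": y}


def effect_of_x_on_y(parents_of_x):
    """Coefficient of X when regressing Y on X and the given parents of X."""
    columns = [data["X"]] + [data[p] for p in parents_of_x] + [np.ones(n)]
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, data["Y"], rcond=None)
    return coef[0]


# Each DAG of the equivalence class, with the parents of X it implies.
equivalence_class = {
    "X -> Z -> Y": [],
    "X <- Z <- Y": ["Z"],
    "X <- Z -> Y": ["Z"],
}
possible_effects = {
    graph: effect_of_x_on_y(parents)
    for graph, parents in equivalence_class.items()
}
for graph, effect in possible_effects.items():
    print(f"{graph}: {effect:.3f}")
```

The result is a collection of candidate effects (about 0.4 for the first DAG and about 0 for the other two) rather than a single number, which is what one gets when the data cannot distinguish the DAGs of the class.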

Key findings

Causal modeling tools can be used to overcome three main issues that machine learning algorithms face:

  1. Lack of the ability to adapt to new circumstances they have not been trained for;
  2. Limited explainability, as they give few reasons behind their predictions or recommendations, which can lead to user distrust;
  3. No understanding of cause-effect connections.

Causal inference goes beyond prediction as it models the outcome of interventions and formalizes counterfactual reasoning.

Lastly, the following graph provides a summary of the main tools and algorithms discussed in the article.