On the Difficulty of Finding Structural Causal Models

Original article can be found here (source): Artificial Intelligence on Medium


Research in causality is gaining popularity, but it is difficult! Let’s take an intuitive look at why, and at whether it can be useful for machine learning.

The link between correlation and causation has been discussed in science for a long time. By now, we all agree that correlation can be misleading, which is why the scientific community gave birth to causal inference and causal discovery. Previously I wrote an article on statistics versus causality, which is a bit more general than this one. In this article, I want to make clear what exactly makes it so difficult to find causal models in the context of graphical models. And yes, it is difficult.

[1]

Firstly, to understand the power of causal models, consider graphical models in general for describing probability distributions. Bayesian networks are such models (coincidentally introduced by the same person who introduced do-calculus, Judea Pearl). In the case of Bayesian networks, we want to model a joint probability distribution. But not just in any way: we want to do it in a smart way, such that the computational overhead is minimal. Put another way, we want a sparse factorization of the joint probability distribution.

To be clearer, let’s write it down in simple terms. Let us assume we have n random variables, so the joint probability distribution that we are interested in is

$$P(X_1, X_2, \dots, X_n)$$

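Now, for those of you familiar with the product rule behind Bayes’ rule, we can apply it repeatedly (the chain rule of probability) and write out this joint probability distribution as the following:

$$P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \dots, X_{i-1})$$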

As you can see, the number of conditionals in this factorization grows with n, and the later ones condition on almost all of the other variables. There are two things to notice here:

  1. Calculating those conditionals can be computationally expensive
  2. The statistics may be misleading
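To get a feel for the first point, here is a minimal sketch that counts how many free parameters the full chain-rule factorization needs. It assumes, purely for illustration, that all n variables are binary; the function name is hypothetical and does not come from any library.

```python
def chain_rule_parameter_count(n: int) -> int:
    """Free parameters of the full chain-rule factorization over n binary variables.

    The i-th conditional P(X_i | X_1, ..., X_{i-1}) needs one probability for
    each of the 2**(i-1) configurations of the variables it conditions on.
    """
    return sum(2 ** (i - 1) for i in range(1, n + 1))


# The count equals 2**n - 1, so it grows exponentially with n.
for n in (2, 5, 10, 20):
    print(n, chain_rule_parameter_count(n))
```

Under this binary assumption, even 20 variables already require about a million numbers to be estimated.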

Thankfully, we can use the basic concept of independence in statistics to make this factorization sparser. The simple rule is that if two variables X and Y are independent, then we can write their joint distribution as

$$P(X, Y) = P(X)\,P(Y)$$

This means that we can remove certain variables from certain conditionals, which makes the computation easier. This can be nicely formulated in terms of a graphical model in which each child depends only on its parents. As an example, consider the following Bayesian network (graphical model):

From the graph, we can read out the presented factorization of the joint distribution:

Finding the correct graph is a hard problem. There are many factorizations that may explain the data but are incorrect in the sense that they ignore certain independencies. So keep in mind that, even before we start thinking about causality, identifying the correct dependencies is already a hard problem!
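To make the idea of a sparse factorization concrete, here is a small sketch. The network, its three binary variables A, B and C, and all probability values are hypothetical and chosen only for illustration; they are not the network from the figure above.

```python
from itertools import product

# Hypothetical network over binary variables: A -> B and A -> C.
# Each table maps a tuple of parent values to P(child = 1 | parents).
parents = {"A": (), "B": ("A",), "C": ("A",)}
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.2, (1,): 0.9},
    "C": {(0,): 0.5, (1,): 0.1},
}


def joint_probability(assignment: dict) -> float:
    """P(A, B, C) as the product of each variable's conditional given its parents."""
    prob = 1.0
    for var, pa in parents.items():
        p_one = cpt[var][tuple(assignment[p] for p in pa)]
        prob *= p_one if assignment[var] == 1 else 1.0 - p_one
    return prob


def parameter_count() -> int:
    """One free parameter per parent configuration of each binary variable."""
    return sum(2 ** len(pa) for pa in parents.values())


# Summing the factorized joint over all assignments must give 1.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product((0, 1), repeat=len(parents))
)
print(total)
print(parameter_count())  # 1 + 2 + 2 parameters instead of 2**3 - 1
```

Because B and C each depend only on A, this toy model needs 5 parameters, whereas the full chain-rule factorization over three binary variables needs 7. The gap widens quickly as the number of variables grows.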

We know that graphical models are nice in the sense that they can save computation time. But we still have the problem of interpreting a graphical model. Can we actually be sure that the parent variables cause the child variables? Not at all. The edges of a Bayesian network only encode a factorization of the joint distribution, so they carry no sense of a causal direction. Now, for comparison, causal models have a few traits that other graphical models don’t. The causal direction manifests itself in the form of a directed edge in the graph. The edge itself symbolizes the causal mechanism.

[2]


When we are looking for a causal model, we can be looking for different things. We may already have a graphical model that explains the data, but not know the direction of the causal mechanisms. Or we may have to search for the whole graphical model in the first place. The one thing we assume is that the causal model is minimal, i.e. it is the sparsest possible factorization that explains the data.

Another neat assumption about causal models is the independence of causal mechanisms, meaning that if we have information about one causal mechanism, we cannot get any information about the others. Clearly, if the mechanisms in the causal model were dependent, then the model wouldn’t be causal. Interestingly, the human visual system makes a similar assumption, called the generic viewpoint assumption. The classical example of this is the Beuchet chair.

Without getting into too much detail, one such model is a simple two-variable model, which can be written in two lines:

$$X_1 := N_1$$
$$X_2 := f(X_1, N_2)$$

The causal link is the function f, which takes X1 and some noise variable N2 as its inputs. To make the graphical model from above causal, we would draw its edges as arrows that indicate the causal direction:
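A two-variable model of this kind can be simulated in a few lines. The sketch below is only an illustration: the linear form of f, the standard normal noise variables and the function names are all assumptions, since the article does not fix any of them.

```python
import random


def f(x1: float, n2: float) -> float:
    """Hypothetical causal mechanism; the article leaves f unspecified."""
    return 2.0 * x1 + n2


def sample_scm(n_samples: int, seed: int = 0) -> list:
    """Draw (X1, X2) pairs from X1 := N1, X2 := f(X1, N2) with independent noises."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        n1 = rng.gauss(0.0, 1.0)  # noise of the cause
        n2 = rng.gauss(0.0, 1.0)  # noise of the effect, drawn independently of n1
        x1 = n1
        x2 = f(x1, n2)
        samples.append((x1, x2))
    return samples


def pearson(xs, ys) -> float:
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


samples = sample_scm(2000)
correlation = pearson([s[0] for s in samples], [s[1] for s in samples])
print(correlation)
```

The samples are strongly correlated, but the correlation coefficient alone is symmetric in X1 and X2, so it says nothing about which variable was generated from which.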

Well, that’s nice, but the problem is that we are really not sure about the arrows. For example, we can reverse the arrows, as in the image below. We can probably find models that explain the correlation in the anti-causal direction, even though that direction is not correct:
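Here is a rough sketch of why the data alone may not settle the direction. It assumes a toy ground truth in which X1 causes X2 through a linear mechanism with Gaussian noise (both assumptions made only for this example) and fits a straight line in both directions with ordinary least squares.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of ys on xs; returns (slope, intercept, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    r2 = 1.0 - ss_res / syy  # fraction of variance explained by the fit
    return slope, intercept, r2


rng = random.Random(0)
# Assumed ground truth for this toy example: X1 causes X2.
x1 = [rng.gauss(0.0, 1.0) for _ in range(2000)]
x2 = [2.0 * a + rng.gauss(0.0, 1.0) for a in x1]

_, _, r2_causal = fit_line(x1, x2)      # X2 predicted from X1
_, _, r2_anticausal = fit_line(x2, x1)  # X1 predicted from X2
print(r2_causal, r2_anticausal)
```

For a simple linear fit, the explained variance equals the squared correlation in both directions, so the anti-causal model fits the data exactly as well as the causal one.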

Searching for causal models is difficult. Again, think about the number of arrow arrangements that are possible in such a model. Each edge has 2 possible directions, so if a causal model has E edges, there exist 2^E possible direction arrangements. The search space (just for the edge directions) is exponential in the number of edges. On top of that, we usually do not know where the edges are in the first place. With V vertices (variables), there are V(V-1)/2 pairs of vertices, and each pair is either connected by an edge or not, which gives this nice formula for the number of possible edge configurations in the graph:

$$2^{V(V-1)/2}$$

This means that for each graph counted by the formula above, we have an additional 2^E possible arrangements of its edge directions… This is a real mess for high-dimensional problems. And when we consider that we might also want to learn the mechanisms themselves, i.e. to fit the models… What a mess. This is to show that…
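The size of this search space is easy to compute. In the sketch below the function names are hypothetical, and orientations that contain directed cycles are not excluded, so the last number is an upper bound on the number of valid causal graphs rather than an exact count.

```python
from math import comb


def skeleton_count(v: int) -> int:
    """Undirected graphs on v labeled vertices: each pair has an edge or not."""
    return 2 ** comb(v, 2)


def orientation_count(e: int) -> int:
    """Ways to assign a direction to each of e edges."""
    return 2 ** e


def oriented_graph_count(v: int) -> int:
    """Sum of 2**E orientations over all skeletons (cycles are not excluded)."""
    pairs = comb(v, 2)
    # comb(pairs, e) skeletons have exactly e edges.
    return sum(comb(pairs, e) * orientation_count(e) for e in range(pairs + 1))


for v in (3, 5, 10):
    print(v, skeleton_count(v), oriented_graph_count(v))
```

Each pair of vertices has three states (no edge, or an edge in one of two directions), so the total works out to 3^(V(V-1)/2). For only 10 variables that is already 3^45, roughly 3 × 10^21 candidates.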

Causal discovery is a difficult problem!

[1]

Finding methods that identify the causal direction between variables efficiently is an open research question. But in some cases, the causal direction is simply not identifiable. This depends on the type of function that describes the causal mechanism and on the type of noise distribution used within the mechanism. Consequently, identifiability is assumed when we want to find causal directions in the data.
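One well-known case is a linear mechanism: with Gaussian noise the direction cannot be recovered from observational data, whereas with non-Gaussian noise it can. The sketch below illustrates the second case under loud assumptions: a linear mechanism, uniform noise, and a crude dependence score (the correlation between squared residuals and squared regressor values) standing in for a proper statistical independence test. All names are hypothetical.

```python
import random


def pearson(xs, ys) -> float:
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(xs, ys):
    """Residuals of the ordinary least squares fit of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def dependence_score(regressor, target) -> float:
    """Crude check of whether the residuals still depend on the regressor."""
    res = residuals(regressor, target)
    return abs(pearson([r * r for r in res], [x * x for x in regressor]))


rng = random.Random(0)
# Assumed ground truth: X causes Y, linear mechanism, uniform (non-Gaussian) noise.
x = [rng.uniform(-1.0, 1.0) for _ in range(5000)]
y = [a + rng.uniform(-1.0, 1.0) for a in x]

forward = dependence_score(x, y)   # fit in the causal direction
backward = dependence_score(y, x)  # fit in the anti-causal direction
print(forward, backward)
```

In the causal direction the residuals are just the independent noise, so the score stays close to zero. In the anti-causal direction the residuals are uncorrelated with the regressor but still depend on it, and the score is clearly larger. That asymmetry is what makes the direction identifiable in this particular setting.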

As it turns out, sometimes we indeed want to learn the anti-causal direction (i.e. to predict the cause from the effect), since there are cases where this is perfectly predictable. The usefulness of causal inference in machine learning is also quite a difficult topic. Everyone is saying that “the next level of AI requires causal reasoning”, but hardly anybody realizes what this actually entails. The research is flooded with wrong uses of the word “causal”. But, to end with a positive outlook, that leaves a lot of room for contributions!