Beyond Weisfeiler-Lehman: using substructures for provably expressive graph neural networks

Original article was published on Deep Learning on Medium


How powerful are graph neural networks?

Beyond Weisfeiler-Lehman: using substructures for provably expressive graph neural networks

TL;DR: In this post, I discuss how to design local and computationally efficient provably powerful graph neural networks that are not based on the Weisfeiler-Lehman tests hierarchy. This is the second in the series of posts on the expressivity of graph neural networks. See Part 1 describing the relation between graph neural networks and the Weisfeiler-Lehman graph isomorphism test. In Part 3, I will argue why we should abandon the graph isomorphism problem altogether.

Recent groundbreaking papers [1–2] established the connection between graph neural networks and the graph isomorphism tests, observing the analogy between the message passing mechanism and the Weisfeiler-Lehman (WL) test [3]. WL test is a general name for a hierarchy of graph-theoretical polynomial-time iterative algorithms for determining graph isomorphism. The k-WL test recolours k-tuples of vertices of a graph at each step according to some neighbourhood aggregation rules and stops upon reaching a stable colouring. If the histograms of colours of the two graphs are not the same, the graphs are deemed not isomorphic; otherwise, the graphs are possibly (but not necessarily) isomorphic.

Message passing neural networks are at most as powerful as the 1-WL test (also known as node colour refinement), and thus unable to distinguish between even very simple instances of non-isomorphic graphs. For example, message passing neural networks cannot count triangles [4], a motif known to play an important role in social networks where it is associated with the clustering coefficient indicative of how “tightly knit” the users are [5]. It is possible to design more expressive graph neural networks that replicate the increasingly more powerful k-WL tests [2,6]. However, such architectures result in high complexity and large number of parameters, but most importantly, typically require non-local operations that make them impractical.

Examples of non-isomorphic graphs that cannot be distinguished by 1-WL but can be distinguished by 3-WL due to its capability of counting triangles.

Thus, provably powerful graph neural networks based on the Weisfeiler-Lehman hierarchy are either not very powerful but practical, or powerful but impractical [7]. I argue that there is a different simple way to design efficient and provably powerful graph neural networks, which we proposed in a new paper with Giorgos Bouritsas and Fabrizio Frasca [8].

Graph Substructure Networks. The idea is actually very simple and conceptually similar to positional encoding or graphlet descriptors [9]: we make the message passing mechanism aware of the local graph structure, allowing for computing messages differently depending on the topological relationship between the endpoint nodes. This is done by passing to message passing functions additional structural descriptors associated with each node [10], which are constructed by subgraph isomorphism counting. In this way, we can partition the nodes of the graph into different equivalence classes reflecting topological characteristics that are shared both between nodes in each graph individually and across different graphs.

We call this architecture Graph Substructure Network (GSN). It has the same algorithmic design and memory and computational complexity as standard message passing neural networks, with an additional pre-computation step in which the structural descriptors are constructed. The choice of the substructures to count is crucial both to the expressive power of GSNs and the computational complexity of the pre-computation step.

The worst-case complexity of counting substructures of size k in a graph with n nodes is 𝒪(nᵏ). Thus, it is similar to high-order graph neural network models or Morris [2] and Maron [6]. However, GSN has several advantages over these methods. First, for some types of substructures such as paths and cycles the counting can be done with significantly lower complexity. Secondly, the computationally expensive step is done only once as preprocessing and thus does not affect network training and inference that remain linear, the same way as in message-passing neural networks. The memory complexity in training and inference is linear as well. Thirdly and most importantly, the expressive power of GSN is different from k-WL tests and in some cases is stronger.

How powerful are GSNs? The substructure counting endows GSN with more expressive power than the standard message-passing neural networks. First, it is important to clarify that the expressive power of GSN depends on the graph substructures used. Same way as we have a hierarchy of k-WL tests, we might have different variants of GSNs based on counting one or more structures. Using structures more complex than star graphs, GSNs can be made strictly more powerful than 1-WL (or the equivalent 2-WL) and thus also more powerful than standard message passing architectures. With 4-cliques, GSN is at least no less powerful than 3-WL, as shown by the following example of strongly regular graphs on which GSN succeeds while 3-WL fails:

Example of non-isomorphic strongly regular graphs with 16 vertices and node degree 6, where every two adjacent vertices have 2 mutual neighbours, and every two non-adjacent vertices also have 2 mutual neighbours. The 3-WL test fails on this example, while GSN with 4-clique structure can distinguish between them. In the graph on the left (known as the Rook’s graph) each node participates in exactly one 4-clique. The graph on the right (Shrikhande graph) has maximum cliques of size 3 (triangles). Figure from [8].

More generally speaking, for various substructures of 𝒪(1) size, as long as they cannot be counted by 3-WL, there exist graphs where GSN succeeds and 3-WL fails [11]. While we could not find examples to the contrary, they might in principle exist — that is why our statement about the power of GSN is of a weak form, “at least not less powerful”.

This holds for larger k as well; a generalisation of strongly regular graphs in the above figure, called kisoregular, are instances on which the (k+1)-WL test fails [12]. These examples can also be distinguished by GSN with appropriate structures. The expressive power of GSNs can thus be captured by the following figure:

GSN is outside the Weisfeiler-Lehman hierarchy. With the appropriate structure (e.g. cliques or cycles of certain size), it is likely to be made at least not less powerful than k-WL.

How powerful can GSN be in principle? This is still an open question. The Graph Reconstruction Conjecture [13] postulates the possibility of recovering a graph from all its node-deleted substructures. Thus, if the Reconstruction Conjecture is correct, a GSN with substructures of size n−1 would be able to correctly test isomorphism of any graphs. However, the Reconstruction Conjecture is currently proven only for graphs of size n≤11 [14], and second, such large structures would be impractical.

The more interesting question is whether a similar result exists for “small” structures (of 𝒪(1) size independent of the number of nodes n). Our empirical results show that GSN with small substructures such as cycles, paths, and cliques work for strongly regular graphs, which are known to be a tough nut for the Weisfeiler-Lehman tests.

Most importantly, GSN builds on top of standard message-passing architectures and thus inherits its locality and linear complexity. The hyperparameters of the method include the structures counted for the construction of the structural descriptors. It is likely that practical applications will be guided by the tradeoff between the required expressive power, the size of the structures that can guarantee it, and the complexity of computing them.

Using GSN with cycles of length k≥6 significantly improves the prediction of chemical properties of molecular graphs in the ZINC database that is used by pharmaceutical companies for virtual screening of drug candidates. Such cyclic structures are abundant in organic molecules. Figure from [8].

In our experiments, we observed that different problems and datasets benefit from different substructures, so it is likely that this choice is problem-specific. Fortunately, we often know what substructures matter in some applications. For example, in social networks, triangles and higher-order cliques are common and have a clear “sociological” interpretation. In chemistry, cycles are a very frequent pattern, such as 5- and 6-membered aromatic ring that appear in a plethora of organic molecules. The figure below shows an example most of us are familiar with, the molecule of caffeine, whose level in my bloodstream is alarmingly low. This sounds like the right moment to finish this post and make myself a cup of coffee.

1,3,7-Trimethylxanthine, better known as caffeine, is an example of a chemical cyclic compound containing 5- and 6-rings (denoted by red and yellow).