Original article was published on Deep Learning on Medium

# Understanding the Building Blocks of Graph Neural Networks (Intro)

## Intuitions (with running code) on the framework used to analyze and learn graph data with neural architectures

This post is an introduction to a series of articles on *Graph Neural Networks* (GNNs). The goal of this series is to provide a detailed description, with intuitions and examples, of the GNN building blocks.

In this series, I will also share running code, using NumPy, PyTorch, and the most prominent libraries adopted in this field, such as Deep Graph Library (DGL) and PyTorch Geometric. At the end of this series, you will be able to combine these building blocks and create a neural architecture to perform analysis and learning tasks on graph data.

This series will analyze the main components to set up a GNN, including (i) the input layer, (ii) the GNN layer(s), and (iii) the Multilayer Perceptron (MLP) prediction layer(s).

The framework to analyze and decompose the standard GNN architectures is based on the recent paper entitled “Benchmarking Graph Neural Networks”, whose metadata is available below:

Dwivedi, V. P., Joshi, C. K., Laurent, T., Bengio, Y., & Bresson, X. (2020). Benchmarking Graph Neural Networks. arXiv preprint arXiv:2003.00982. https://arxiv.org/abs/2003.00982


This post does not cover the fundamentals of graph theory and neural networks; for an introduction to these topics, I suggest starting with a dedicated primer before reading on.

## The GNN Architecture: Overview of the Main Components

The *input layer* defines the initial representation of graph data, which becomes the input to the GNN layer(s). Basically, the idea is to assign a feature representation to the nodes and edges of the graph.

The *GNN layer* encodes the information on the structure of the graph. Then, it exploits this information to update the initial representation of nodes and edges.

The *MLP prediction layer* performs a specific learning task, such as *node classification* or *link prediction*, employing the encoded graph representation obtained as output from the GNN layer(s).

This post introduces the input layer and the main principles behind a GNN layer. The next articles will describe different types of GNN layers, explaining the related features and showing the main differences between them. In parallel, I will provide an overview of the traditional MLP prediction layers to perform specific tasks on graph data.

## The Input Layer

As mentioned before, the goal of the input layer is to define the initial representation of graph data, assigning features to nodes and edges. For the sake of simplicity, I currently consider only the node features.

The simplest way to represent nodes in the graph is by using *one-hot* vectors. This representation is usually adopted to distinguish different words in a vocabulary for NLP tasks. In our case, it is exploited to represent different nodes of the graph. The length of the vector representing each node is equal to the number of nodes and, for each vector, an element in a different position is set to 1, while the other elements are set to 0.

In order to clarify this representation, the following script creates a graph with 5 nodes, represented using one-hot vectors.

```python
import numpy as np

X = np.eye(5, 5)
n = X.shape[0]
np.random.shuffle(X)
print(X)

-- output:
[[0. 0. 1. 0. 0.]   # Node 1
 [0. 0. 0. 1. 0.]   # Node 2
 [1. 0. 0. 0. 0.]   # ..
 [0. 0. 0. 0. 1.]   # ..
 [0. 1. 0. 0. 0.]]  # Node 5
```

Each row of this matrix represents a node of the graph. To assign initial features to each of these nodes, the input layer applies a linear transformation (also called *projection*) to the one-hot vectors that encode the node representations. As a brief recap, the linear transformation is defined as follows:

`Y = XW + b`

As reported by Dwivedi et al., the bias value *b* is not used for the linear transformation in the case of the one-hot vectors. Therefore, the following script performs the linear transformation, defining the output of the input layer, which will be passed to the GNN layer:

```python
# Dimension of the node features (embedding)
emb = 3

# Weight matrix (initialized according to Glorot & Bengio (2010))
W = np.random.uniform(-np.sqrt(1. / emb), np.sqrt(1. / emb), (n, emb))
print(W)

-- output:
[[-0.34857891 -0.5419972   0.43603217]
 [ 0.26261991  0.04720523 -0.42555547]
 [-0.09968833  0.3218483   0.09688095]
 [-0.36646565  0.37652735 -0.45564272]
 [-0.24990413 -0.50164433 -0.51217414]]

# Linear projection
L_0 = X.dot(W)
print(L_0)

-- output:
[[-0.09968833  0.3218483   0.09688095]
 [-0.36646565  0.37652735 -0.45564272]
 [-0.34857891 -0.5419972   0.43603217]
 [-0.24990413 -0.50164433 -0.51217414]
 [ 0.26261991  0.04720523 -0.42555547]]
```

The projection step assigns a *d*-dimensional vector representation to each node in the graph. In this example, the 5-dimensional one-hot vectors representing the nodes are mapped (or projected) into 3-dimensional dense feature vectors.

Paraphrasing Dwivedi et al.:

The goal of the input layer is to embed the input features of nodes (and edges) to a d-dimensional vector of hidden features. This new representation is obtained via a simple linear transformation (also known as projection).

To clarify this aspect, you can analyze the following block:

```python
# X: One-hot vectors representing the nodes
[[0. 0. 1. 0. 0.]   # Node 1 - 1 element in the 3rd position,
 [0. 0. 0. 1. 0.]   #          0 in the other positions
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 0.]]

# W: Weight matrix
[[-0.34857891 -0.5419972   0.43603217]
 [ 0.26261991  0.04720523 -0.42555547]
 [-0.09968833  0.3218483   0.09688095]  # Emphasis on the 3rd row
 [-0.36646565  0.37652735 -0.45564272]
 [-0.24990413 -0.50164433 -0.51217414]]

# L_0 (projection) = X.dot(W)
[[-0.09968833  0.3218483   0.09688095]  # Features of Node 1,
 [-0.36646565  0.37652735 -0.45564272]  # represented by the 3rd row
 [-0.34857891 -0.5419972   0.43603217]  # of the weight matrix
 [-0.24990413 -0.50164433 -0.51217414]
 [ 0.26261991  0.04720523 -0.42555547]]
```

The current features have been generated randomly. As a consequence, these features do not actually convey any information about the nodes. However, these initial features will be updated by means of two different steps:

- The aggregation of the features of neighbor nodes by means of the GNN layer(s).
- The training of the neural architecture for a specific purpose, by means of the MLP layer(s).

At the end of this twofold process, we will be able to obtain an *embedding* representation of the nodes, which will be characterized by features that convey specific information. In other words, the vector representation of the nodes will express meaningful information that, as human beings, we should be able to recognize observing the graph. In the simplest case, similar embedded features will be assigned to similar nodes in the graph.
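Since each one-hot row selects exactly one row of the weight matrix, the projection performed by the input layer behaves like a lookup table over node embeddings. The following sketch verifies this equivalence; the permutation and the seeded random weight matrix are illustrative, not the values printed above:

```python
import numpy as np

n, emb = 5, 3
rng = np.random.default_rng(0)
# Weight matrix, initialized as in the script above
W = rng.uniform(-np.sqrt(1. / emb), np.sqrt(1. / emb), (n, emb))

# One-hot matrix whose i-th row selects node perm[i]
perm = np.array([2, 3, 0, 4, 1])
X = np.eye(n)[perm]

# Projecting one-hot vectors is just row selection on W
L_0 = X.dot(W)
print(np.allclose(L_0, W[perm]))  # True
```

This is why frameworks such as PyTorch implement the input layer of one-hot inputs as an embedding lookup rather than an explicit matrix product.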

## The GNN Layer

The goal of the GNN layer is to update the *d*-dimensional representation of the nodes obtained from the input layer. This goal is reached by computing, as defined by Dwivedi et al., a “recursive neighborhood diffusion” through the so-called “message passing framework”. The main idea behind this framework is that each node feature is updated with the features of its neighbors. The neighbor features are *passed* to the target node as *messages* through the edges. As a consequence, the new representation of the node encodes the local structure of the graph. To perform this step, we need a structure that describes the relations (edges) between the nodes in the graph: the adjacency matrix serves exactly this purpose.

Consider the following script, which initializes a random adjacency matrix in a graph of 5 nodes:

```python
# Randomly generated adjacency matrix
A = np.random.randint(2, size=(n, n))
np.fill_diagonal(A, 1)  # Include the self-loop

# The following lines are a trivial hack to create a symmetric
# adjacency matrix that defines the edges of an undirected
# graph of 5 nodes
A = (A + A.T)
A[A > 1] = 1
print(A)

-- output:
[[1 1 1 0 1]   # Connections of Node 1
 [1 1 1 1 1]
 [1 1 1 1 0]
 [0 1 1 1 0]
 [1 1 0 0 1]]
```

Each row of the adjacency matrix represents the connections of a node, identified by the entries equal to 1. For instance, the first row indicates that Node 1 is connected to itself, Node 2, Node 3, and Node 5. On the other hand, Node 1 is not connected to Node 4, because the value at position (1, 4) is equal to 0.
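As a quick sanity check, the neighborhood encoded by a row can be extracted with `np.nonzero`; here the adjacency matrix printed above is hard-coded for reproducibility:

```python
import numpy as np

# Adjacency matrix from the script above (self-loops included)
A = np.array([[1, 1, 1, 0, 1],
              [1, 1, 1, 1, 1],
              [1, 1, 1, 1, 0],
              [0, 1, 1, 1, 0],
              [1, 1, 0, 0, 1]])

# Neighbors of Node 1: column indices where row 0 equals 1
print(np.nonzero(A[0])[0])  # [0 1 2 4]
```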

Let’s see what happens when we multiply the adjacency matrix by the output of the input layer, i.e., the projected node features:

```python
# A: Adjacency matrix
[[1 1 1 0 1]   # Connections of Node 1
 [1 1 1 1 1]
 [1 1 1 1 0]
 [0 1 1 1 0]
 [1 1 0 0 1]]

# L_0: Output from the input layer
[[-0.09968833  0.3218483   0.09688095]  # Features of Node 1
 [-0.36646565  0.37652735 -0.45564272]
 [-0.34857891 -0.5419972   0.43603217]
 [-0.24990413 -0.50164433 -0.51217414]
 [ 0.26261991  0.04720523 -0.42555547]]

# L_1 = A.dot(L_0)
[[-0.55211298  0.20358368 -0.34828506]  # What is this?
 [-0.8020171  -0.29806065 -0.86045919]
 [-1.06463701 -0.34526588 -0.43490372]
 [-0.96494868 -0.66711419 -0.53178468]
 [-0.20353407  0.74558089 -0.78431723]]
```

To better understand what has happened to the node representations, consider the following script, which sums the *d*-dimensional representations of Node 1’s neighborhood: Node 1 itself, Node 2, Node 3, and Node 5.

```python
print(L_0[0, :] + L_0[1, :] + L_0[2, :] + L_0[4, :])

-- output:
[-0.55211298  0.20358368 -0.34828506]

# L_1 = A.dot(L_0)
[[-0.55211298  0.20358368 -0.34828506]  # Features of Node 1,
 [-0.8020171  -0.29806065 -0.86045919]  # obtained summing the
 [-1.06463701 -0.34526588 -0.43490372]  # features of local neighbors
 [-0.96494868 -0.66711419 -0.53178468]
 [-0.20353407  0.74558089 -0.78431723]]
```

As you can see, the updated vector representation of Node 1 corresponds to the *aggregation* (in this case, a sum operation) of the features of its neighbors. In other words, this representation encodes the local structure of the graph.
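One practical caveat of the plain sum is that the scale of the updated features grows with the node degree. A common variant divides each row of the adjacency matrix by the node degree, turning the sum into a mean over the neighborhood. The following sketch (with illustrative seeded random features; this is not the exact symmetric normalization used by GCNs) shows both:

```python
import numpy as np

# Adjacency matrix from the example above (self-loops included)
A = np.array([[1, 1, 1, 0, 1],
              [1, 1, 1, 1, 1],
              [1, 1, 1, 1, 0],
              [0, 1, 1, 1, 0],
              [1, 1, 0, 0, 1]], dtype=float)

rng = np.random.default_rng(42)
L_0 = rng.normal(size=(5, 3))  # illustrative node features

# Sum aggregation, as in the article
L_1_sum = A.dot(L_0)

# Mean aggregation: normalize each row by the node degree,
# so high-degree nodes do not dominate the feature scale
deg = A.sum(axis=1, keepdims=True)
L_1_mean = (A / deg).dot(L_0)

# Row 0 of the mean variant averages Node 1's neighborhood
print(np.allclose(L_1_mean[0], L_0[[0, 1, 2, 4]].mean(axis=0)))  # True
```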

One of the key ideas of this representation is that stacking *L* layers in the neural architecture produces a target node representation that aggregates the features of the nodes at distance *L* from the target node. This behavior is the result of the “recursive neighborhood diffusion”.
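The *L*-hop effect can be checked on a small path graph: a single aggregation only reaches a node’s direct neighbors, while applying the adjacency matrix twice (two stacked layers) reaches the 2-hop neighborhood. A minimal sketch:

```python
import numpy as np

# Path graph 1-2-3-4-5 with self-loops: in one aggregation step,
# each node only sees its immediate neighbors
A = np.eye(5)
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1

reach_1 = A > 0          # nodes reachable with one layer (1-hop)
reach_2 = A.dot(A) > 0   # two stacked layers: 2-hop neighborhood

print(reach_1[0])  # [ True  True False False False]
print(reach_2[0])  # [ True  True  True False False]
```

Node 1 receives information from Node 3 only after the second aggregation, which is exactly the recursive diffusion described above.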

As underlined by Dwivedi et al.:

“Stacking L GNN layers allows the network to build node representations from the L-hop neighborhood of each node.”

The main differences between GNN layers lie in the type of aggregation performed by exploiting the local graph structure. In the simplest formulations of the GNN, such as the vanilla Graph Convolutional Network (GCN), the aggregation/update is an *isotropic* operation: the features of all neighbor nodes are considered in the same way. More advanced neural architectures, such as the Graph Attention Network (GAT), introduce *anisotropic* operations, in which the contribution of each neighbor node to the aggregation is weighted according to its importance.
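To make the isotropic/anisotropic distinction concrete, here is a minimal NumPy sketch of a single node’s update. The attention scores below are random stand-ins for a learned mechanism (GAT actually computes them from the node features), and all names are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
H = rng.normal(size=(5, 3))   # node features
neighbors = [0, 1, 2, 4]      # Node 1's neighborhood (self-loop included)

# Isotropic aggregation: every neighbor contributes equally
iso = H[neighbors].mean(axis=0)

# Anisotropic aggregation (GAT-style sketch): per-neighbor weights
# decide how much each neighbor contributes; random stand-ins here
scores = rng.normal(size=len(neighbors))
alpha = softmax(scores)       # weights sum to 1
aniso = (alpha[:, None] * H[neighbors]).sum(axis=0)

print(iso.shape, aniso.shape)  # (3,) (3,)
```

Both variants produce a *d*-dimensional update; what changes is only how the neighborhood is weighted.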

In the next post, I will introduce the GCN layer, describing also a specific extension for graphs with labeled edges (Knowledge Graphs), named Relational Graph Convolutional Network (R-GCN).