An Introduction to Graph Attention Networks

Source: Deep Learning on Medium

Graph learning is becoming increasingly relevant, as a significant amount of real-world data can be modelled as graphs.

A Visual Representation of the Cora Dataset as a Graph. Reference [1].

The Graph Attention Network, or GAT, is a non-spectral learning method that uses the spatial information of a node's neighbourhood directly for learning. This is in contrast to the spectral approach of the Graph Convolutional Network, which mirrors the same basics as the Convolutional Neural Network.

In this article, I will explain how the GAT is constructed.

The basic building block of the GAT is the Graph Attention Layer. To explain it, the following graph is used as an example.

An Example Graph

Here h_i is the feature vector of node i, of length F.

Step 1: Linear Transformation

The first step performed by the Graph Attention Layer is to apply a linear transformation, parametrized by a shared weight matrix W, to the feature vectors of the nodes.
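As a minimal sketch, the linear transformation is just a matrix product of the node feature matrix with W (all shapes below are hypothetical examples, not from the article):

```python
import numpy as np

# Hypothetical example: 4 nodes, F = 3 input features, F' = 2 output features.
np.random.seed(0)
F, F_prime = 3, 2
h = np.random.randn(4, F)        # node feature matrix, one row per node
W = np.random.randn(F, F_prime)  # shared weight matrix, learned in practice

Wh = h @ W                       # transformed features, shape (4, 2)
```

The same W is applied to every node, which is what makes the layer independent of the graph's size.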

Step 2: Computation of Attention Coefficients

Attention coefficients determine the relative importance of a neighbouring node's features to a given node. They are calculated by a shared attention function a, which maps the pair of transformed feature vectors of two neighbouring nodes i and j to a single scalar score.
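Following the original GAT formulation, the raw coefficient for a pair of neighbouring nodes can be written as:

```latex
e_{ij} = a\left(W\vec{h}_i,\; W\vec{h}_j\right),
\qquad a : \mathbb{R}^{F'} \times \mathbb{R}^{F'} \rightarrow \mathbb{R}
```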

The following figures further explain this.

Individual steps in the calculation of attention coefficients for each pair of neighbouring nodes

As we observe in the series of images above, we first calculate the self-attention coefficient and then compute attention coefficients with all of the neighbours.
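In the original paper, a is a single-layer feedforward network: the two transformed vectors are concatenated, projected onto a learnable vector, and passed through a LeakyReLU. A minimal NumPy sketch (all names and shapes hypothetical):

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    # LeakyReLU non-linearity used in the original GAT attention function
    return np.where(x > 0, x, alpha * x)

# Hypothetical transformed features from Step 1: 4 nodes, F' = 2.
np.random.seed(1)
F_prime = 2
Wh = np.random.randn(4, F_prime)
a = np.random.randn(2 * F_prime)   # learnable attention vector

def attention_coefficient(i, j):
    # e_ij = LeakyReLU(a . [Wh_i || Wh_j]): concatenate, project, activate
    return leaky_relu(a @ np.concatenate([Wh[i], Wh[j]]))
```

Note that e_ij and e_ji are generally different, since the concatenation order matters.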

Step 3: Normalization of Attention Coefficients

Due to the varied structure of graphs, nodes can have different numbers of neighbours. To obtain a common scale across all neighbourhoods, the attention coefficients are normalized using the softmax function.

Normalization of Calculated Attention Coefficient
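Written out, the normalization is a softmax over each node's neighbourhood:

```latex
\alpha_{ij} = \operatorname{softmax}_j(e_{ij})
            = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}
```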

Here N_i is the neighbourhood of node i.
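A minimal sketch of this normalization for one node's neighbourhood (the max-subtraction is a standard numerical-stability trick, not part of the formula):

```python
import numpy as np

def normalize_attention(e):
    # Softmax over the raw attention scores of one node's neighbourhood
    e = np.asarray(e, dtype=float)
    shifted = e - e.max()          # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()         # coefficients now sum to 1
```

Because every neighbourhood's coefficients sum to 1, nodes with many neighbours and nodes with few are placed on the same scale.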

Step 4: Computation of Final Output Features

Now we compute the learned output features of the nodes: each node's new feature vector is a weighted combination of its neighbours' transformed features, passed through a non-linear transformation σ.
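In the notation used so far, the output feature of node i is:

```latex
\vec{h}_i' = \sigma\left( \sum_{j \in \mathcal{N}_i} \alpha_{ij}\, W \vec{h}_j \right)
```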

Example Network Architecture
Computation of Learned Output Features
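A sketch of this aggregation for a single node (the coefficients, shapes, and the choice of tanh as σ are hypothetical examples):

```python
import numpy as np

# Hypothetical neighbourhood of node i: 3 neighbours with F' = 2 features each.
np.random.seed(2)
Wh_neighbours = np.random.randn(3, 2)   # transformed neighbour features
alphas = np.array([0.5, 0.3, 0.2])      # normalized coefficients, sum to 1

# Weighted sum of neighbour features, then a non-linearity (tanh as an example)
h_i_new = np.tanh(alphas @ Wh_neighbours)
```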

Step 5: Computation of Multiple Attention Mechanisms

To improve the stability of the learning process, multi-head attention is employed: we compute multiple independent attention mechanisms in parallel and finally aggregate all the learned representations.

Final Computation of Learned Features

K denotes the number of independent attention maps used.
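As in the original paper, the heads are concatenated in hidden layers and averaged in the final prediction layer. A minimal sketch with hypothetical per-head outputs:

```python
import numpy as np

# Hypothetical: K = 3 attention heads, each yielding an F' = 2 vector for a node.
np.random.seed(3)
K = 3
head_outputs = [np.random.randn(2) for _ in range(K)]

# Hidden layers: concatenate the K head outputs (output size K * F')
concatenated = np.concatenate(head_outputs)

# Final (prediction) layer: average the heads instead, as in the GAT paper
averaged = np.mean(head_outputs, axis=0)
```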

An overall look at all operations involved in the update of learned features of nodes. Reference [1].
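Putting Steps 1 through 4 together, a single-head Graph Attention Layer can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions (dense adjacency matrix, tanh as σ), not the paper's reference code; a real implementation would use a GNN library:

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def gat_layer(h, adj, W, a):
    """Single-head graph attention layer (NumPy sketch, names hypothetical).

    h:   (N, F) node features
    adj: (N, N) binary adjacency matrix, self-loops included
    W:   (F, F') shared weight matrix
    a:   (2 * F',) attention vector
    """
    Wh = h @ W                                  # Step 1: linear transformation
    F_prime = Wh.shape[1]
    # Step 2: e_ij = LeakyReLU(a . [Wh_i || Wh_j]) for every pair, using the
    # decomposition a . [x || y] = a_src . x + a_dst . y
    f_src = Wh @ a[:F_prime]                    # (N,)
    f_dst = Wh @ a[F_prime:]                    # (N,)
    e = leaky_relu(f_src[:, None] + f_dst[None, :])
    # Step 3: softmax over each neighbourhood (non-edges masked out)
    e = np.where(adj > 0, e, -1e9)
    e = e - e.max(axis=1, keepdims=True)
    alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)
    # Step 4: weighted aggregation followed by a non-linearity
    return np.tanh(alpha @ Wh)
```

Stacking several such layers, with multi-head attention as in Step 5, yields the full GAT.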