Source: Deep Learning on Medium
An Introduction to Graph Attention Networks
Graph learning is becoming increasingly relevant, as a significant amount of real-world data can be modelled as graphs.
The Graph Attention Network, or GAT, is a non-spectral learning method that operates directly on the spatial neighbourhood of each node. This is in contrast to the spectral approach of the Graph Convolutional Network, which builds on the same foundations as the Convolutional Neural Network.
In this article, I will explain how the GAT is constructed.
The basic building block of the GAT is the Graph Attention Layer. To explain it, the following graph is used as an example.
Here h_i is the feature vector of node i, of length F.
Step 1: Linear Transformation
The first step performed by the Graph Attention Layer is to apply a shared linear transformation, parameterized by a weight matrix W, to the feature vectors of the nodes.
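As a concrete sketch, this transformation is just a matrix product; all shapes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

N, F, F_out = 4, 3, 2            # 4 nodes, input features of length F, outputs of length F_out
h = rng.normal(size=(N, F))      # node feature matrix, one row per node
W = rng.normal(size=(F, F_out))  # shared weight matrix W

Wh = h @ W                       # transformed features Wh_i, shape (N, F_out)
```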
Step 2: Computation of Attention Coefficients
Attention coefficients determine the relative importance of a neighbouring node's features to a given node. For a pair of neighbouring nodes i and j, the raw coefficient is e_ij = a(Wh_i, Wh_j), where a is a shared attention function that we choose, subject to the restriction that it maps a pair of transformed feature vectors to a single scalar: a : R^F′ × R^F′ → R.
The following figures further explain this.
As we observe in the series of images above, we first calculate the self-attention coefficient and then compute attention coefficients with all of the neighbours.
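In the GAT paper, the attention function a is a single-layer feedforward network applied to the concatenation of the two transformed feature vectors, followed by a LeakyReLU non-linearity. A minimal NumPy sketch, with illustrative shapes and values:

```python
import numpy as np

def leaky_relu(x, negative_slope=0.2):
    return np.where(x > 0, x, negative_slope * x)

rng = np.random.default_rng(0)
N, F_out = 4, 2
Wh = rng.normal(size=(N, F_out))   # transformed node features from step 1
a = rng.normal(size=(2 * F_out,))  # learnable vector parameterizing the attention function a(·,·)

def attention_coefficient(i, j):
    # e_ij = LeakyReLU(a^T [Wh_i || Wh_j]), where || is concatenation
    return leaky_relu(a @ np.concatenate([Wh[i], Wh[j]]))

e_01 = attention_coefficient(0, 1)  # raw importance of node 1's features to node 0
```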
Step 3: Normalization of Attention Coefficients
Due to the varied structure of graphs, nodes can have different numbers of neighbours. To obtain a common scale across all neighbourhoods, the attention coefficients are normalized with the softmax function: α_ij = exp(e_ij) / Σ_{k ∈ N_i} exp(e_ik). Here N_i is the neighbourhood of node i.
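The normalization is a standard softmax over each node's neighbourhood; a small sketch, with made-up raw coefficients:

```python
import numpy as np

def normalize(e_i):
    """Softmax over the raw coefficients e_ij of node i's neighbourhood."""
    exp = np.exp(e_i - e_i.max())  # subtract the max for numerical stability
    return exp / exp.sum()

alpha = normalize(np.array([1.0, 2.0, 0.5]))  # one coefficient per neighbour
# the normalized coefficients sum to 1 regardless of neighbourhood size
```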
Step 4: Computation of Final Output Features
Now we compute the learned output features of the nodes as a weighted combination over the neighbourhood: h_i′ = σ(Σ_{j ∈ N_i} α_ij · Wh_j), where σ is a non-linear activation.
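A sketch of this aggregation, using illustrative coefficients and ELU as the non-linearity (the activation used in the GAT paper):

```python
import numpy as np

rng = np.random.default_rng(0)
Wh = rng.normal(size=(3, 2))       # transformed features of node i's three neighbours
alpha = np.array([0.5, 0.3, 0.2])  # normalized attention coefficients from step 3

def sigma(x):
    # ELU non-linearity
    return np.where(x > 0, x, np.exp(x) - 1)

h_i_new = sigma(alpha @ Wh)        # h_i' = σ(Σ_j α_ij · Wh_j)
```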
Step 5: Computation of Multiple Attention Mechanisms
To improve the stability of the learning process, multi-head attention is employed: we compute K independent attention mechanisms and aggregate their learned representations, by concatenating them in intermediate layers or averaging them in the final layer. K denotes the number of independent attention heads used.
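Putting the five steps together, one multi-head Graph Attention Layer can be sketched as follows. Heads are concatenated, as in the paper's intermediate layers; the toy graph, shapes, and parameter names are illustrative:

```python
import numpy as np

def leaky_relu(x, negative_slope=0.2):
    return np.where(x > 0, x, negative_slope * x)

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gat_layer(h, adj, Ws, a_vecs):
    """One multi-head Graph Attention Layer.
    h: (N, F) node features; adj: (N, N) 0/1 adjacency matrix (self-loops included);
    Ws: list of K weight matrices (F, F'); a_vecs: list of K attention vectors (2F',)."""
    outputs = []
    for W, a in zip(Ws, a_vecs):                  # one pass per attention head
        Wh = h @ W                                # step 1: linear transformation
        h_new = np.zeros_like(Wh)
        for i in range(h.shape[0]):
            neigh = np.flatnonzero(adj[i])
            # step 2: raw coefficients e_ij for every neighbour j
            e = np.array([leaky_relu(a @ np.concatenate([Wh[i], Wh[j]])) for j in neigh])
            alpha = softmax(e)                    # step 3: normalization
            h_new[i] = alpha @ Wh[neigh]          # step 4: weighted aggregation
        outputs.append(elu(h_new))                # per-head non-linearity
    return np.concatenate(outputs, axis=1)        # step 5: concatenate the K heads

rng = np.random.default_rng(0)
N, F, F_out, K = 4, 3, 2, 3
adj = np.ones((N, N))                             # toy fully-connected graph with self-loops
h = rng.normal(size=(N, F))
Ws = [rng.normal(size=(F, F_out)) for _ in range(K)]
a_vecs = [rng.normal(size=(2 * F_out,)) for _ in range(K)]
out = gat_layer(h, adj, Ws, a_vecs)               # output shape (N, K * F_out)
```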