Text Classification using Graph Convolutional Networks

Original article was published by Berk Sudan on Deep Learning on Medium


Text Classification using Graph Convolutional Networks

> Author: Berk Sudan

Hello! I will talk about our AI project which we presented at Inzva AI Projects #4. Thanks to my teammate Güneykan Özgül, the coordinator of the program Almira Bağlar, our technical guide Uras Mutlu, and all Inzva Community for their precious contribution to the project.

Introduction: Text Classification & Representation

A Well-Known Problem: Text Classification
Text classification is one of the very famous and fundamental problem in Natural Language Processing. Various methods have been developed and seemingly there will be much more of them in future years.

Text Classification, basically, is predicting the label (or class) of some text using statistical and/or learning-based methods. The word text may sound like a document, but it does not have to be a document all the time. It can be also a syllable, a word, a sentence, a paragraph, a passage. Some very common applications of Text Classification are listed below:

  • Document Organization
  • News Filtering
  • Spam Detection
  • Opinion Mining

Text Representation
Text Representation is an essential intermediate step for Text Classification. Since the computers don’t know what letters and words are, all texts must be represented as a form of numbers to be understood and processed correctly. Accordingly, some traditional methods use hand-crafted features such as sparse lexical features to tackle this problem, e.g., Bag of Words, N-grams.

Deep Learning models such as LSTM (Hochreiter and Schmidhuber 1997) and CNNs (Kim 2014) are widely used in learning text representations. However, they have some pros and cons. They can capture semantic and syntactic information in local consecutive word sequences well, but may ignore global word co-occurrence in a corpus which carries non-consecutive and long-distance semantics (Peng et al. 2018).

So, in order to alleviate the cons of other deep learning models, Graph Neural Networks shows up! GNNs are effective when tasks thought to have rich relational structure and also can preserve global structure information of a graph in graph embeddings.

Theory: Graph Convolutional Networks & Graph Attentional Layers

Graph Convolutional Networks
Graph Convolutional Network is sister of Graph Neural Networks and contains additional graph convolutional layer in neural network similar to Convolutional Neural Networks.

Graph Convolutional Network on a text classification task first introduced in a paper called “Graph Convolutional Networks for Text Classification” (Liang Yao, Chengsheng Mao, Yuan Luo, 2019). As paper claims that GCN can work directly on a graphical representation of a document and induces embedding vectors of nodes based on properties of their neighborhoods.

Although GCNs are similar to CNNs, there are a few differences. But first, let’s review what convolution is, briefly. Convolution is, simply, a mixing operation among data points which have spatial or temporal localities. It means, a data point is not independent of all other data points all the time, instead it has bounds with the ones which are the nearest in terms of space or the closest in terms of time. In conventional CNNs, almost always, fixed-size kernels are used. Conventional CNNs are mostly popular in:

  • 1-dimensional time-series data (e.g., sound)
  • 2-dimensional data with spatial locality (e.g., image)
  • 3-dimensional data with spatial locality (e.g., Magnetic Resonance Imaging data)

It’s a difficult task to apply convolutional operation to Graph Convolutional Layers. Graph Convolutional Layers cannot use fixed-size kernels by nature, since they have no fixed amount of neighbors. Nodes in GCNs have no global locations, but relative locations. So the absolute location of a node cannot be defined directly, instead, locations of all nodes can be defined by their neighbor nodes (vertices) in graph network. Additionally, we cannot perfectly represent a graph with a tensor (vectors, matrixs, etc.).

Graphs are abstract data structures and can be represented with Adjacency Matrix which shows if two nodes has a connection or not. If graph is unweighted then the cell in adjacency matrix will be 0 in case of no connection, otherwise it will be 1. Three examples of adjacency matrix are shown below:

Locality Problem of GCN [Source]

When there are few nodes, such as 4, matrix does not happen to be sparse. But as the amount of nodes grows, matrix is getting sparser and sparser. It triggers the fundamental sparsity problem. Also, in order to apply convolution operation to a matrix (or a tensor), its columns and rows should not be interchangeable. In adjacency matrix all rows can be changed with each other and all columns can be changed with each other, yet the graph structure stays the same. So, we cannot find patterns directly on the adjacency matrix, since the 1’s which are close to each other mean anything special.

For multi-layer Graph Convolutional Network (GCN), Kipf and Welling introduced the following layer-wise propagation rule:

GCN Convolution Formula [Source]

The formula derived from first-order approximation of spectral graph convolutions with Fourier Transform. First part of the equation normalizes the adjacency matrix with its Degree Matrix. Degree Matrix is a diagonal matrix which shows the number of neighbors of each node. With help of this normalization, number of neighbors cannot be problematic for unfixed-size kernels. In second part, we multiply node properties (or node features) and the weight which comes from the previous layer. And, finally, we apply a non-linear activation function which is generally Leaky ReLu.

Graph Attentional Layer
GAT (Graph Attentional Layer) is subsidiary of the project along with Graph Convolutional Layer. Since we intended to boost the performance of GCN’s, we stumbled upon Graph Attentional Layers and Graph Attention Networks. Graph Attention Networks, first introduced by Velickovic in 2017 as a novel highly effective neural network architecture. It enables specifying different weights to different nodes in a neighbourhood in a graph. In this way, accuracy can be improved. Graph Attentional Layer uses an attention mechanism similar to the other attention mechanisms, but the difference is that this attention mechanism is used in graph neural networks. The basic idea is updating node features with attention heads and vectors.

Graph Attentional Layer Formula [Source]

Graph Attentional Layer has an attention vector a which is shared across the network. It’s a learnable parameter just like the weights. So that we can obtain self-attention heads for each 2 nodes which are neighbors. At the end, we get updated node features for every nodes.

Implemented Methods: Text-GCN and GAT

Text-GCN Model
We’ve implemented a Text-GCN model to learn word and document embeddings simultaneously. Because we want to demonstrate all structure well, we’ve constructed a single large graph from an entire corpus, which contains both words and documents as nodes. Our Text Graph Convolutional Network contains 2-layer, so that it can allow message passing among nodes that are at maximum two steps away.

There are no direct document-document edges in our graph and the two-layer Text-GCN allows the information exchange between pairs of documents. The second layer node (word/doc) embeddings have the same size as the labels set and are fed into a softmax classifier. A representation of our graph structure is shown below:

In this figure, doc₁ contains both word₁ and word₂ while doc₂ contains word₂ only. But there is no direct edge between doc₁ and doc₂. Edge between two word nodes is built using PMI (Pointwise Mutual Information), edge between a word node and document node is built using TF-IDF (Term Frequency — Inverse Document Frequency). PMI metric depicts word co-occurrence information while TF-IDF metric depicts word frequency and word’s document frequency.

GAT Model
GAT model operates on graph-structured data by leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions (or their approximations). It stacks layers in which nodes are able to attend over their neighborhoods’ features. And it enables (implicitly) specifying different weights to different nodes in a neighborhood without requiring any costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. We implemented a GAT model exact same as the paper suggested.

Evaluation and Results

For Text-GCN Model evaluation, first we pre-processed the data and then built the graph. In pre-processing step, we cleaned the data by string cleaning and tokenizing using regular expressions. Then, we removed stop words and rare words. Lastly, we shuffled the data to reduce bias. In building graph step, we extracted word features. Then we calculated PMI and TF-IDF scores.

For GAT Model evaluation, we implemented a network to learn attention coefficients to be able to specify different weights for different nodes in neighbourhood. Since we do not have graph data for each dataset, we only have results for citation graphs (cora, citeseer and pubmed data) for now.

For Text-GCN + GAT Model evaluation, we implemented GAT architecture that works on a graph consisting of words and documents rather than only documents. It uses attention coefficients to learn graph representation on which GCN operates. It currently requires some optimization and modifications to have consistent increase in accuracy.

We had 8 datasets in total:

  1. 20NG: Newsgroups dataset, 18846 documents, 20 labels. Training-Size: 11314, Test-Size: 7532
  2. Ohsumed: Cardiovascular diseases abstracts datasets, 7400 documents, 23 labels. Training-Size: 3357, Test-Size: 4043
  3. R8: Subset of Reuters 21578 dataset, 7674 documents, 8 labels. Training-Size: 5485, Test-Size: 2189
  4. R52: Subset of Reuters 21578 dataset, 9100 documents, 52 labels. Training-Size: 6532, Test-Size: 2568
  5. MR: Movie review dataset, 10662 documents, 2 labels. Training-Size: 7108, Test-Size: 3554
  6. Cora: Citation dataset, 2708 nodes, 5429 edges, 7 classes. Training-Size: 1707, Validation-Size: 189, Test-Size: 812
  7. Citeseer: Citation dataset, 3327 nodes, 4732 edges, 6 classes. Training-Size: 2097, Validation-Size: 232, Test-Size: 998
  8. Pubmed: Citation dataset, 19717 nodes, 44338 edges, 3 classes. Training-Size: 12422, Validation-Size: 1380, Test-Size: 5915

The first 5 datasets had no citations, so we had no information about document to document relations. But last 3 datasets have citations. In Graph Attentional Networks paper, only citation datasets (cora, citeseer, pubmed) were used in evaluations. For citation datasets, result are shown below:

Evaluation Result of Citation Datasets

We improved the accuracy of Pubmed using Text-GCN. In GAT and Classic GCN there were edges between docs-docs, however in Text-GCN, we only used the edges between docs-words and words-words.

When we applied Text-GCN to other 5 datasets, we got the same results with the one in Text-GCN paper. So the results are shown below:

Evaluation Results of Non-Citation Datasets (For Text-GCN Only)

Code is available on GitHub. Don’t hesitate to contribute to the project!

References

  • [Liang Yao, Chengsheng Mao, Yuan Luo, 2018] Graph
    Convolutional Networks for Text Classification
  • [Kipf and Welling, 2017] Semi-supervised classification with
    graph convolutional networks. InICLR.
  • [Ankit Pal] Multi-label Text Classification using Attention based
    Graph Neural Network
  • [Petar Velickovic] Graph Attention Networks