Classifying Arxiv Research Papers Using LSTM

Source: Deep Learning on Medium

By Darius Atmar

(The code in this implementation closely follows Susan Li’s Multi-Class Text Classification with LSTM.)

In 2005, a paper titled “Rooter: A Methodology for the Typical Unification of Access Points and Redundancy” was accepted as a non-reviewed paper to the World Multiconference on Systemics, Cybernetics and Informatics (WMSCI). The paper was actually gibberish, generated by a program developed by a few MIT students looking to have a laugh.

Since then, a lot of less-than-serious (and truly entertaining) work has been done on paper generation, and deep learning has helped immensely. Improvements in recurrent neural net (RNN) architectures have made it increasingly easy to train text-to-text models on the vast available corpus of research papers.

Instead of reproducing a paper-generation model, I’ve elected to train a classifier to predict the topic of a research paper. My reasoning was this: while taking computer vision and deep learning courses concurrently, I was reading a lot of papers in both areas (which have a huge amount of overlap). So I decided to see if I could train a model to distinguish a computer-vision-specific research paper from a deep learning paper. Below, I outline my data collection, cleaning, and pre-processing methods, my model architecture, and my results.

I decided to build my dataset from open-access papers, using CORE’s API to make requests. I built my API requestor by adapting ronentk’s sci-paper-miner to my needs; that project was in turn adapted from CORE’s demo requestor.

Given two topics, the requestor returns the full text of 10,000 articles on each topic.

The text is then cleaned to remove non-ASCII characters and pickled for future use, to avoid repeating the (long) API requesting process. A word cloud for each topic suggests that deep learning will be an easier class to predict than computer vision, given the frequency of the words “deep” and “learning” in those papers.

Word Cloud for Deep Learning Papers (source: wordcloud code, Aashita Kesarwani)
Word Cloud for Computer Vision Papers (source: wordcloud code, Aashita Kesarwani)
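
The cleaning and caching step described earlier can be sketched with the standard library alone. This is a minimal illustration, not the article's actual code: the function names, the sample strings, and the cache path are all assumptions made for the example.

```python
import os
import pickle
import tempfile


def strip_non_ascii(text):
    """Drop every character outside the ASCII range."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def save_articles(articles, path):
    """Clean each article and pickle the resulting list to `path`."""
    cleaned = [strip_non_ascii(a) for a in articles]
    with open(path, "wb") as f:
        pickle.dump(cleaned, f)
    return cleaned


def load_articles(path):
    """Reload previously pickled articles, skipping the API requests."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical sample data standing in for downloaded full texts.
sample = ["Convolutional networks \u2013 a r\u00e9sum\u00e9", "Plain ASCII text"]
cache_path = os.path.join(tempfile.mkdtemp(), "articles.pkl")
saved = save_articles(sample, cache_path)
restored = load_articles(cache_path)
```

Silently dropping characters is lossy (accented letters simply vanish), which is usually acceptable for a topic classifier but worth keeping in mind when inspecting individual papers later.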

The data is then piped to

First, the full text of each article needs to be tokenized before it can be used as input to our Keras model. In NLP, tokenization is the process of breaking text down into linguistically significant, methodologically useful parts called tokens. In our case, we simply treat each space-delimited word as a token, filtering out all punctuation. This results in X, a NumPy array in which each row holds the tokenized sequence for the full text of one article.
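
To make the tokenization step concrete, here is a small standard-library stand-in. It is not the Keras tooling itself; the helper names, the toy documents, and the choices of left-padding and keeping the last `seq_len` tokens are assumptions for illustration, and the result is a list of lists rather than a NumPy array.

```python
import string
from collections import Counter


def tokenize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()


def build_vocab(texts, max_words):
    """Map the most frequent words to integer ids, starting at 1 (0 is padding)."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common(max_words))}


def texts_to_matrix(texts, vocab, seq_len):
    """Encode each text as a fixed-length row of ids, left-padded with zeros."""
    rows = []
    for t in texts:
        ids = [vocab[tok] for tok in tokenize(t) if tok in vocab]
        ids = ids[-seq_len:]  # keep only the last seq_len tokens
        rows.append([0] * (seq_len - len(ids)) + ids)
    return rows


# Hypothetical toy documents, one per class.
docs = ["Deep learning, deep networks!", "Vision: kernels and filters."]
vocab = build_vocab(docs, max_words=50)
X = texts_to_matrix(docs, vocab, seq_len=6)
```

Every row has the same length regardless of how long the article was, which is what lets the rows be stacked into a single array for the model.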

We map each input sequence to a word embedding of shape (sequence_length x 100). Word embeddings represent words as dense vectors in a continuous vector space (as opposed to the sparse representations provided by bag-of-words models). We then feed this embedding to a Long Short-Term Memory (LSTM) layer, and finally to a dense layer for binary prediction. One major change from Li’s implementation was the removal of the SpatialDropout1D layer.

Fig. 1: The repeating module in an LSTM (image source: Christopher Olah)
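
The (sequence_length x 100) shape of the embedding is easiest to see as a table lookup. The sketch below uses random vectors and hypothetical token ids purely to show the shapes involved; in the real model the table is a trainable layer whose values are learned along with the rest of the network.

```python
import random

EMBED_DIM = 100  # embedding width stated in the article


def make_embedding_table(vocab_size, dim, seed=0):
    """Create one small random vector per token id (row 0 is the padding id)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]


def embed(token_ids, table):
    """Look up the vector for each id, giving a (sequence_length x dim) matrix."""
    return [table[i] for i in token_ids]


table = make_embedding_table(vocab_size=8, dim=EMBED_DIM)
sequence = [0, 0, 1, 2, 1, 3]  # hypothetical padded token ids
embedded = embed(sequence, table)
shape = (len(embedded), len(embedded[0]))
```

Note that the two occurrences of id 1 map to the same vector: the embedding depends only on the word, and it is the LSTM that accounts for position and context.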

Let’s decode these dropouts real quick. The LSTM layer uses ‘dropout’ at a rate of 0.25. This means that one quarter of the input units, x_t (see Fig. 1), are randomly zeroed out before the input transformation, with a fresh random mask drawn for each training batch rather than once per epoch. So what is ‘recurrent dropout’? Recurrent dropout, as opposed to just ‘dropout’, zeroes out one quarter of the units of the recurrent state before it is transformed. In Keras this is the hidden state h_{t-1} that is fed back into the gates at each step, not the cell state, which is the top horizontal line in Fig. 1 and acts as the long-term memory of the network. Both of these dropouts function to reduce overfitting, that is, fitting the training data so closely that the model generalizes poorly to unseen papers.
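
As a rough illustration of what a dropout rate of 0.25 does to a vector of units, here is a sketch of one common formulation, inverted dropout, in which the surviving units are rescaled so the expected sum is unchanged. The input vector and seed are made up for the example, and this is not the framework's internal code.

```python
import random


def dropout(values, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(42)
x_t = [1.0] * 10000  # hypothetical input vector
dropped = dropout(x_t, rate=0.25, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
```

With a vector this long, `zero_fraction` lands close to 0.25. Dropout is only active during training; at prediction time all units are kept.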


My results were a little bit puzzling at first glance. The model seems to overfit the training data quite quickly. Training loss drops from 0.5 to nearly 0.05 after 20 epochs, while validation loss increases from 0.5 to 0.8. Yet while training accuracy increased to 97.5% after 20 epochs, validation accuracy remained steady at around 82%. This is surprising to say the least, and in retrospect I should have applied my early stopping criterion to loss instead of accuracy, given that the loss is binary crossentropy.

Training Accuracy vs Validation Accuracy per Epoch
Training Loss vs Validation Loss per Epoch
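
Early stopping on loss, as suggested in hindsight above, boils down to a simple patience rule. The sketch below is a hypothetical standalone version of that rule, with an invented validation-loss curve that rises in the way described; the patience value is also an assumption.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop.

    Training stops once validation loss has failed to improve on its best
    value for `patience` consecutive epochs; otherwise it runs to the end.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses)


# Hypothetical validation-loss curve: a brief improvement, then a steady rise.
val_losses = [0.50, 0.48, 0.52, 0.55, 0.61, 0.70, 0.80]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
```

Monitoring accuracy with the same rule would not trigger on a curve like this if accuracy stays flat, which is why the choice of monitored quantity matters here.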

The final model’s test accuracy was 82.03%, while its loss was 0.88.

The overfitting issue starts to become clearer when we look at the prediction values on the test set. The vast majority of papers were classified correctly, with very high confidence. However, most of the incorrect classifications were also made with very high confidence.

Predictions on the test data
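
Confident mistakes explain how loss can climb while accuracy stays flat. The worked example below uses invented labels and probabilities: both sets of predictions make exactly the same single error, so accuracy is identical, but binary crossentropy penalizes the overconfident set far more heavily.

```python
import math


def binary_crossentropy(y_true, y_prob):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    eps = 1e-7  # clip to avoid log(0)
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


# Hypothetical labels and two sets of predictions with the same mistake.
y_true = [1, 1, 1, 1, 0]
hesitant = [0.9, 0.9, 0.9, 0.9, 0.6]         # last one wrong, but only just
confident = [0.99, 0.99, 0.99, 0.99, 0.99]   # last one wrong, and very sure

acc_a, loss_a = accuracy(y_true, hesitant), binary_crossentropy(y_true, hesitant)
acc_b, loss_b = accuracy(y_true, confident), binary_crossentropy(y_true, confident)
```

Here `acc_a` equals `acc_b`, while `loss_b` is several times larger than `loss_a`, because a single wrong prediction at 0.99 contributes -log(0.01) to the sum.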

It is interesting to see the types of papers that get misclassified with low confidence. One deep computer vision paper that was misclassified as deep learning had 22 occurrences of the word “learning” in it, and 31 occurrences of the word “kernel”. The former is probably what tipped the network toward predicting deep learning, while the latter is a term used more often in computer vision than in deep learning (the deep learning community has adopted the synonym “filter”).
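
Counting how often particular words appear in a misclassified paper takes only a few lines. This is a hypothetical helper with a made-up sentence as input, not the article's analysis code.

```python
import re
from collections import Counter


def keyword_counts(text, keywords):
    """Count case-insensitive whole-word occurrences of each keyword."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {k: counts[k] for k in keywords}


# Hypothetical snippet standing in for a paper's full text.
sample = "Kernel methods and kernel learning: learning a kernel."
result = keyword_counts(sample, ["learning", "kernel"])
```

Because it matches whole words, this counts "kernel" but not "kernels"; whether plurals should be merged depends on how the tokenizer treated them.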