Text-based classification with an RNN and GloVe embeddings

Original article was published on Deep Learning on Medium


Two of the most popular deep-learning architectures for text-based classification are the recurrent neural network (RNN) and long short-term memory (LSTM). We will be covering the RNN in this article.

Recurrent Neural Network

It is different from other neural networks because it keeps a memory of all the previous elements of a sequence as it moves forward. Feed-forward models, by contrast, don't keep any state between inputs.

Just as you are reading this word by word while your brain combines the words and tells you what you are reading, a recurrent neural network (RNN) works on the same principle: it processes a sequence by iterating through its elements and maintaining a state containing information about what it has seen so far.

A pseudo code for RNN
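The loop above can be sketched in plain NumPy (the dimensions below are arbitrary illustrative choices, and the weight matrices are random rather than learned):

```python
import numpy as np

timesteps = 100       # number of timesteps in the input sequence
input_features = 32   # dimensionality of each input element
output_features = 64  # dimensionality of the RNN state

inputs = np.random.random((timesteps, input_features))  # dummy input sequence
state_t = np.zeros((output_features,))                  # initial state: all zeros

# Random weight matrices; a real network would learn these
W = np.random.random((output_features, input_features))
U = np.random.random((output_features, output_features))
b = np.random.random((output_features,))

successive_outputs = []
for input_t in inputs:
    # Combine the current input with the previous state
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    successive_outputs.append(output_t)
    state_t = output_t  # this output becomes the state for the next step

final_output_sequence = np.stack(successive_outputs, axis=0)
```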

An RNN processes the sequence one step at a time, moving from one state to the next, with each state carrying the memory of the previous one.

Implementation of RNN

Keras has a layer for implementing a simple RNN: SimpleRNN (from keras.layers import SimpleRNN). But before that we need data and the GloVe embeddings. We will be using the IMDB dataset from Keras.

Brief about the dataset from the Keras documentation:

“This is a dataset of 25,000 movies reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a list of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer ‘3’ encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: ‘only consider the top 10,000 most common words.’”

Loading the dataset
Extracting the word_index for further processing
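These two steps can be sketched as follows (I use the tensorflow.keras namespace here; the standalone keras import works the same way):

```python
from tensorflow.keras.datasets import imdb

max_features = 10000  # keep only the top 10,000 most frequent words

# Each review arrives pre-encoded as a list of word indices
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# word_index maps each word to its integer index
word_index = imdb.get_word_index()
```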

Let's now recover the actual review text for a better understanding. For that we need a new dictionary like the one above but reversed, where the indices are the keys and the words are the values.

First I created a new dictionary with indices as keys and words as values. Then, to recover the reviews, we used a list comprehension with “.get(idx-3,_)”, because, as the documentation states, the indices are offset by 3: the keys 0, 1 and 2 are reserved, with 0 -> “padding”, 1 -> “start of sequence” and 2 -> “unknown”. So we have to shift every index back by 3.
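A sketch of the decoding step (the “?” placeholder for reserved/unknown indices is my choice):

```python
from tensorflow.keras.datasets import imdb

(x_train, _), _ = imdb.load_data(num_words=10000)
word_index = imdb.get_word_index()

# Reverse the mapping: indices become keys, words become values
reverse_word_index = {idx: word for word, idx in word_index.items()}

# Shift every index back by 3 because 0, 1 and 2 are reserved
# for "padding", "start of sequence" and "unknown"
decoded_review = " ".join(
    reverse_word_index.get(idx - 3, "?") for idx in x_train[0]
)
```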

Now let's get to the GloVe embeddings, released by the Stanford NLP group; we will be using the 100-dimensional version.
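Parsing the glove.6B.100d.txt file (one word per line, followed by its 100 weights) into a dictionary can be sketched like this; the helper name is mine:

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file into a dict: word -> embedding vector."""
    embeddings_index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            values = line.split()
            word = values[0]                                # first token is the word
            coefs = np.asarray(values[1:], dtype="float32") # the rest are its weights
            embeddings_index[word] = coefs
    return embeddings_index

# e.g. embeddings_index = load_glove("glove.6B.100d.txt")
```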

This embeddings dictionary now holds the full vocabulary of the GloVe 6B model (trained on 6 billion tokens), each word with its 100 weights, but we don't need all of it; we will only use the words that appear in our IMDB dataset.

We created a 2-D matrix of zeros with 10,000 rows (the number of features) and 100 columns (the embedding dimension), then filled each row with the GloVe weights of the corresponding word.
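A sketch of that step as a small helper (the function name is mine); words missing from GloVe simply keep their all-zero row:

```python
import numpy as np

def build_embedding_matrix(word_index, embeddings_index,
                           max_features=10000, embedding_dim=100):
    """Row i holds the GloVe vector of the word whose index is i."""
    embedding_matrix = np.zeros((max_features, embedding_dim))
    for word, idx in word_index.items():
        if idx < max_features:                 # ignore words outside the top 10,000
            vector = embeddings_index.get(word)
            if vector is not None:             # word is in the GloVe vocabulary
                embedding_matrix[idx] = vector
    return embedding_matrix
```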

Processing the data set

We used padding so that each review has the same dimension: if a review is shorter than 1,000 indices we pad it with 0s up to 1,000, and if it is longer we drop everything after that.
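Keras does this with pad_sequences; a minimal sketch using toy reviews in place of the real IMDB data:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = 1000  # every review becomes exactly 1,000 indices long

# Toy reviews of different lengths stand in for the IMDB data
reviews = [[5, 25, 43], [7] * 1200]

# Short reviews are padded with 0s; long ones are truncated to max_len
padded = pad_sequences(reviews, maxlen=max_len)
```

By default pad_sequences pads and truncates at the front of the sequence, keeping the end of a long review.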

Model Architecture

We will be using the embedding we just created specifically for our dataset, feeding the model with max_features of 10,000, 100 as the dimension of every word's weight vector, and an input length of 1,000. We used a SimpleRNN layer with 32 output units, followed by a Dense layer with a sigmoid activation, because this is a binary classification problem: positive/negative, either 0 or 1.

As we already have the weights for the first layer, the embedding_matrix we created specifically for our model, we load it into the Embedding layer and set trainable to False so that the model doesn't train it and just uses its weights.
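Putting these two paragraphs together, the architecture can be sketched as follows (the zero matrix below is only a shape-correct stand-in for the GloVe embedding_matrix built earlier):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

max_features = 10000  # vocabulary size
embedding_dim = 100   # dimension of every word's weight vector
max_len = 1000        # input length after padding

# Stand-in with the right shape; use the GloVe embedding_matrix here
embedding_matrix = np.zeros((max_features, embedding_dim))

model = Sequential()
model.add(Embedding(max_features, embedding_dim))
model.add(SimpleRNN(32))                    # 32 output units
model.add(Dense(1, activation="sigmoid"))   # binary: positive / negative

model.build(input_shape=(None, max_len))    # allocate the layer weights

# Load our precomputed weights and freeze the layer so that
# training does not update them
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False
```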

Training the model
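A runnable sketch of the compile-and-fit step; the tiny random arrays stand in for the padded IMDB reviews and their labels, and the real run uses 10 epochs on the full data:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

# Synthetic stand-ins so the sketch runs on its own
x = np.random.randint(0, 1000, size=(64, 50))
y = np.random.randint(0, 2, size=(64,))

model = Sequential([
    Embedding(1000, 100),
    SimpleRNN(32),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# validation_split=0.2 holds out the last 20% of samples for validation
history = model.fit(x, y, epochs=2, batch_size=32,
                    validation_split=0.2, verbose=0)
```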

Final Outcome

After running 10 epochs we got a training accuracy of roughly 70% and a validation accuracy of roughly 59%.

Final Thoughts

The accuracy is not great, and we can improve it in a few ways: for starters we could use an LSTM or GRU, or a stacked RNN; we could also reduce max_len, since complexity grows as the number of words, and with it the number of features, increases.

Stacked RNN