Sentiment Analysis Using an LSTM Network

Sentiment analysis is a natural language processing problem in which the underlying text is analysed to determine whether a piece of writing is positive, negative or neutral. It is also known as opinion mining.

Why is sentiment analysis so important?

It can be used in a variety of ways:

  • Businesses can use it to shape their strategies based on consumer sentiment, and to understand how customers feel about a particular brand or product.
  • Used properly, it can be an effective tool for predicting public sentiment.

LSTM with Keras

I will be using Keras to implement the LSTM. Keras is a high-level wrapper around TensorFlow and can run on either a CPU or a GPU. GPU training is much faster, but requires the Nvidia CUDA toolkit along with cuDNN.

Look at the Data

Below is a snapshot of the data:

First few rows

As seen above, the only columns we need are label and tweet. The tweet column contains the text of the tweet, and label indicates whether the tweet is positive or negative.
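As a rough sketch of this step (the file name train.csv is an assumption; the post does not name the file), the data can be loaded and inspected with pandas:

    import pandas as pd

    # Load the training data; the file name and path are assumptions
    train = pd.read_csv('train.csv')

    # Inspect the first few rows and the two columns we care about
    print(train[['label', 'tweet']].head())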

Data Preparation

Before building the neural network, the data has to be converted to numbers so that the network can learn from it. This can be done in many ways, but the easiest is to use the Keras Tokenizer.

The data is stored in a train variable and is first copied to a data variable. After this has been done, some of the special characters have to be removed.

Let’s start off by importing some important libraries.
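A minimal sketch of the imports and the cleaning step described above; the post does not show the exact regex, so the one below is an assumption:

    import re
    import numpy as np
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense

    # Copy the train DataFrame into a working data variable
    data = train.copy()

    # Remove special characters, keeping only letters and spaces
    # (this particular regex is an assumption)
    data['tweet'] = data['tweet'].apply(lambda t: re.sub(r'[^a-zA-Z\s]', '', str(t)).lower())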

Now comes an important part: word embedding. Whenever text is involved, each word has to be mapped to a real-valued vector, known as a word embedding. This is a technique where words are encoded as real-valued vectors in a high-dimensional space. Fortunately, Keras has an easy and convenient way to do this through its Embedding layer, which will be the first layer of the network. We will use 128-length vectors to represent each word, keep only the top 2000 words by frequency, and limit each tweet to 32 words.
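These choices can be summarised as a few constants (the variable names are my own):

    # Hyper-parameters described above
    top_words = 2000   # keep only the 2000 most frequent words
    max_len = 32       # limit each tweet to 32 words
    embed_dim = 128    # length of each word-embedding vector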

Let’s have a look at the cleaned data.

Now let’s fit the Tokenizer on the data and use texts_to_sequences to transform the tweet column.
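A minimal sketch of this step, reusing the constants above:

    # Fit the tokenizer on the cleaned tweets, keeping the top 2000 words
    tokenizer = Tokenizer(num_words=top_words)
    tokenizer.fit_on_texts(data['tweet'])

    # Convert each tweet into a sequence of integer word indices
    X = tokenizer.texts_to_sequences(data['tweet'])
    y = data['label'].values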

The transformed data should resemble this.

Transformed Data

As seen above, the first tweet has 14 tokens and the third has 16. Sequences of unequal length cause problems when training a neural network, so all the tweets must be truncated or padded with zeros to a common length.
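With Keras this is a one-liner; pad_sequences truncates longer sequences and zero-pads shorter ones:

    # Pad or truncate every sequence to exactly 32 tokens
    X = pad_sequences(X, maxlen=max_len)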

Tweets padded to a length of 32

Building the LSTM network

The first layer of the network is the Embedding layer, which represents each word as a 128-length vector. The next layer is the LSTM layer with 200 memory units (this can be increased if necessary). The final layer is a single Dense node with a sigmoid activation, since this is a binary classification problem.
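A sketch of this architecture; the layer sizes follow the description above, while the remaining arguments are assumptions:

    model = Sequential()
    # Embedding layer: maps each of the top 2000 word indices to a 128-length vector
    model.add(Embedding(input_dim=top_words, output_dim=embed_dim, input_length=max_len))
    # LSTM layer with 200 memory units
    model.add(LSTM(200))
    # Single sigmoid output node for binary classification
    model.add(Dense(1, activation='sigmoid'))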

The model will use adam as the optimizer.

Next up is fitting the model to the training data. To avoid overfitting, I have used a validation set to monitor the model’s performance.
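A rough sketch of compiling and fitting the model; the 20% validation split and batch size are assumptions, and three epochs matches the results discussed below:

    # Compile with the adam optimizer; binary cross-entropy suits the binary labels
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()

    # Hold out part of the data as a validation set to watch for overfitting
    model.fit(X, y, validation_split=0.2, epochs=3, batch_size=64)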

Model Fitting
Model Summary
Model Results

The validation accuracy reaches 94.66% at the end of the third epoch, which is quite good considering that training took only 5 minutes.

Conclusion

The accuracy of the model can be further improved by using more LSTM memory units, though training will be slower. Training on a GPU will be much faster than on a CPU, and CNNs train much faster than LSTMs. Finally, overfitting can be reduced by adding dropout.
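For example, dropout can be added both inside the LSTM layer and after it; the 0.2 rates below are assumptions, not values from the post:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

    # Same architecture as before, with dropout added to reduce overfitting
    model = Sequential([
        Embedding(input_dim=2000, output_dim=128, input_length=32),
        LSTM(200, dropout=0.2, recurrent_dropout=0.2),
        Dropout(0.2),
        Dense(1, activation='sigmoid'),
    ])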

Source: Deep Learning on Medium