Sentiment analysis for text with Deep Learning

Source: Deep Learning on Medium


I started working on an NLP project with Twitter data, and one of the project goals was sentiment classification for each tweet. However, when I explored the available resources, such as the NLTK sentiment classifier and other Python libraries, I was disappointed by the performance of these models. At most I would get about 60% to 70% accuracy on binary classification tasks (i.e. only a positive or a negative class).

Hence I started researching ways to improve my model's performance. One of the obvious choices was to build a deep learning based sentiment classification model.

I am writing this blog post to share my experience of building a deep learning model for sentiment classification, and I hope you find it useful. The link to the code repository can be found here.

I have designed the model to provide a sentiment score between 0 and 1, with 0 being very negative and 1 being very positive. This was done by building a multi-class classification model with 10 classes, one for each decile.
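
To make the decile idea concrete, here is a minimal sketch of how a score between 0 and 1 could be turned into one of 10 classes and then into a one-hot label. The function names are my own for illustration and are not taken from the repository; class indices here run from 0 to 9.

```python
import numpy as np

NUM_CLASSES = 10  # one class per decile


def score_to_class(score: float) -> int:
    """Map a sentiment score in [0, 1] to a decile class index 0..9."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    # A score of exactly 1.0 would land in bin 10, so clamp it into the last bin.
    return min(int(score * NUM_CLASSES), NUM_CLASSES - 1)


def one_hot(class_index: int) -> np.ndarray:
    """Return a length-10 one-hot vector for the given class index."""
    vec = np.zeros(NUM_CLASSES, dtype=np.float32)
    vec[class_index] = 1.0
    return vec


label = one_hot(score_to_class(0.72))  # falls in the 0.7 to 0.8 band
```

A one-hot label of this shape is what a softmax output trained with categorical cross-entropy expects.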

There are 6 major steps involved in building a deep learning model for sentiment classification:

Step 1: Get data

Step 2: Generate embeddings

Step 3: Model architecture

Step 4: Model Parameters

Step 5: Train and test the model

Step 6: Run the model

I am going to cover each of the above steps in detail below.

Step 1: Get data

Sourcing labelled data for training a deep learning model is one of the most difficult parts of building a model. Fortunately, we can use the Stanford Sentiment Treebank data for our purpose.

The data set “dictionary.txt” consists of 239,233 lines, each holding a piece of text and its index. The index is used to match each line to a sentiment score in the file “sentiment_labels.txt”. The score ranges from 0 to 1, with 0 being very negative and 1 being very positive.

The code below reads the dictionary.txt and sentiment_labels.txt files and attaches the score to each sentence. This code can be found in train/utility_function.py

import pandas as pd

def read_data(path):
    # read dictionary into df; the column name 'Phrase|Index' comes from the file's header line
    df_data_sentence = pd.read_table(path + 'dictionary.txt')
    df_data_sentence_processed = df_data_sentence['Phrase|Index'].str.split('|', expand=True)
    df_data_sentence_processed = df_data_sentence_processed.rename(columns={0: 'Phrase', 1: 'phrase_ids'})
    # read sentiment labels into df
    df_data_sentiment = pd.read_table(path + 'sentiment_labels.txt')
    df_data_sentiment_processed = df_data_sentiment['phrase ids|sentiment values'].str.split('|', expand=True)
    df_data_sentiment_processed = df_data_sentiment_processed.rename(columns={0: 'phrase_ids', 1: 'sentiment_values'})
    # combine data frames containing sentence and sentiment
    df_processed_all = df_data_sentence_processed.merge(df_data_sentiment_processed, how='inner', on='phrase_ids')
    return df_processed_all

The data is split into 3 parts:

  • train.csv : This is the main data which is used to train the model. This is 50% of the overall data.
  • val.csv : This is a validation data set to be used to ensure the model does not overfit. This is 25% of the overall data.
  • test.csv : This is used to test the accuracy of the model post training. This is 25% of the overall data.
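
As a rough sketch of how such a 50/25/25 split could be produced, the snippet below shuffles row indices and cuts them into three parts. The function name, the seed and the use of NumPy are my own assumptions for illustration; the repository's own split function may work differently.

```python
import numpy as np


def split_indices(n_rows, train_frac=0.5, val_frac=0.25, seed=42):
    """Shuffle row indices and cut them into train / val / test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_train = int(n_rows * train_frac)
    n_val = int(n_rows * val_frac)
    train_idx = shuffled[:n_train]
    val_idx = shuffled[n_train:n_train + n_val]
    test_idx = shuffled[n_train + n_val:]  # whatever remains
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(1000)
```

With a pandas DataFrame df, each part could then be written out with something like df.iloc[train_idx].to_csv("train.csv").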

Step 2: Generate embeddings

Prior to training this model we are going to convert each of the words into a word embedding. You can think of word embeddings as numerical representations of words that enable our model to learn. For more details on word embeddings please read this blog.

What are word embeddings?

They are vector representations that capture the context of a word in relation to the other words it appears with. This transformation results in words with similar meanings being clustered closer together in the vector space, and dissimilar words being positioned further apart.
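
A small sketch can show what "closer together" means in practice. The vectors below are made up for illustration (they are not real GloVe values), and cosine similarity is just one common way to measure closeness.

```python
import numpy as np

# Made-up 4-dimensional vectors, for illustration only. Real pre-trained
# vectors are learned from a large corpus and have many more dimensions.
toy_vectors = {
    "good": np.array([0.9, 0.8, 0.1, 0.0]),
    "great": np.array([0.8, 0.9, 0.2, 0.1]),
    "terrible": np.array([-0.7, -0.9, 0.1, 0.2]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_good_great = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_good_terrible = cosine_similarity(toy_vectors["good"], toy_vectors["terrible"])
```

In this toy set, "good" and "great" point in almost the same direction, while "good" and "terrible" point in nearly opposite directions.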

How are we going to convert each word into a word embedding?

We are going to use a pre-trained word embedding model known as GloVe. For our model we are going to represent each word using a 100-dimensional embedding. The detailed code for converting the data into word embeddings is in train/utility_function.py. This function basically replaces each word with its respective embedding by performing a lookup in the GloVe pre-trained vectors. An illustration of the process is shown below, where each word is converted into an embedding and fed into a neural network.
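
For readers who want a feel for the lookup, here is a minimal sketch of parsing GloVe-style text (one word per line followed by its vector values) into an embedding matrix and a word-to-row index. This is not the repository's load_embeddings function; the function name and the choice to reserve row 0 for padding and unknown words are assumptions of this sketch.

```python
import io
import numpy as np


def load_glove(lines, dim):
    """Parse GloVe-style text lines into (weight_matrix, word_idx).

    Row 0 is reserved (all zeros) for padding and unknown words; that
    convention is an assumption of this sketch.
    """
    word_idx = {}
    rows = [np.zeros(dim, dtype=np.float32)]
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        word_idx[parts[0]] = len(rows)
        rows.append(np.asarray(parts[1:], dtype=np.float32))
    return np.vstack(rows), word_idx


# Tiny stand-in for a GloVe file, with 3 dimensions instead of 100.
fake_glove = io.StringIO("the 0.1 0.2 0.3\nmovie 0.4 0.5 0.6\n")
weight_matrix, word_idx = load_glove(fake_glove, dim=3)
```

For a real file, the same function could be fed an open file handle in place of the StringIO object.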

Converting Sentences to Embedding for a Neural Network (Source)

The code below splits the data into train, val and test sets. The corresponding embeddings for the data are stored in the weight_matrix variable.

# uf is assumed to be the utility_function module, imported elsewhere in the file
def load_data_all(data_dir, all_data_path, pred_path, gloveFile, first_run, load_all):
    numClasses = 10
    # Load the full GloVe file, or a pre-filtered subset of it.
    # filtered_glove_path is not a parameter; it is expected to be defined at module level.
    if load_all:
        weight_matrix, word_idx = uf.load_embeddings(gloveFile)
    else:
        weight_matrix, word_idx = uf.load_embeddings(filtered_glove_path)
    # create test, validation and training data
    all_data = uf.read_data(all_data_path)
    train_data, test_data, dev_data = uf.training_data_split(all_data, 0.8, data_dir)
    train_data = train_data.reset_index()
    dev_data = dev_data.reset_index()
    test_data = test_data.reset_index()
    maxSeqLength, avg_words, sequence_length = uf.maxSeqLen(all_data)
    # load training data matrix
    train_x = uf.tf_data_pipeline_nltk(train_data, word_idx, weight_matrix, maxSeqLength)
    test_x = uf.tf_data_pipeline_nltk(test_data, word_idx, weight_matrix, maxSeqLength)
    val_x = uf.tf_data_pipeline_nltk(dev_data, word_idx, weight_matrix, maxSeqLength)
    # load labels data matrix
    train_y = uf.labels_matrix(train_data)
    val_y = uf.labels_matrix(dev_data)
    test_y = uf.labels_matrix(test_data)
    return train_x, train_y, test_x, test_y, val_x, val_y, weight_matrix
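
The data pipeline call above turns each sentence into a fixed-length row of word indices. The sketch below shows one way this could look; it is my own simplified stand-in (with a hypothetical name and a regex tokenizer), not the repository's tf_data_pipeline_nltk.

```python
import re
import numpy as np


def sentences_to_matrix(sentences, word_idx, max_len):
    """Turn raw sentences into a (n_sentences, max_len) matrix of word indices."""
    matrix = np.zeros((len(sentences), max_len), dtype=np.int64)
    for row, sentence in enumerate(sentences):
        tokens = re.findall(r"\w+", sentence.lower())
        # unknown words map to 0; sentences longer than max_len are truncated
        ids = [word_idx.get(tok, 0) for tok in tokens][:max_len]
        matrix[row, :len(ids)] = ids
    return matrix


toy_word_idx = {"the": 1, "movie": 2, "was": 3, "great": 4}  # hypothetical vocabulary
x = sentences_to_matrix(["The movie was great!", "Great plot"], toy_word_idx, max_len=6)
```

Here the first row becomes the indices for "the movie was great" followed by zero padding, and the unknown word "plot" in the second row is mapped to 0.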

Step 3: Model architecture

To train the model we are going to use a type of Recurrent Neural Network (RNN) known as LSTM (Long Short-Term Memory). The main advantage of this network is that it can remember the sequence of past data, i.e. the preceding words in our case, when making a decision on the sentiment of a sentence.

An RNN Network (Source)

As seen in the above picture, an RNN is basically a sequence of copies of the same cell, where the output of each cell is forwarded as input to the next. LSTM networks are essentially the same, but each cell's architecture is a bit more complex. This complexity, as seen below, allows each cell to decide which past information to remember and which to forget. If you want more information on the inner workings of an LSTM, please go to this amazing blog (the illustrations are sourced from this blog).

An LSTM Cell (Source)
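
For those who prefer code to diagrams, here is a minimal NumPy sketch of a single LSTM time step using the standard gate equations. It is only meant to show the remember/forget mechanics; the weights are random, the sizes are toy values, and Keras handles all of this internally.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4*hidden, input_dim), U: (4*hidden, hidden), b: (4*hidden,)
    The four stacked blocks are the forget, input, candidate and output parts.
    """
    hidden = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0:hidden])                 # forget gate: what to drop from memory
    i = sigmoid(z[hidden:2 * hidden])        # input gate: what new info to store
    g = np.tanh(z[2 * hidden:3 * hidden])    # candidate values
    o = sigmoid(z[3 * hidden:4 * hidden])    # output gate: what to expose
    c_t = f * c_prev + i * g                 # updated cell state
    h_t = o * np.tanh(c_t)                   # updated hidden state
    return h_t, c_t


rng = np.random.default_rng(0)
input_dim, hidden = 100, 8   # 100-d embeddings; 8 hidden units to keep the toy small
W = rng.normal(scale=0.1, size=(4 * hidden, input_dim))
U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, input_dim)):  # a "sentence" of 5 random word vectors
    h, c = lstm_step(x_t, h, c, W, U, b)
```

The final hidden state h is what a layer on top of the LSTM would see after the whole sentence has been read.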

We are going to create the network using Keras. Keras runs on top of TensorFlow and can be used to build most types of deep learning models. We are going to specify the layers of the model as below. To choose hyperparameters such as the dropout rate and the number of cells, I performed a grid search over different values and chose the combination with the best performance.
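
A grid search simply tries every combination of candidate values and keeps the best one. The sketch below illustrates the idea; the candidate values and the scoring function are placeholders of my own, not the grid actually used for this post.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and return the best one.

    `evaluate` is any callable that takes a dict of parameters and returns a
    validation score (higher is better).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; these candidate values are placeholders.
param_grid = {"lstm_units": [64, 128, 256], "dropout": [0.2, 0.5]}


def fake_evaluate(params):
    # Stand-in for "build the model, train it, return validation accuracy".
    return -abs(params["lstm_units"] - 128) - abs(params["dropout"] - 0.2)


best_params, best_score = grid_search(param_grid, fake_evaluate)
```

In a real run, fake_evaluate would be replaced by a function that trains the network with the given parameters and returns its validation accuracy, which makes the search expensive but straightforward.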

Layers:

Model Architecture

Layer 1: An embedding layer with a vector size of 100; the maximum length of each sentence is set to 56.

Layer 2: A bi-directional LSTM layer with 128 cells, into which the embedded data is fed. We add a dropout of 0.2 to prevent overfitting.

Layer 3: A dense layer with 512 units, which takes its input from the LSTM layer. A dropout of 0.5 is added here.

Layer 4: A dense layer with 10 units and softmax activation. Each unit represents a sentiment category, with class 1 representing a sentiment score between 0.0 and 0.1 and class 10 representing a sentiment score between 0.9 and 1.

Code to create an LSTM model in Keras:

# written for the Keras 2.x API used in this post
import os
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import Bidirectional
from keras.preprocessing import sequence
from keras.layers import Dropout
from keras.models import model_from_json
from keras.models import load_model

def create_model_rnn(weight_matrix, max_words, EMBEDDING_DIM):
    # create the model
    model = Sequential()
    # frozen embedding layer initialised with the pre-trained GloVe vectors
    model.add(Embedding(len(weight_matrix), EMBEDDING_DIM, weights=[weight_matrix], input_length=max_words, trainable=False))
    model.add(Bidirectional(LSTM(128, dropout=0.2, recurrent_dropout=0.2)))
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.50))
    # one output unit per sentiment band
    model.add(Dense(10, activation='softmax'))
    # Adam optimiser
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

Step 4: Model Parameters

Activation Function: I have used ReLU as the activation function. ReLU is a non-linear activation function, which helps the model capture complex relationships in the data.

Optimiser: We use the Adam optimiser, which is an adaptive learning rate optimiser.

Loss function: We will train a network to output a probability over the 10 classes using Cross-Entropy loss, also called Softmax Loss. It is very useful for multi-class classification.
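
To make these pieces less abstract, here is a small NumPy sketch of ReLU, softmax and the cross-entropy loss for a single example. The logits are made-up numbers; Keras computes all of this for us during training.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def softmax(logits):
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def categorical_cross_entropy(probs, true_class):
    """Loss for one example: negative log-probability of the true class."""
    return float(-np.log(probs[true_class] + 1e-12))


# Made-up raw outputs of the last layer for one sentence (10 classes).
logits = np.array([0.1, 0.3, 2.0, 0.5, 0.0, -1.0, 0.2, 0.1, 0.0, -0.5])
probs = softmax(logits)
loss_if_true_is_2 = categorical_cross_entropy(probs, true_class=2)
loss_if_true_is_5 = categorical_cross_entropy(probs, true_class=5)
```

The loss is small when the model puts most of its probability on the true class (class 2 here) and large when the true class received little probability (class 5 here).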

Step 5: Train and test the model

We start the training of the model by passing the train, validation and test data set into the function below:

def train_model(model, train_x, train_y, test_x, test_y, val_x, val_y, batch_size):
    # save the best model and early stopping
    # note: newer Keras versions log this metric as 'val_accuracy' instead of 'val_acc'
    saveBestModel = keras.callbacks.ModelCheckpoint('../best_weight_glove_bi_100d.hdf5', monitor='val_acc', verbose=0, save_best_only=True, save_weights_only=False, mode='auto', period=1)
    # stop once the validation loss has not improved for 3 consecutive epochs
    earlyStopping = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
    # Fit the model
    model.fit(train_x, train_y, batch_size=batch_size, epochs=25, validation_data=(val_x, val_y), callbacks=[saveBestModel, earlyStopping])
    # Final evaluation of the model on the held-out test set
    score, acc = model.evaluate(test_x, test_y, batch_size=batch_size)
    return model

I ran the training with a batch size of 500 items at a time. As you increase the batch size, the training time goes down, but more computational capacity is required. Hence it is a trade-off between computational capacity and training time.

The training is set to run for 25 epochs. One epoch would mean that the network has seen the entire training data once. As we increase the number of epochs there is a risk that the model will overfit to the training data. Hence to prevent the model from overfitting I have enabled early stopping.

Early stopping is a method that allows us to specify an arbitrarily large number of training epochs and stop training once the model's performance stops improving on a hold-out/validation dataset.
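
The logic behind early stopping with a patience of 3 can be sketched in a few lines of plain Python. This mirrors the idea only, not Keras's exact implementation, and the validation losses are made-up numbers.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once `patience` consecutive epochs fail to improve the
    best validation loss by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch  # ran out of epochs without stopping early


# Made-up validation losses for illustration.
val_losses = [1.90, 1.60, 1.45, 1.47, 1.46, 1.48, 1.30]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
```

In this toy run the best loss is reached at epoch 2, the next three epochs bring no improvement, and training stops at epoch 5 without ever seeing the last value.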

On the test set, the 10-class sentiment classification model achieves an accuracy of 48.6%. The accuracy would be much higher on a 2-class (positive or negative) data set.

Step 6: Run the model

Once the model is trained, you can save it in Keras using the code below.

# save the full model (architecture and weights) so that load_model can read it back later
model.save("/model/best_model.hdf5")

The next step is to use the trained model in real time to run predictions on new data. In order to do this you will need to transform the input data to embeddings, similar to the way we treated our training data. The function live_test below performs the required pre-processing of the data and returns the result of the trained model.

Here, to make the results of the model more robust, I take a weighted average of the top 3 sentiment bands predicted by the model. This provides a better calibration of the model results.

import numpy as np
from nltk.tokenize import RegexpTokenizer

def live_test(trained_model, data, word_idx):
    live_list = []
    live_list_np = np.zeros((56,1))
    # split the sentence into its words and remove any punctuation
    tokenizer = RegexpTokenizer(r'\w+')
    data_sample_list = tokenizer.tokenize(data)
    labels = np.array(['1','2','3','4','5','6','7','8','9','10'], dtype = "int")
    # look up the index of each word; unknown words map to 0
    data_index = np.array([word_idx[word.lower()] if word.lower() in word_idx else 0 for word in data_sample_list])
    # keep at most 56 words so the assignment into the padded array cannot overflow
    data_index_np = np.array(data_index)[:56]
    # padded with zeros of length 56 i.e maximum length
    padded_array = np.zeros(56)
    padded_array[:data_index_np.shape[0]] = data_index_np
    data_index_np_pad = padded_array.astype(int)
    live_list.append(data_index_np_pad)
    live_list_np = np.asarray(live_list)
    # get score from the model
    score = trained_model.predict(live_list_np, batch_size=1, verbose=0)
    single_score = np.round(np.argmax(score)/10, decimals=2) # maximum of the array i.e single band
    # weighted score of top 3 bands
    top_3_index = np.argsort(score)[0][-3:]
    top_3_scores = score[0][top_3_index]
    top_3_weights = top_3_scores/np.sum(top_3_scores)
    single_score_dot = np.round(np.dot(top_3_index, top_3_weights)/10, decimals = 2)
    return single_score_dot, single_score
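
To see what the top-3 weighting does, here is the same arithmetic applied to a made-up model output. The probabilities are invented for illustration and do not come from the trained model.

```python
import numpy as np

# Made-up softmax output for one sentence (10 class probabilities summing to 1).
score = np.array([[0.02, 0.05, 0.30, 0.35, 0.15, 0.05, 0.03, 0.02, 0.02, 0.01]])

single_score = np.argmax(score) / 10            # most likely band only

top_3_index = np.argsort(score)[0][-3:]         # indices of the 3 largest probabilities
top_3_scores = score[0][top_3_index]
top_3_weights = top_3_scores / np.sum(top_3_scores)
weighted_score = np.dot(top_3_index, top_3_weights) / 10
```

Here the single most likely band gives 0.3, while the weighted version gives about 0.28, because bands 2 and 4 also carry a good share of the probability and pull the score slightly towards the lower band.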

As seen in the code below, you load the saved model from its path and pass it to the live_test function together with the sample data and the word index (word_idx). The function returns the sentiment of the sample data.

# Load the best model that is saved in previous step
weight_path = '/model/best_model.hdf5'
loaded_model = load_model(weight_path)
# sample sentence
data_sample = "This blog is really interesting."
result = live_test(loaded_model,data_sample, word_idx)

Results

Let us compare the results of our deep learning model to the NLTK model by taking a sample.

LSTM Model: The sentence “Great!! it is raining today!!” carries a negative context, and our model is able to pick this up, as seen below. It gives the sentence a score of 0.34.

LSTM Model

NLTK Model: The same sentence, when analysed by the bi-gram NLTK model, is scored as positive with a score of 0.74.

NLTK Model

Hurray!! This brings us to the end of the tutorial on creating a deep learning sentiment classification model for text data.

You can download the source code from GitHub and play around with training the network on your own data. I will cover how to deploy this model at scale using Docker and an API service in a separate blog.