Sentiment Analysis using Deep Learning techniques with India Elections 2019 — A Case study

Source: Deep Learning on Medium


Motivation

One of India’s most anticipated events of 2019, the General Elections of the Lok Sabha, is knocking at our doors! The prominent parties standing for the elections, their leaders and their representatives have a busy schedule organizing campaigns and convincing people to vote. While the media is busy capturing every event, from press conferences to public gatherings, and putting it in front of the public, the public is deeply engrossed in the latest news and developments.

The phenomenal growth in real time data tracking and analyzing techniques has inspired data scientists to visualize and predict sentiments, build real-time models to predict the winners, etc.

Trust me, the most exciting part is capturing information online from all sources and predicting in real time with the highest possible accuracy. The great challenge in this scenario is maintaining accuracy while the volume of data flooding in from all sources grows every second. With these challenges in view, I decided to use a few deep learning techniques to predict moods using Twitter data.

Note that this article assumes a basic knowledge of data science and NLP (Natural Language Processing). But if you are a newcomer to this world, I have provided links throughout the article to help you out. This blog is structured like this:

  • Describe the deep learning algorithms: LSTM, Bi-directional LSTM, Bi-directional GRU and CNN.
  • Train these algorithms using a contextual election corpus as well as pre-trained word embeddings to predict sentiments towards the parties contesting the election.
  • Compare the accuracy and log loss of the different models.

Glove Pre-trained Word Embeddings

GloVe: Pre-trained Word Embeddings. Source: https://nlp.stanford.edu/projects/glove/

We started our sentiment classification with pre-trained word embeddings, which represent words as vectors. GloVe, from Stanford, builds its vectors from aggregated global word-word co-occurrence statistics of a corpus, and the resulting representations showcase linear substructures of the word vector space. Google’s Word2Vec model, by contrast, uses a neural network to predict the words close to a target word.

Since we represent each word as a vector, and a sentence (tweet) as the average of its word vectors, the natural next step is to train models on these vectors together with the labelled moods to aid classification and prediction. To that end, the word vectors are fed into different RNN models.
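
To make the averaging step concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the helper name average_vector are invented for illustration; real GloVe or Word2Vec vectors have far more dimensions, and the repository may average them differently.

```python
def average_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # No known words: fall back to a zero vector of the right size.
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy 3-dimensional embeddings, invented for this example only.
toy_embeddings = {
    "great": [0.9, 0.1, 0.0],
    "rally": [0.3, 0.5, 0.2],
    "today": [0.0, 0.3, 0.7],
}

# "unknownword" has no embedding and is simply skipped.
tweet_vector = average_vector(
    ["great", "rally", "today", "unknownword"], toy_embeddings, dim=3)
print(tweet_vector)
```

Skipping out-of-vocabulary words is one possible choice; another is to map them to a shared "unknown" vector.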

Recurrent Neural Networks

A recurrent neural network (RNN) is a class of artificial neural network in which connections between nodes form a directed graph along a sequence. RNNs are particularly known for processing sequential data such as text, time series and video, where the output at any given instant t is affected by the output at the previous instant t-1 along with the input at t.

Source : https://www.analyticsvidhya.com/blog/2017/12/introduction-to-recurrent-neural-networks/

We will see how RNN based models (LSTM, GRU, Bi-directional LSTM) perform with an external embedding which has been trained and distilled on a very large corpus of data as well as with an internal embedding, where a part of the contextual corpus has been considered for training.

Basic RNNs suffer from vanishing and exploding gradient problems, which LSTM-based networks were developed to handle.
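
The vanishing gradient problem can be seen in a few lines. The sketch below assumes a toy RNN with a single scalar hidden state and hypothetical weights (w_rec, w_in); it is not the model trained in this post. It runs the recurrence h_t = tanh(w_rec * h_(t-1) + w_in * x_t) and then multiplies the per-step derivatives to get the gradient of the last state with respect to the first.

```python
import math


def rnn_forward(inputs, w_rec, w_in, h0=0.0):
    """Run a scalar RNN: h_t = tanh(w_rec * h_{t-1} + w_in * x_t)."""
    states = [h0]
    for x in inputs:
        states.append(math.tanh(w_rec * states[-1] + w_in * x))
    return states


def gradient_wrt_initial_state(states, w_rec):
    """d h_T / d h_0 is the product over t of w_rec * (1 - h_t^2)."""
    grad = 1.0
    for h in states[1:]:
        grad *= w_rec * (1.0 - h * h)
    return grad


inputs = [0.5] * 50
states = rnn_forward(inputs, w_rec=0.5, w_in=1.0)
grad_short = gradient_wrt_initial_state(states[:6], w_rec=0.5)  # 5 time steps
grad_long = gradient_wrt_initial_state(states, w_rec=0.5)       # 50 time steps
print(grad_short, grad_long)
```

Every factor in the product is smaller than 1 here, so the gradient shrinks geometrically with the sequence length. When the factors are larger than 1 the same product grows instead, which is the exploding gradient case.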

Auto-encoder

The auto-encoders used here are RNN-based models, known for compressing a relatively long sequence into a limited, fixed-size, dense vector. They are well known for classifying textual sentiment, and hence are used here to train on and predict mood categories for election tweets.

An auto-encoder attempts to copy its input to its output through an encoder and decoder architecture. The dimension of the middle-hidden layer is lower than that of the input data. Thus, the neural network is designed to represent the input in a smart and compact way in order to reconstruct it successfully.

The auto-encoders used here follow a simple Sequence2Sequence architecture built from an input layer followed by an encoding LSTM layer, an embedding layer, a decoding LSTM layer and a softmax layer. Both the input and the output of the entire architecture are vectorized representations of the tweets and their labelled sentiments. Finally, the output of the LSTM is passed through a softmax activation to represent the sentiment category.

Auto-Encoder Source : https://www.eurekalert.org/multimedia/pub/129766.php
Auto-Encoder Training with Pre-trained Glove

LSTM

LSTMs, a kind of recurrent neural network, possess internal contextual state cells that act as long-term or short-term memory cells. LSTMs solve many problems of vanilla recurrent neural networks:

  • They help preserve a constant error as it is back-propagated through time and layers, so the network can keep learning over many time steps.
  • They contain gated cells that control the flow of information. The gates are responsible for reading, writing and storing information: the input gate decides how much new information is written to the cell state, the output gate determines how much of the cell state is passed on to the next layer, and the forget gate decides how much of the existing memory can be forgotten.
  • Gates in LSTMs are analog, taking values between 0 and 1 through sigmoid activation functions. This keeps the gates differentiable, so back-propagation can pass through multiple bounded nonlinearities.
  • They mitigate the vanishing gradient problem by keeping the gradients steep enough, which keeps training relatively short and accuracy high.
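
The gates described above can be sketched as a single time step of a scalar LSTM cell. This is a simplified illustration in plain Python with hypothetical weight names (wf, uf, bf and so on), not the Keras implementation, which works on vectors and matrices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell; p holds hypothetical weights."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate values
    c = f * c_prev + i * g   # keep part of the old memory, write part of the new
    h = o * math.tanh(c)     # expose part of the memory as the hidden state
    return h, c


params = {name: 0.0 for name in
          ["wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg"]}
params["bf"] = 5.0  # bias the forget gate towards 1 so the cell remembers

h, c = 0.0, 1.0
for x in [0.2, -0.4, 0.1]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

With the forget gate close to 1, the cell state is carried across time steps almost unchanged, which is how an LSTM keeps information (and gradients) alive over long sequences.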

The figure below shows how word embeddings feed an input sentence to an LSTM. The LSTM layer takes the previous hidden state into consideration to extract the key feature vectors that determine the sentiment of the sentence.

The source code below shows how to build a word embedding followed by a single hidden LSTM layer of 128 neurons, and classify tweets into predefined classes using a “softmax” classifier and the “Adam” optimizer. Source code available at https://github.com/sharmi1206/elections-2019

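#fileName classifyw2veclstm.py
from keras.layers import Dense, LSTM
from keras.models import Model

NO_CLASSES = 8
# embedding_layer (an Embedding layer) and sequence_input (an Input tensor)
# must be defined before this point in the script.
embedded_sequences = embedding_layer(sequence_input)
# Single hidden LSTM layer with 128 units
l_lstm = LSTM(128)(embedded_sequences)
# Softmax over the 8 mood classes
preds = Dense(NO_CLASSES, activation='softmax')(l_lstm)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])
model.summary()
# Keras 2 names this argument epochs (formerly nb_epoch)
model.fit(x_train, y_train,
          epochs=15, batch_size=64)
output_test = model.evaluate(x_test, y_test, verbose=0)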
Model Summary with single Layer LSTM

GRU

A GRU is a slightly modified version of the LSTM that adaptively captures the dependencies between time instances.

  • It has no separate memory unit like the LSTM, so it cannot control the flow of information as finely as the LSTM unit does.
  • It works with a “reset” gate and an “update” gate. The reset gate sits between the previous activation and the next candidate activation and allows the unit to forget the previous state. The update gate decides how much of the candidate activation to use in updating the cell state.
  • It has fewer parameters and thus may train a bit faster or need less data to generalize.
  • It can fall short of the LSTM on larger datasets, where LSTMs have been shown to perform better.
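
The reset and update gates can be sketched as one step of a scalar GRU. The weight names are hypothetical. The blending convention used here (the update gate z weights the candidate) follows the description above; some libraries write the same idea with z and 1 - z swapped.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(x, h_prev, p):
    """One time step of a scalar GRU; p holds hypothetical weights."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    # The reset gate scales how much of the previous state reaches the candidate.
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # The update gate blends the previous state with the candidate.
    return (1.0 - z) * h_prev + z * h_cand


zero = {name: 0.0 for name in
        ["wz", "uz", "bz", "wr", "ur", "br", "wh", "uh", "bh"]}
keep = dict(zero, bz=-20.0)  # update gate close to 0: previous state carried over

print(gru_step(1.0, 0.8, zero), gru_step(1.0, 0.8, keep))
```

Note that there is only one state (h) and no output gate, which is where the GRU saves parameters compared with the LSTM.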

The source code below shows how to build a GRU with a single hidden layer and classify tweets using “softmax” classifier and “Adam” optimizer.

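#fileName classifyw2veclstm.py at https://github.com/sharmi1206/elections-2019
from keras.layers import Dense, GRU
from keras.models import Model

NO_CLASSES = 8
# embedding_layer (an Embedding layer) and sequence_input (an Input tensor)
# must be defined before this point in the script.
embedded_sequences = embedding_layer(sequence_input)
# Single hidden GRU layer with 128 units
l_gru = GRU(128)(embedded_sequences)
# Softmax over the 8 mood classes
preds = Dense(NO_CLASSES, activation='softmax')(l_gru)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])
model.summary()
# Keras 2 names this argument epochs (formerly nb_epoch)
model.fit(x_train, y_train,
          epochs=15, batch_size=64)
output_test = model.evaluate(x_test, y_test, verbose=0)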
Model Summary with single Layer GRU

Bi-directional LSTM

Bidirectional Recurrent Neural Networks (BRNNs) connect two hidden layers of opposite directions to the same output, thus increasing the amount of input information available to the network. This architecture allows the output layer to get information from past (backward) and future (forward) states simultaneously.

BRNNs have been used to analyze public sentiment towards elections: the election context is fed in as input, and a BRNN performs better when the words preceding and following the most polarized word are taken into consideration from both directions. A BRNN aims to:

  • Divide the neurons of a regular RNN into two directions, one for the positive time direction (forward states) and another for the negative time direction (backward states). This allows information from both the past and the future of the current time frame to be included.
  • Keep the outputs of the two states disconnected from the inputs of the opposite-direction states.
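
As a rough sketch of the two points above, the snippet below runs one toy scalar RNN left to right and a second one right to left, then pairs the two states for each position. The weights are hypothetical, and a real Bi-directional LSTM or GRU replaces the tanh recurrence with gated cells.

```python
import math


def run_rnn(inputs, w_rec, w_in):
    """Return the hidden state after each input of a scalar tanh RNN."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_rec * h + w_in * x)
        states.append(h)
    return states


def bidirectional(inputs):
    # Each direction has its own (hypothetical) weights and never sees the other.
    forward = run_rnn(inputs, w_rec=0.5, w_in=1.0)
    backward = run_rnn(inputs[::-1], w_rec=0.3, w_in=0.8)[::-1]
    # Pair the two states that describe the same position in the sequence.
    return list(zip(forward, backward))


states = bidirectional([0.1, -0.3, 0.8, 0.2])
changed = bidirectional([0.1, -0.3, 0.8, -0.9])  # only the last input differs
print(states[0], changed[0])
```

Changing the last input leaves the forward state at the first position untouched but alters the backward state there, which is exactly the "future" context a one-directional RNN cannot see.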

BRNNs can be trained using algorithms similar to those for RNNs, because the neurons of the two directions do not interact with each other. The training involves three steps: forward pass, backward pass and weight updates:

  • For forward pass, forward states and backward states are passed first, then output neurons are passed.
  • For backward pass, output neurons are passed first, then forward states and backward states are passed next.
  • After forward and backward passes are done, the weights are updated.
Bi-directional LSTM model summary

Convolutional Neural Networks (CNN)

The CNN used for sentiment prediction with pre-trained word embeddings is composed of a 1D convolution layer with 128 filters followed by a 1D global max pooling layer. The 1D convolution layer performs convolutions (feature mapping) over the ordered embedded word vectors in a sentence using a filter size of 5, sliding over 5 words at a time.
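
To show what one filter of size 5 followed by global max pooling computes, here is a small sketch in plain Python. The two-dimensional word vectors and the filter weights are invented for illustration; a trained model learns the weights of all 128 filters.

```python
def conv1d_valid(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence of word vectors (no padding, stride 1)."""
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        total = bias
        for offset in range(k):
            word_vec, weights = sequence[start + offset], kernel[offset]
            total += sum(w * v for w, v in zip(weights, word_vec))
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs


def global_max_pool(feature_map):
    """Keep only the strongest response of the filter over the whole sentence."""
    return max(feature_map)


# 7 words, each a toy 2-dimensional embedding (values invented for illustration).
sentence = [[1, 0], [0, 1], [1, 1], [0, 0], [2, 1], [1, 0], [0, 2]]
# One filter spanning 5 words.
kernel = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0]]

feature_map = conv1d_valid(sentence, kernel)
print(feature_map, global_max_pool(feature_map))
```

A sentence of 7 words gives 7 - 5 + 1 = 3 window positions, and pooling reduces them to a single number per filter, so the output size no longer depends on the tweet length.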

Single layer CNN with 128 filters
#fileName classifygloveattlstm.py at https://github.com/sharmi1206/elections-2019
from keras import layers
from keras.layers import Dense
from keras.models import Sequential

model = Sequential()
# Embedding layer initialised with the pre-trained matrix and fine-tuned during training
model.add(layers.Embedding(len(word_index) + 1,
                           EMBEDDING_DIM,
                           weights=[embedding_matrix],
                           input_length=MAX_SEQUENCE_LENGTH,
                           trainable=True))
# 128 filters, each spanning 5 consecutive words
model.add(layers.Conv1D(128, 5, activation='relu'))
# Keep the strongest response of each filter over the whole tweet
model.add(layers.GlobalMaxPooling1D())
model.add(Dense(8, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])
model.summary()
# Keras 2 names this argument epochs (formerly nb_epoch)
history = model.fit(x_train, y_train,
                    epochs=15, batch_size=64,
                    validation_data=(x_test, y_test))

loss, accuracy = model.evaluate(x_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(x_test, y_test, verbose=False)
print("Testing Accuracy: {:.4f}".format(accuracy))

LSTM, Bi-directional LSTM, Bi-directional GRU with Attention Mechanism

Attention mechanisms allow neural networks to decide which vectors (or words) from the past are important for future decisions by considering them in context with the word in question. In this process, the network picks out important and relevant chunks of information and skips over parts of the sequence that are not relevant to the final goal or task. Such relationships among words, and their connections to neighboring words, can be represented by the directed arcs of a semantic dependency graph.

Further, an attention mechanism takes into account the input from several time steps and distributes attention over the hidden states by assigning different weights, or degrees of importance, to those inputs. For a fixed target word, the first task is to loop over all the encoder states and compare the target state with each source state to generate a score for each encoder state. A softmax is then applied to normalize the scores, which gives a probability distribution conditioned on the target state. Finally, these weights are applied to the encoder states to form a context vector that is easy to train.

The principal advantage of the attention mechanism lies in the context vector’s ability to take all cells’ outputs as input to compute the probability distribution over the source, giving the decoder the ability to represent global information instead of a single hidden state.
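
The score, softmax and context-vector steps can be sketched as follows. Dot-product scoring is an assumption made here for brevity, and the hidden states are toy values; the AttLayer used in the listing further down may score the states differently.

```python
import math


def softmax(scores):
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(target, encoder_states):
    """Dot-product attention of one target state over all encoder states."""
    scores = [sum(t * s for t, s in zip(target, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the weighted sum of all encoder states.
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy hidden states
target_state = [2.0, 0.0]

weights, context = attend(target_state, encoder_states)
print(weights, context)
```

The first and third states align with the target and receive most of the weight, while the second is almost ignored. The context vector therefore summarizes the whole sequence rather than a single hidden state.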

Bi-directional GRU and LSTM networks with Attention mechanism
Model Summary Bi-directional LSTM/GRU with Attention layer

The source code below shows how to build a single Bi-directional GRU layer, with Attention layer of 64 neurons, and classify tweets based on predefined classes using “softmax” classifier and “Adam” optimizer. Source code available at https://github.com/sharmi1206/elections-2019

#fileName classifygloveattlstm.py at https://github.com/sharmi1206/elections-2019
import numpy as np
from keras.layers import Dense, Input
from keras.layers import GRU, Bidirectional, Embedding
from keras.models import Model
from sklearn.metrics import log_loss, accuracy_score
from sklearn import metrics
from sklearn.metrics import confusion_matrix

NO_CLASSES = 8
# Embedding layer initialised with the pre-trained embedding matrix and fine-tuned
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
# return_sequences=True keeps every time step so the attention layer can weight them
l_gru = Bidirectional(GRU(100, return_sequences=True))(embedded_sequences)
#Refr:https://github.com/richliao/textClassifier/issues/28
# AttLayer is a custom attention layer (not part of Keras); see the reference above
l_att = AttLayer(64)(l_gru)
preds = Dense(NO_CLASSES, activation='softmax')(l_att)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])
model.summary()
# Keras 2 names this argument epochs (formerly nb_epoch)
model.fit(x_train, y_train,
          epochs=15, batch_size=64)
#Evaluate model Accuracy
output_test = model.predict(x_test)
# Predicted class = index of the highest softmax probability
final_pred = np.argmax(output_test, axis=1)
# Recover the integer label from each one-hot row of y_test
org_y_label = [np.where(r==1)[0][0] for r in y_test]
results = confusion_matrix(org_y_label, final_pred)
precisions, recall, f1_score, true_sum = metrics.precision_recall_fscore_support(org_y_label, final_pred)

pred_indices = np.argmax(output_test, axis=1)
classes = np.array(range(0, NO_CLASSES))
preds = classes[pred_indices]
print('Log loss: {}'.format(log_loss(classes[np.argmax(y_test, axis=1)], output_test)))
print('Accuracy: {}'.format(accuracy_score(classes[np.argmax(y_test, axis=1)], preds)))

Accuracy with Pre-trained Word Embeddings

Accuracy and Log Loss for sentiment prediction BJP vs Congress
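
For readers new to the two metrics, the sketch below computes multi-class log loss and accuracy by hand in plain Python. The predictions are invented for illustration and are not results from the models in this post; the listings above use scikit-learn's log_loss and accuracy_score for the real numbers.

```python
import math


def multiclass_log_loss(true_labels, probabilities, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(true_labels, probabilities):
        p = min(max(probs[label], eps), 1.0 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(true_labels)


def accuracy(true_labels, probabilities):
    """Fraction of samples whose most probable class is the true class."""
    hits = sum(1 for label, probs in zip(true_labels, probabilities)
               if probs.index(max(probs)) == label)
    return hits / len(true_labels)


# Three toy predictions over three classes (numbers invented for illustration).
y_true = [0, 2, 1]
y_prob = [[0.7, 0.2, 0.1],
          [0.1, 0.3, 0.6],
          [0.5, 0.4, 0.1]]

loss = multiclass_log_loss(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(loss, acc)
```

Accuracy only checks whether the top class is right, whereas log loss also penalizes a model for being confidently wrong, so it is useful to compare models on both.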

Word Embeddings with Convolutional Neural Networks (CNN) on Election Tweets

Convolution Neural Networks with Word2Vec Models with Gensim by building the election corpus

The word2vec tool takes a text corpus (a list of tweets) as input and produces word vectors as output. It first constructs a unique vocabulary set from the training text data (a list of tokenized tweets) and then learns vector representations of words, representing n-gram features that aid the sentiment classification process. The process is the same word embedding used in pre-trained word embeddings, the only difference being that training takes place on election tweets instead of pre-trained data. We used Keras to convert positive integer representations of words into a word embedding with an Embedding layer.

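#fileName classifyw2veccnn.py at https://github.com/sharmi1206/elections-2019
import multiprocessing
import pandas as pd
from gensim.models import Word2Vec
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Keep only the 20000 most frequent words
num_words = 20000
tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(combined_df['tweet'].values)
word_index = tokenizer.word_index

# Pad the tweet data
X = tokenizer.texts_to_sequences(combined_df['tweet'].values)
X = pad_sequences(X, maxlen=2000)
# One-hot encode the mood labels
Y = pd.get_dummies(combined_df['mood']).values
# tokenized_corpus, vector_size and window_size must be defined before this call.
# gensim 4.x renames size to vector_size and iter to epochs.
word2vec = Word2Vec(sentences=tokenized_corpus,
                    size=vector_size,
                    window=window_size,
                    iter=500,
                    seed=300,
                    workers=multiprocessing.cpu_count())
# Copy word vectors 
X_vecs = word2vec.wv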
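
To see what the tokenizing and padding steps do to a tweet, here is a simplified imitation in plain Python. The function names are made up, and the real Keras Tokenizer also strips punctuation and handles num_words slightly differently. The sketch follows the Keras defaults of ranking words by frequency from index 1 and padding or truncating at the front.

```python
from collections import Counter


def build_word_index(texts, num_words):
    """Rank words by frequency; index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(num_words), start=1)}


def texts_to_padded_sequences(texts, word_index, maxlen):
    padded = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[-maxlen:]                             # keep the last maxlen ids
        padded.append([0] * (maxlen - len(ids)) + ids)  # left-pad with zeros
    return padded


tweets = ["vote for change", "vote today", "change is coming vote now"]
toy_index = build_word_index(tweets, num_words=20000)
padded = texts_to_padded_sequences(tweets, toy_index, maxlen=4)
print(toy_index)
print(padded)
```

Every tweet ends up as a row of the same length, which is what the Embedding layer needs as input.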

The CNN used for sentiment prediction is composed of 1D convolution layers and 1D pooling layers stacked in a series of 4 blocks, with 32, 64, 128 and 256 filters respectively.

The 1D convolution layer in the network performs convolutions (feature mapping) over the ordered embedded word vectors in a sentence using a filter size of 3, sliding over 3 words at a time. This lets the network look at 3-grams to understand how words contribute to sentiment in the context of those around them.

After each convolution, we add a max-pool layer to extract the most significant elements and turn them into a feature vector. Further, we add dropout regularization of 20% to help ensure the model does not overfit. The resulting tensor is flattened into one long, single-column vector. The long feature vector is then used by a dense layer with softmax activation to yield the classified output.

#fileName classifyw2veccnn.py at https://github.com/sharmi1206/elections-2019
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv1D, MaxPooling1D
from keras.optimizers import Adam
from keras.models import Sequential

batch_size = 64
nb_epochs = 20
vector_size = 512
max_tweet_length = 100

model = Sequential()
# Block 1 (32 filters): only the first layer needs the input shape
model.add(Conv1D(32, kernel_size=3, activation='elu', padding='same', input_shape=(max_tweet_length, vector_size)))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.2))
# Block 2 (64 filters)
model.add(Conv1D(64, kernel_size=3, activation='elu', padding='same'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.2))
# Block 3 (128 filters)
model.add(Conv1D(128, kernel_size=3, activation='elu', padding='same'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.2))
# Block 4 (256 filters)
model.add(Conv1D(256, kernel_size=3, activation='elu', padding='same'))
model.add(Dropout(0.2))
model.add(MaxPooling1D(pool_size=2))

# Flatten the feature maps and classify into the 8 moods
model.add(Flatten())
model.add(Dense(8, activation='softmax'))

# Compile the model (newer Keras versions name the lr argument learning_rate)
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=0.001, decay=1e-6),
              metrics=['accuracy'])

# Fit the model
model.fit(X_train, Y_train,
          batch_size=batch_size,
          shuffle=True,
          epochs=nb_epochs)
Model Summary Convolution Neural Networks
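
As a rough cross-check of the shapes in this network, the sketch below traces the sequence length through the four convolution and pooling blocks, assuming the tweet length of 100 from the listing above. It relies on two facts about the layers: a convolution with padding='same' keeps the length, and max pooling with pool_size=2 halves it, rounding down.

```python
def trace_shapes(length, filters_per_block, pool_size=2):
    """Return the (length, channels) shape after each conv + pool block."""
    shapes = []
    for filters in filters_per_block:
        # Conv1D with padding='same' keeps the length; pooling divides it (floor).
        length = length // pool_size
        shapes.append((length, filters))
    return shapes


shapes = trace_shapes(100, [32, 64, 128, 256])
flat = shapes[-1][0] * shapes[-1][1]  # size of the vector after Flatten
dense_params = flat * 8 + 8           # weights plus biases of the 8-way softmax layer
print(shapes, flat, dense_params)
```

Dropout does not change shapes, so it is left out of the trace.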

Word Embeddings with Recurrent Neural Networks (LSTM/GRU/Bi-directional LSTMs) on Election Tweets

The neural network architecture (each of LSTM, GRU, Bi-directional LSTM/GRU) is modeled on the 20000 most frequent words, where each tweet is padded to a maximum length of 2000. The first layer is the Embedding layer, which uses vectors of length 128 to represent each word (each word is tokenized with Keras’s Tokenizer). The next layer is the LSTM layer with 256 memory neurons. Finally, the results are fed to a single output Dense layer with 8 neurons and a softmax activation function to predict the associated mood.

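#fileName classifyw2veclstm.py at https://github.com/sharmi1206/elections-2019
from keras.layers import Dense, Embedding, LSTM
from keras.models import Sequential
from sklearn.model_selection import train_test_split

NO_CLASSES = 8
embed_dim = 128
lstm_out = 256
model = Sequential()
# Embedding trained from scratch on the election tweets
model.add(Embedding(num_words, embed_dim, input_length = X.shape[1]))
model.add(LSTM(lstm_out, recurrent_dropout=0.2, dropout=0.2))
model.add(Dense(NO_CLASSES, activation='softmax'))
# Track accuracy so the training log reports acc
model.compile(loss = 'categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])
print(model.summary())
# Stratified 80/20 split keeps the mood proportions in both sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42, stratify=Y)
# Fit the model (batch_size and nb_epochs must be defined before this call)
model.fit(X_train, Y_train,
          batch_size=batch_size,
          shuffle=True,
          epochs=nb_epochs)
output_test = model.predict(X_test)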
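
To get a feel for the size of this network, the trainable parameters of each layer can be counted by hand from the dimensions above (20000 words, 128-length embeddings, 256 LSTM units, 8 classes). This is a back-of-the-envelope sketch using the standard per-layer formulas, not output copied from model.summary().

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four weight blocks (input, forget and output gates plus the candidate),
    # each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    return input_dim * units + units


counts = {
    "embedding": embedding_params(20000, 128),
    "lstm": lstm_params(128, 256),
    "dense": dense_params(256, 8),
}
counts["total"] = sum(counts.values())
print(counts)
```

Most of the parameters sit in the embedding layer, which is one reason an embedding trained only on a small election corpus can fit the training data very closely.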

The model reaches 99.58% training accuracy after 5 epochs with a batch size of 128.

Epoch 5/5
...........64/7344 [..............................] - ETA: 58:45 - loss: 0.0218 - acc: 1.0000
128/7344 [..............................] - ETA: 54:28 - loss: 0.0259 - acc: 1.0000
192/7344 [..............................] - ETA: 57:35 - loss:
........
........
7232/7344 [============================>.] - ETA: 58s - loss: 0.0328 - acc: 0.9960
7296/7344 [============================>.] - ETA: 24s - loss: 0.0330 - acc: 0.9959
7344/7344 [==============================] - 3811s 519ms/step - loss: 0.0331 - acc: 0.9958

Conclusion

In this post, we reviewed deep learning methods for creating vector representations of sentences with RNNs and CNNs, and presented their effectiveness in solving a supervised sentiment prediction task.

  • With GloVe pre-trained word embeddings, the Bi-directional LSTM and Bi-directional GRU with an attention layer perform the best, while the auto-encoder model performs the worst, for both BJP and Congress.
  • With a word embedding matrix trained solely on election tweets, the accuracy of the RNN models (LSTM, GRU, Bi-directional LSTM/GRU) increases to almost 99.5%, but the CNN model performs the worst, with 50% accuracy.

However, each of these models can be further improved through extensive tuning of hyper-parameters, different numbers of epochs and learning rates, and the addition of more labelled data for minority classes. Altering the neural network architecture by increasing or decreasing the number of neurons and hidden layers might give further improvements.

References

  1. https://www.researchgate.net/figure/The-architecture-of-sentence-representation-learning-network_fig2_325642880
  2. https://blog.myyellowroad.com/unsupervised-sentence-representation-with-deep-learning-104b90079a93
  3. https://www.analyticsvidhya.com/blog/2019/01/sequence-models-deeplearning/
  4. http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
  5. https://code.google.com/archive/p/word2vec/

Please let me know if there are any mistakes; suggestions and feedback are welcome. The election repository is available at https://github.com/sharmi1206/elections-2019. Please feel free to follow me on LinkedIn.