Political Data Science: A tale of tweets

Source: Deep Learning on Medium

Analysing Sentiment towards a new Scottish Independence Referendum

Keywords: Twitter, tweepy, vaderSentiment, nltk, scikit learn, keras, machine learning, deep learning, LSTM, RNN, SVM, Naive Bayes, geoplot, geopy, nominatim, folium, pygeocoder.

Word Cloud of tweets

We are all familiar with the immense power data-driven political campaigns have over the electorate. We only have to look at the 2016 U.S. presidential election, where Hillary Clinton and Donald Trump went head to head and ended up revolutionising the way U.S. politicians win elections.

Instead of using the traditional strategy, their campaigns capitalised on data that gave them the ability to micro-target voters. Eitan Hersh, Professor at Yale University, concluded in his book Hacking the Electorate, that if you have enough data, you can predict how people will behave and even how they will vote. And campaigns have developed sophisticated ways of doing this.

And although voters will eventually go for the politician or the idea that most aligns with their priorities, what is important is to know that data gives us ways to discover what those priorities are.

Photo by Element5 Digital on Unsplash

In the UK, things haven’t been very easy going, and a lot of lessons from the Trump campaign were learned and used by the Brexit promoters. Fast forward to 2020 and the main Brexit figure, Boris Johnson, is now the newly elected Conservative Prime Minister of the rather divided “United” Kingdom.

The current political situation in Scotland after the Brexit vote, and most recently, Boris Johnson’s win in the winter General Election of 2019, is very heated.

Scottish voters were first asked whether they wanted Scotland to become an independent country in a referendum in September 2014; the result was 55% to 45% against independence.

© FT Montage/Getty/PA

The Scottish National Party (SNP)’s 2019 General Election manifesto stated that the party intended to hold a second referendum in 2020, and the party won 48 of Scotland’s 59 seats in the UK’s House of Commons. So, naturally, First Minister Nicola Sturgeon has now claimed that “the kind of future desired by most people in Scotland is very clearly different to that favoured by much of the rest of the UK”. She formally requested the power to hold an independence referendum before the end of 2020, but Boris Johnson denied the request in early January.

All of this has created a complicated environment in a nation that appears to be in a very difficult (and not very favourable) position within Boris Johnson’s UK.

The questions

Given the situation, I wondered if we could get an idea of what the reaction of the Scottish people has been during these tumultuous times, particularly after Boris Johnson’s win and his refusal to allow the Scots another referendum.

So my questions are:

How are people in the UK reacting to the current political climate regarding calls for a new Scottish independence referendum, and to Boris Johnson’s refusal to allow one?

How do those reactions differ across people in Scotland, England, Wales and Northern Ireland, with particular interest in Scotland?

So, I downloaded a week of tweets from January 8th to January 15th, 2020 using the keywords “indyref2”, “scottish independence” and “scotref”. The process I used to scrape the data and to analyze sentiment can be repeated for any twitter account or media page.

Using Sentiment Analysis (also known as opinion mining), a subfield of Natural Language Processing, I looked into twitter as a sort of political barometer for Scottish independence. I trained two supervised machine learning models: a Support Vector Machine (SVM) and a Naive Bayes classifier. I then performed a systematic comparison with a deep learning Recurrent Neural Network (RNN) known as a Long Short-Term Memory (LSTM) network. After evaluating the models, I chose the LSTM model to predict sentiment in the twitter dataset.

You can check out the complete project with its technical details in the Github repository.

Photo by George Pagan III on Unsplash

Why twitter?

I chose to look at Twitter because it — and social media in general — is becoming increasingly integral to everyday life and this is also true in the world of politics. Also, Sentiment Analysis projects use mostly Twitter data because that’s what’s (almost) publicly available whereas it is (almost) impossible to collect any useful data from Facebook.

ETL: Extract, Transform, Load

The training data was obtained from Sentiment140 and is made up of about 1.6 million random tweets with corresponding binary labels: “0” for negative sentiment and “4” for positive sentiment. The original dataset consists of 800k tweets labeled positive and 800k labeled negative. That’s a lot of tweets, so I took a sample of the larger training dataset to avoid long waiting times. In the end, I used 25,160 tweets labeled negative and 24,840 labeled positive for training.
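A minimal sketch of the sampling step, assuming the tweets are already loaded as (label, text) pairs. The toy rows, `sample_size` and `seed` are illustrative and not the project's actual code; remapping the label 4 to 1 is a common convenience that the article does not mention.

```python
import random
from collections import Counter

# Hypothetical stand-in for the Sentiment140 rows: (label, text) pairs,
# where 0 = negative and 4 = positive.
rows = [(0, f"sad tweet {i}") for i in range(500)] + \
       [(4, f"happy tweet {i}") for i in range(500)]

def sample_tweets(rows, sample_size, seed=42):
    """Draw a reproducible random sample and remap labels 0/4 to 0/1."""
    rng = random.Random(seed)
    sampled = rng.sample(rows, sample_size)
    return [(1 if label == 4 else 0, text) for label, text in sampled]

sample = sample_tweets(rows, sample_size=100)
label_counts = Counter(label for label, _ in sample)
print(label_counts)  # roughly balanced, but not necessarily an exact 50/50 split
```

A plain random sample does not guarantee an exact 50/50 split, which would be consistent with the slightly uneven 25,160 / 24,840 counts reported above.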

Distribution of positive and negative tweets in training dataset by word count.

A pretty balanced training dataset as shown in the above histogram.

In any natural language processing task, cleaning raw text data is an important step. It helps in getting rid of the unwanted words and characters which helps in obtaining better features. For this step, I used a pipeline-like preprocessing step to remove unwanted noise in the tweets using some helper functions:

Gist: https://gist.github.com/gracecarrillo/53ab0c64121514abe02a74e483fd29ce

After this step, you can see the difference between the raw tweets and the cleaned tweets quite clearly. Only the important words in the tweets are retained and the noise (numbers, punctuation and special characters) has been removed.
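To make the cleaning step concrete, here is a small sketch of what such a helper could look like. It is an assumption about the general approach (regular expressions that strip URLs, mentions, numbers and punctuation), not a copy of the helper functions in the gist; `clean_tweet` is a hypothetical name.

```python
import re

def clean_tweet(text):
    """Lower-case a tweet and strip URLs, mentions, numbers and punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)                    # user mentions
    text = re.sub(r"#", " ", text)                       # keep the hashtag word, drop '#'
    text = re.sub(r"[^a-z\s]", " ", text)                # numbers, punctuation, symbols
    return " ".join(text.split())                        # collapse whitespace

raw = "@BBCPolitics Time for #IndyRef2!!! 55% said no in 2014 https://example.com"
print(clean_tweet(raw))
```

Note that stripping digits this way also turns “indyref2” into “indyref”; whether that is desirable depends on how the keywords are handled later.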

For the test data, I downloaded tweets using Twitter’s API; these are used to test the model’s real-world performance. I used tweepy, a Python wrapper library for the Twitter API that gives you more control over how you query the API. To get a good chunk of data for the test dataset, I downloaded data using the keywords 'indyref2', 'scottish independence' and 'scotref' through the API search function. Twitter’s REST API has a limitation in its free version though: it searches against a sampling of recent tweets published in the past 7 days. So I was only able to collect a sample of 636 tweets. Nonetheless, I was able to extract some interesting information from them.

For a detailed step-by-step of this section, check the notebook in my Github.

Feature Engineering

To analyse preprocessed data, it needs to be converted into features. Depending on the use case, text features can be constructed using assorted techniques such as Bag of Words (BoW), TF-IDF and Word Embeddings.

I went for Bag of Words to keep the project simple. But such a basic approach is not able to capture the difference between phrases like “I like you”, where “like” is a verb with a positive sentiment, and “I am like you”, where “like” is a preposition that expresses a different sentiment.
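The limitation is easy to see in a toy example. The sketch below builds word counts with the standard library only (it is not the vectoriser used in the project) and shows that both phrases give “like” exactly the same feature value.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count word occurrences, ignoring order and grammar."""
    return Counter(sentence.lower().split())

a = bag_of_words("I like you")
b = bag_of_words("I am like you")

print(a["like"], b["like"])    # the count for 'like' is identical in both
print(sorted((b - a).keys()))  # the only difference is the extra word 'am'
```

Because word order and grammatical role are discarded, the two uses of “like” are indistinguishable to the model unless extra features are added.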

To improve this technique I extracted features using Vader’s Polarity Scores and Part of Speech (POS) tags.

The Vader sentiment analysis tool produces four sentiment metrics. The first three, positive, neutral and negative, are self-explanatory. The final metric, the compound score, is the sum of all of the lexicon ratings, standardised to range between -1 and 1. I used these scores to create features based on the sentiment metrics of the tweets, which were then used as additional features for modeling. These are very useful metrics if you want multidimensional measures of sentiment for a given sentence. I extracted and created the new features with the following helper function:

Gist: https://gist.github.com/gracecarrillo/2eb646cad60a146d5b93fa2e3c6213fb
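To illustrate how a sum of lexicon ratings can be standardised to the range -1 to 1, here is a sketch of a squashing function of the form x / sqrt(x² + alpha). To my understanding VADER uses a normalisation of this shape with alpha = 15, but treat the constant and the function name as assumptions and check the library source before relying on them.

```python
import math

def normalise_compound(valence_sum, alpha=15):
    """Squash an unbounded sum of word valences into the range [-1, 1]."""
    score = valence_sum / math.sqrt(valence_sum ** 2 + alpha)
    return max(-1.0, min(1.0, score))

# Hypothetical valence sums, from strongly negative to strongly positive.
for total in (-6.0, -1.5, 0.0, 1.5, 6.0):
    print(total, round(normalise_compound(total), 4))
```

The function is symmetric around zero and flattens out for large sums, so a handful of very strong words cannot push the score past -1 or 1.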

Part of Speech (POS) tagging is the process of marking up each word in a corpus with a corresponding part-of-speech tag, based on its context and definition. This is useful because the same word with a different part of speech can have two completely different meanings. The task is not straightforward, as a particular word may have a different part of speech depending on the context in which it is used. For this project, I used a simple lexical-based method that assigns each word the POS tag it most frequently carries in the training corpus, and added the tags as features in the model. See below for the helper function:

Gist: https://gist.github.com/gracecarrillo/141e0e5cf7e20d78a669c03465dcf068
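A lexical-based tagger of this kind can be sketched in a few lines. The tiny hand-tagged corpus, the tag names and the function names below are all hypothetical; the project itself relies on a real tagged corpus.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training corpus of (word, tag) pairs.
tagged_corpus = [
    ("i", "PRON"), ("like", "VERB"), ("you", "PRON"),
    ("we", "PRON"), ("like", "VERB"), ("maps", "NOUN"),
    ("i", "PRON"), ("am", "VERB"), ("like", "ADP"), ("you", "PRON"),
]

def train_most_frequent_tag(corpus):
    """Map each word to the tag it carries most often in the corpus."""
    counts = defaultdict(Counter)
    for word, tag in corpus:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_sentence(sentence, lexicon, default="NOUN"):
    """Tag each word by lookup; unseen words fall back to a default tag."""
    return [(w, lexicon.get(w, default)) for w in sentence.lower().split()]

lexicon = train_most_frequent_tag(tagged_corpus)
print(tag_sentence("I am like you", lexicon))
```

In this toy corpus “like” is tagged as a verb more often than as a preposition, so it is tagged VERB even in “I am like you”. That is the trade-off of the method: it is simple and fast, but it ignores context.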

Check out the notebook in my Github for a step-by-step of this section.

Model definition and training

Supervised Machine Learning

After being done with all the pre-modeling stages, we build our models, train them and test them. For this task, I first defined and trained two supervised machine learning algorithms:

  • Naive Bayes Classifier
  • Support Vector Machine Classifier

They are arguably two of the most used techniques for any classification task.

Naive Bayes is a kind of classifier that uses Bayes’ Theorem. It predicts membership probabilities for each class, such as the probability that a given record or data point belongs to a particular class. The class with the highest probability is considered the most likely class.

Copyright © Chris Albon, 2020.
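As a rough illustration of the idea, the sketch below implements a tiny multinomial Naive Bayes classifier with add-one smoothing on a made-up training set. It is not the scikit-learn model trained in this project, and all names and data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (label, text). Returns class counts, word counts, vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in docs:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict_nb(text, class_counts, word_counts, vocab):
    """Return the class with the highest log posterior (add-one smoothing)."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        log_prob = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][word] + 1  # add-one smoothing
            log_prob += math.log(count / (total_words + len(vocab)))
        scores[label] = log_prob
    return max(scores, key=scores.get)

# Hypothetical toy training set, not the Sentiment140 data.
docs = [
    ("pos", "great march today"), ("pos", "love this great city"),
    ("neg", "awful decision today"), ("neg", "hate this awful mess"),
]
model = train_nb(docs)
print(predict_nb("great city", *model))
print(predict_nb("awful mess", *model))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.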

SVM classifier works by mapping data to a high-dimensional feature space so that data points can be categorised, even when the data are not otherwise linearly separable. A separator between the categories is found, then the data are transformed in such a way that the separator could be drawn as a hyperplane. Following this, characteristics of new data can be used to predict the group to which a new record should belong.

Copyright © Chris Albon, 2020.
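The idea of mapping data into a higher-dimensional space can be shown without training an SVM at all. In the sketch below, made-up one-dimensional points cannot be split by a single threshold, but after lifting each point to (x, x²) a straight line separates them. The data, the mapping and the threshold are illustrative assumptions.

```python
# Hypothetical 1-D data: the positive class sits on both sides of the negative one,
# so no single threshold on x separates the two classes.
points = [(-3.0, 1), (-2.0, 1), (-0.5, 0), (0.0, 0), (0.5, 0), (2.0, 1), (3.0, 1)]

def feature_map(x):
    """Lift x into two dimensions: (x, x squared)."""
    return (x, x * x)

def linear_separator(features, threshold=1.5):
    """A horizontal line in the lifted space: predict 1 when x squared exceeds the threshold."""
    return 1 if features[1] > threshold else 0

predictions = [linear_separator(feature_map(x)) for x, _ in points]
labels = [label for _, label in points]
print(predictions == labels)
```

An SVM with a non-linear kernel does something similar implicitly, and additionally chooses the separator that maximises the margin between the classes.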

After defining the models, I used Scikit Learn GridSearchCV to cross-validate and select the best hyper-parameter configuration at the same time. With Grid Search you set up a grid of hyperparameter values and for each combination, train a model and score on the validation data. In this approach, every single combination of hyperparameters values is tried.

I passed the hyperparameter grid to the GridSearchCV object for each classifier, with 10 folds for the cross-validation, which means that for every parameter combination the grid ran 10 different iterations with a different test set every time (this took a while…).

Gist: https://gist.github.com/gracecarrillo/84218a4403b2165fa6d44ba7736598fd
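To see why an exhaustive grid search with 10-fold cross-validation takes a while, the sketch below enumerates the work by hand using only the standard library. The parameter grid is hypothetical (the article does not list the values searched), and the loop only counts the fits that a real search would perform.

```python
from itertools import product

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

# Hypothetical grid; the values actually searched are not given in the article.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_fits = 0
for params in combinations:
    for train_idx, test_idx in k_fold_indices(n_samples=50, k=10):
        n_fits += 1  # a real search would fit on train_idx and score on test_idx here

print(len(combinations), n_fits)  # 6 combinations, 60 fits
```

The number of fits is the number of parameter combinations multiplied by the number of folds, so every extra hyperparameter value multiplies the running time.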

After trying out the different parameter combinations, GridSearchCV returned the best-performing model per classifier. I then saved the models for evaluation, a necessary step if we plan to deploy the model.

Deep Learning

For comparison, I also implemented a Recurrent Neural Network (RNN) known as Long Short-Term Memory (LSTM).

To quickly describe it, think about how we think. We don’t start our thinking from scratch every second. We build on it. Just as you are reading this post, you are increasing your understanding of it as you read.

Traditional neural networks can’t do this.

And if we want a neural network to understand our tweets, we need one that can learn from what it reads and build on it. RNNs address this issue. They are networks with loops in them, allowing information to persist. But we also need our network to use some of the context in the tweets to learn. Meaning, we need it to remember information for a longer period of time than its other RNN siblings are able to.

Enter LSTMs! They are the child we need.

The network architecture I used is as follows:

  • First, the words are passed to an embedding layer, our first hidden layer.
  • The resulting embeddings are then passed to LSTM cells, our second hidden layer.
  • Finally, the LSTM outputs go to a softmax output layer.

Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

There was quite a bit of data preparation involved to get the tweets ready to enter the LSTM network. I had to encode the words in each tweet as integers, so I converted the tweets into sequences of integers using Tokenizer from Keras. The encoded tweets can then be passed into the network.
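The encoding idea can be sketched with the standard library alone. This is a simplified imitation of what a tokenizer followed by sequence padding does, not the Keras implementation; the function names, the toy tweets and the choice of left-padding with zeros are assumptions made for illustration.

```python
from collections import Counter

def fit_vocabulary(texts, num_words=None):
    """Assign integer ids by descending word frequency; id 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(num_words)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}

def encode(text, vocab):
    """Replace each known word by its id; unknown words are dropped."""
    return [vocab[word] for word in text.split() if word in vocab]

def pad(sequence, maxlen):
    """Left-pad with zeros, or keep the last maxlen ids if the sequence is too long."""
    sequence = sequence[-maxlen:]
    return [0] * (maxlen - len(sequence)) + sequence

tweets = ["time for indyref", "indyref is dead", "march for indyref today"]
vocab = fit_vocabulary(tweets)
encoded = [pad(encode(t, vocab), maxlen=5) for t in tweets]
print(vocab["indyref"], encoded[0])
```

Every tweet ends up as a list of the same length, which is what the embedding layer expects as input.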

Building the network was pretty easy using Keras. You can simply stack multiple layers on top of each other:

Gist: https://gist.github.com/gracecarrillo/45279ecc33f3a7d217ac76ff44273735

I ran the model for 10 epochs and observed that the loss on the validation data begins to increase after epoch 1, which suggests overfitting. With the help of the helper functions eval_metric and optimal_epoch, I addressed the issue by tweaking the model a bit.

Minimum validation loss reached in epoch 1

Two things we observe from the graphs are:

  • The training loss keeps decreasing after every epoch. Our model is learning to recognise the specific patterns in the training set.
  • The validation loss keeps increasing after every epoch. Our model is not generalising well enough on the validation set.

The training loss continues to go down and almost reaches zero at epoch 10. This is normal, as the model is trained to fit the training data as well as possible.

So, we are overtraining (also known as, the model is overfitting).
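Finding the epoch with the minimum validation loss is a one-liner once the loss history is available. The sketch below uses made-up loss curves shaped like the ones described above; `best_epoch` is a hypothetical stand-in and not necessarily how the project's own helper works.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1

# Hypothetical loss curves: training loss keeps falling, validation loss keeps rising.
train_loss = [0.55, 0.42, 0.31, 0.22, 0.15, 0.10, 0.07, 0.05, 0.03, 0.02]
val_loss = [0.52, 0.54, 0.58, 0.63, 0.70, 0.78, 0.85, 0.93, 1.02, 1.10]

print(best_epoch(val_loss))
```

A widening gap between the two curves after the best epoch is the usual signature of overfitting.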

To address the problem, I applied regularisation, which comes down to adding a cost to the loss function for large weights.
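A minimal sketch of that idea, assuming an L2 penalty (the article does not say which kind of regularisation was used, and the strength `lam` is an illustrative value):

```python
def l2_penalty(weights, lam):
    """Sum of squared weights scaled by the regularisation strength."""
    return lam * sum(w * w for w in weights)

def regularised_loss(data_loss, weights, lam=0.01):
    """Total loss = data loss + L2 penalty on the weights."""
    return data_loss + l2_penalty(weights, lam)

# Hypothetical weight vectors with the same data loss.
small_weights = [0.1, -0.2, 0.05]
large_weights = [3.0, -4.0, 2.5]

print(regularised_loss(0.40, small_weights))
print(regularised_loss(0.40, large_weights))
```

With the same data loss, the model with larger weights is charged a higher total loss, which nudges training towards smaller weights and simpler decision functions.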

Minimum validation loss reached in epoch 2

We can see that it starts overfitting in the second epoch and the validation loss increases more slowly afterwards.

At first sight the regularised model seemed to be the best model for generalisation. But then I checked on the test set using the test_model helper function, which gave a test accuracy of 74.81% for the first LSTM and 74.47% for the LSTM with regularisation.

So applying regularisation helped with the overfitting but it didn’t do much to the model’s accuracy on the test data.

Check out the notebook in my Github for a step-by-step of this section.

Model Evaluation

Model Evaluation is an integral part of the model development process. It helps find the best model that represents our data and how well the chosen model will work in the future. For this classification task, I used these evaluation metrics:

  • Confusion Matrix
  • Accuracy, Recall, Precision and F1-Scores

A confusion matrix shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target value) in the data.
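For a binary classifier the matrix has four cells, and dividing each row by its total gives the kind of percentages reported below. The sketch uses made-up labels, not the article's results, and the function names are hypothetical.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred):
    """Count (actual, predicted) pairs for binary labels 0 and 1."""
    counts = Counter(zip(y_true, y_pred))
    return {
        "tn": counts[(0, 0)], "fp": counts[(0, 1)],
        "fn": counts[(1, 0)], "tp": counts[(1, 1)],
    }

def row_normalise(cm):
    """Express each cell as a share of its actual class."""
    negatives = cm["tn"] + cm["fp"]
    positives = cm["fn"] + cm["tp"]
    return {
        "tn": cm["tn"] / negatives, "fp": cm["fp"] / negatives,
        "fn": cm["fn"] / positives, "tp": cm["tp"] / positives,
    }

# Hypothetical labels, not the article's results.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)
print(cm)
print(row_normalise(cm))
```

After row normalisation the true-negative and false-positive shares sum to 1, as do the false-negative and true-positive shares.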

Confusion Matrices for our Supervised Machine Learning Models

From the above:

  • Naive Bayes Classifier: The model predicted 76% of labels correctly as negative and 73% correctly as positive. The model predicted 27% of the labels as negative, but they were positive (false negatives). The model predicted 24% of the labels as positive when they were negative (false positives).
  • SVM Classifier: The model predicted 75% of labels correctly as negative and 75% correctly as positive. The model predicted 25% of the labels as negative, but they were positive (false negatives). The model predicted 25% of the labels as positive when they were negative (false positives).

Confusion matrices for our LSTM Recurrent Neural Networks

From the above:

  • LSTM Neural Network: The model predicted 76% of labels correctly as negative and 73% correctly as positive. The model predicted 27% of the labels as negative, but they were positive (false negatives). The model predicted 24% of the labels as positive when they were negative (false positives).
  • LSTM Neural Network with Regularisation: The model predicted 74% of labels correctly as negative and 73% correctly as positive. The model predicted 25% of the labels as negative, but they were positive (false negatives). The model predicted 26% of the labels as positive when they were negative (false positives). Compared with the first RNN model, that is slightly fewer false negatives but slightly more false positives.

Next, we look at the classification reports for accuracy, recall, precision and F1-Scores. The differences between these metrics, and why we don’t simply rely on the accuracy score, are nicely explained in this post.
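As a quick reference, the four metrics can be computed directly from the confusion-matrix counts. The counts below are hypothetical (they assume 1,000 positive and 1,000 negative test tweets) and are not taken from the project's reports.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for 1,000 positive and 1,000 negative test tweets.
metrics = classification_metrics(tp=730, fp=240, fn=270, tn=760)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision asks how many tweets predicted as positive really were positive, while recall asks how many of the truly positive tweets were found; F1 is the harmonic mean of the two.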

For Naive Bayes and SVM:

Classification Report for Supervised Machine Learning models

For the LSTM models:

Classification Report for LSTM models

From the above metrics, our models seem to be performing relatively well. Note that the macro and micro averages give the same score in all four models. This means our data is well balanced, that is, the distribution of classes in our dataset is symmetrical. But we knew that already. The point is that a balanced dataset allows us to use the overall accuracy metric to choose a model.

In this paper, researchers found that human raters typically agree 80% of the time.

Therefore a 75% accurate model is doing almost as well as human raters.

Both the SVM and the LSTM network without regularisation outperformed the other models by about 1%, achieving 75% overall accuracy.

I chose the LSTM over the SVM model for the next step because, generally, deep learning really shines when it comes to complex problems such as natural language processing. Another advantage is that we have to worry less about the feature engineering part when it comes to model deployment.

Ultimately the true test was using the model on unseen, real world data.

Check out the notebook in my Github for a step-by-step of the model evaluation section.

Modeling the unknown: the referendum twitter data.

Finally, we have arrived at the fun part: our topic-related twitter data!

Let’s explore the data a bit first by checking out what the most common words in the dataset are:

Understanding the common words used in the tweets

As we can see, the words “indyref2”, “scottish”, “independent” and “scotland” are disproportionately common. Of course they are. They are part of my keywords for downloading tweets! So I excluded them and had another look:

Understanding the common words used in the tweets without keywords
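The before-and-after word counts can be reproduced with a simple frequency count that skips the search terms. The tweets and the exclusion list below are made up for illustration; the cleaned forms of the real keywords may differ.

```python
from collections import Counter

# Hypothetical cleaned tweets standing in for the downloaded dataset.
tweets = [
    "indyref march glasgow today",
    "scotland march for independence",
    "indyref snp march glasgow",
]
# Search terms to exclude (illustrative only).
search_terms = {"indyref", "scotland", "scottish", "independence"}

def top_words(tweets, exclude=frozenset(), n=3):
    """Return the n most frequent words that are not in the exclusion set."""
    counts = Counter(
        word for tweet in tweets for word in tweet.split() if word not in exclude
    )
    return counts.most_common(n)

print(top_words(tweets))
print(top_words(tweets, exclude=search_terms))
```

Removing the search terms lets the words that actually characterise the conversation rise to the top.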

March? Why the hell is that word the most used one?

Ah! If you live in the UK you may be aware that on January 11th, thousands of Scottish independence supporters marched through Glasgow in the first of a series of protests planned for 2020. And that date falls within the timeframe of the downloaded tweets.

Image: Robert Perry via Daily Record

So yeah… march.

Let’s visualize all the words using a word cloud plot, because they’re pretty cool:

Word Cloud with word frequencies

You can see that words like “glasgow”, “snp”, “people”, “today”, “union”, “march” and “referendum” are the most frequent ones. It doesn’t give us any idea about the associated sentiment of the tweets though. So let’s get on with the predictions.

I used the trained LSTM model to generate predictions for the tweets dataset. I must emphasise that the model detects negative or positive sentiment in general. It does not detect whether someone is tweeting against or in favour of the topic in question.

I also had to do some cleaning to make the data neural-network-friendly: basically the same preprocessing as with the training data.

After the predictions, the number of tweets tagged positive was 263 and the number tagged negative was 373:

Distribution of positive and negative tweets in twitter dataset by word count.

There’s a higher number of tweets tagged as negative than tagged as positive, particularly when the word count of the tweet is higher. In our data, the longer the tweet, the more likely it is to be negative.

It is also good to have a quick look at the tweets and their tags.

The following tweets were tagged as negative:

@anninnis @BBCPolitics Vote NO to stay in the EU they said in 2014. Things change, time for IndyRef2


I’m not surprised (in the slightest) that @BorisJohnson has refused @NicolaSturgeon’s request to hold #Indyref2.

They do read a bit angry or upset, although that depends on the reader’s perspective.

The following were tagged as positive by our model:

@CoyJudge Because that vote was done and lost so I moved on now we have Brexit which I was on the winning side of.

I’m Labour, on this I agree totally with Boris. SNP sort out all the sh*t they have created in Scotland. Indyref2 is dead.

Right. So that second tweet prediction looks a bit odd to me and I wouldn’t classify it as having a positive sentiment.

Alas, for the purposes of this project, I continued with this model’s predictions.

Geospatial analysis

Because I love maps, I just had to create geographic visualizations to explore how the tweets are distributed across Scotland, and also across the UK nations, since the tweets revolve around Scottish and UK news.

It’s worth mentioning that there are limitations with our dataset going forward. The majority of twitter users do not broadcast their geolocation, and my search criteria only pulled tweets that had some information that would allow me to get the geolocation.

So it’s possible that I’m missing a lot of tweets without geotags that would paint a different picture from the one I get when the data is plotted on a map.

Knowing this is crucial for interpreting the results, and also for understanding how we can make the model more robust in future analysis.

Now plotting this wasn’t easy. There were several issues with the data, lack of shape files and other hiccups. For example, one location was indicated as Kelvingrove Park, which is in Glasgow, so it should be replaced with a Glasgow label. Or some tweets showed small towns, like Coaltown of Balgonie which belongs to the Fife Area. I’d have to rename that as well for consistency.
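The renaming step can be sketched as a simple lookup table. The two mappings come from the examples in the text; the dictionary and function names are hypothetical, and a real cleanup would need many more entries.

```python
# Mappings taken from the two examples in the text; a real cleanup would need many more.
LOCATION_FIXES = {
    "kelvingrove park": "Glasgow",
    "coaltown of balgonie": "Fife",
}

def normalise_location(raw_location):
    """Map a free-text location to a consistent label, leaving unknown places unchanged."""
    cleaned = raw_location.strip()
    return LOCATION_FIXES.get(cleaned.lower(), cleaned)

for place in ["Kelvingrove Park", " Coaltown of Balgonie ", "Aberdeen"]:
    print(normalise_location(place))
```

Matching on a lower-cased, trimmed key makes the lookup tolerant of the inconsistent capitalisation and spacing that user-entered locations tend to have.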

After all that, finally I was ready for some plots and geoplots.

First, let’s look at where the majority of tweets originate.

Histogram of tweets counts by country of origin
Histogram of tweets counts by nation/state of origin
Histogram of tweets counts by city of origin

Most tweets come from Scotland and specifically, from Glasgow, a city known for its high percentage of pro-independence supporters. However, a good amount come from England.

Next, I calculated the total sentiment for each of the UK cities by adding up its positive and negative tweets; the sign of the final number indicates whether sentiment is positive or negative overall.
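One way to read this aggregation is that each positive tweet counts as +1 and each negative tweet as -1, summed per city. The sketch below follows that reading; it is an assumption about the calculation, and the city predictions are made up.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (city, predicted_label) pairs; 1 = positive, 0 = negative.
predictions = [
    ("Glasgow", 0), ("Glasgow", 0), ("Glasgow", 1),
    ("Aberdeen", 1), ("Aberdeen", 1),
    ("Cardiff", 1), ("London", 0),
]

def total_sentiment_by_city(predictions):
    """Add +1 for each positive tweet and -1 for each negative tweet, per city."""
    totals = defaultdict(int)
    for city, label in predictions:
        totals[city] += 1 if label == 1 else -1
    return dict(totals)

totals = total_sentiment_by_city(predictions)
print(totals)
print(median(totals.values()))
```

A total near zero can mean either very few tweets or an even split, so the totals are best read alongside the tweet counts per city.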

Histogram of total sentiment

From the histogram it is clear that the majority of values lie between -5 and 5, with the overall sentiment slightly skewed towards the negative side. The median sentiment is negative (-1). There are also a couple of cities with a large total negative sentiment compared to the rest of the cities in the dataframe.

I then generated the following map to visualise the spatial distribution of cities in the dataframe across these negative and positive sentiment dimensions. Check out the interactive map by clicking here.

Map of sentiment distribution

As a reminder, the total sentiment used is obtained by summing up the positives and negatives for each city to come up with a final number.

I set up a scale of colours where any sentiment below -1 is marked as red and negative, between -1 and 1 is light blue and neutral, and finally, above 1 is considered as positive and coloured in blue.
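The colour scale amounts to a small threshold function. The colour names and the function name below are assumptions; only the thresholds come from the description above.

```python
def sentiment_colour(total):
    """Map a city's total sentiment to the marker colour described above."""
    if total < -1:
        return "red"        # negative
    if total > 1:
        return "blue"       # positive
    return "lightblue"      # neutral: between -1 and 1 inclusive

for total in (-19, -1, 0, 1, 5):
    print(total, sentiment_colour(total))
```

Treating the values -1 and 1 themselves as neutral is a choice made here; the text does not say which side the boundaries fall on.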

Overall, it seems that tweets sent from cities in Scotland have more negative sentiment than those coming from England and Wales. There is one city in Northern Ireland with tweets that give an overall neutral sentiment. The city with the highest negative sentiment overall (-19) is Glasgow, followed by Invergordon (-13) in the north of Scotland.

Tweets from Edinburgh, Glasgow, Dundee and Stirling show negative sentiment, with Aberdeen as the only main city with positive sentiment. For England, we can see more blue dots than red. However, tweets coming from large cities like Manchester, London and Bristol have an overall negative sentiment. Tweets coming from Cardiff, the capital of Wales, have an overall positive sentiment. Not enough tweets originated in Northern Ireland to get an idea of the overall sentiment.

Next, I generated a heatmap to see every tweet in the dataset from their respective geolocation data and observe the density. Check out the interactive map by clicking here.