Original article was published by Sourabh Dharpure (B17CS054) on Deep Learning on Medium
Media Bias on Twitter using Sentiment Analysis [BERT]
Table of Contents
- Data Preparation
- Sentiment Prediction
- Analysis of Media Bias
Media bias is the bias or perceived bias of journalists and news producers within the mass media in the selection of many events and stories that are reported and how they are covered.
In this article, I show how some of the top news channels of India are biased towards a particular political party or person.
The Jupyter Notebook of this project can be found here.
The data is collected from the official Twitter accounts of the following English news channels of India:
- CNN News-18
- Republic TV
- Times Now
- NDTV
The pre-trained bert-base-uncased model, available through the Hugging Face library, is used for sentiment analysis. Once trained, the model is used to predict the sentiment of the scraped tweets.
We will train our BERT classifier for sentiment prediction on the Tweet Sentiment Dataset available on Kaggle.
train.csv contains the following columns:
- textID: unique ID for each piece of text
- text: the text of the tweet
- sentiment: the general sentiment of the tweet
- selected_text: [train only] the text that supports the tweet’s sentiment
The dataset also contains selected_text, but we will use only the text and sentiment columns for training.
Converting the sentiment values to integers:
- neutral — 1
- positive — 2
- negative — 0
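The label conversion above can be sketched in plain Python. This is only an illustration: the names `LABEL_TO_ID` and `encode_sentiments` are hypothetical, the sample rows are made up, and skipping rows with empty text is an assumption rather than something the original notebook is known to do.

```python
# Hypothetical label map mirroring the list above.
LABEL_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}


def encode_sentiments(rows):
    """Return (text, label_id) pairs, skipping rows with empty text or an unknown label."""
    encoded = []
    for row in rows:
        text = (row.get("text") or "").strip()
        label_id = LABEL_TO_ID.get(row.get("sentiment"))
        if text and label_id is not None:
            encoded.append((text, label_id))
    return encoded


# Made-up rows standing in for train.csv (only the two columns we keep).
sample_rows = [
    {"text": "I love this!", "sentiment": "positive"},
    {"text": "Worst day ever", "sentiment": "negative"},
    {"text": "Heading to work", "sentiment": "neutral"},
    {"text": "", "sentiment": "neutral"},  # dropped: empty text
]
encoded_rows = encode_sentiments(sample_rows)
```

With a pandas DataFrame the same mapping is usually a one-liner on the sentiment column, but the plain-Python version keeps the sketch dependency-free.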
Creating the dataset
Now we have to convert our text into the input format that BERT requires:
- addition of the special [CLS] and [SEP] tokens.
- truncation and padding of each sentence to a constant length.
- addition of an attention mask.
All of this can be done using the BERT tokenizer. We will use the DataLoader class to iterate over the dataset in batches, so that the whole dataset need not be loaded into memory at once.
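To make the three formatting steps concrete, here is a toy sketch that applies them to an already-tokenized sentence. It is not the Hugging Face tokenizer: the function name `format_for_bert` is hypothetical, the input is assumed to be a list of word pieces, and a real tokenizer would also map each token to an integer ID.

```python
def format_for_bert(tokens, max_len):
    """Add [CLS]/[SEP], truncate or pad to max_len, and build the attention mask."""
    if max_len < 2:
        raise ValueError("max_len must leave room for [CLS] and [SEP]")
    # Reserve two positions for the special tokens, truncating the sentence if needed.
    body = list(tokens)[: max_len - 2]
    sequence = ["[CLS]"] + body + ["[SEP]"]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [1] * len(sequence)
    pad_count = max_len - len(sequence)
    sequence += ["[PAD]"] * pad_count
    attention_mask += [0] * pad_count
    return sequence, attention_mask


formatted, mask = format_for_bert(["media", "bias", "on", "twitter"], max_len=8)
truncated, truncated_mask = format_for_bert(list("abcdefgh"), max_len=5)
```

In practice the Hugging Face tokenizer performs the same steps in one call when given `padding`, `truncation`, and `max_length` arguments; the exact call used in the original notebook is not shown in the article.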
Splitting the training and testing data
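The article does not state how the data was split, so the following is only a minimal sketch of a seeded shuffle-and-split. The function name `split_rows`, the 10% test fraction, and the seed are all assumptions chosen for illustration.

```python
import random


def split_rows(rows, test_fraction=0.1, seed=42):
    """Shuffle a copy of rows with a fixed seed and return (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


all_rows = list(range(100))  # stand-in for (text, label) pairs
train_rows, test_rows = split_rows(all_rows)
```

Fixing the seed means the same rows land in the test set on every run, which keeps evaluation numbers comparable between experiments.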
We will use only the pooled output for sentiment analysis and ignore the sequence output. The bert-base-uncased model is used here; it works on lowercased text and is the smaller of the two BERT model sizes (base and large).
The authors of the BERT paper recommend choosing from the following hyperparameter values for fine-tuning:
- Batch size: 16, 32
- Learning rate (Adam): 5e-5, 3e-5, 2e-5
- Number of epochs: 2, 3, 4
We will be using:
- Batch size: 16 (set when creating our DataLoaders)
- Learning rate: 2e-5
- Epochs: 3
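As a small sanity check, the chosen values can be written down as a configuration and compared against the recommended grid. The sketch also shows how batch size and epoch count determine the number of optimizer updates. The names `RECOMMENDED`, `CHOSEN`, and `total_training_steps` are hypothetical, and the 1,000-example dataset size is a made-up figure, not the size of the Kaggle dataset.

```python
import math

# Values recommended in the BERT paper, as listed above.
RECOMMENDED = {
    "batch_size": {16, 32},
    "learning_rate": {5e-5, 3e-5, 2e-5},
    "epochs": {2, 3, 4},
}

# Values chosen in this article.
CHOSEN = {"batch_size": 16, "learning_rate": 2e-5, "epochs": 3}


def total_training_steps(num_examples, batch_size, epochs):
    """Number of optimizer updates: batches per epoch (last one may be partial) times epochs."""
    return math.ceil(num_examples / batch_size) * epochs


within_grid = all(CHOSEN[name] in RECOMMENDED[name] for name in RECOMMENDED)
steps = total_training_steps(1000, CHOSEN["batch_size"], CHOSEN["epochs"])  # 1000 is illustrative
```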
I used the Twitter Scraper to collect the latest 5,000 tweets from each channel. The resulting dataset has the following columns:
- channel — Name of the news channel
- tweet — Text of the tweet.
- Modi — Tweet contains keyword ‘modi’ or not
- Rahul Gandhi — Tweet contains keyword ‘Rahul Gandhi’ or not
- BJP — Tweet contains keyword ‘BJP’ or not
- Congress — Tweet contains keyword ‘Congress’ or not
- Amit Shah — Tweet contains the keyword ‘Amit Shah’ or not
- Arvind Kejriwal — Tweet contains the keyword ‘Arvind Kejriwal’ or not
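The keyword columns above could be produced along these lines. This is a sketch under assumptions: `TOPIC_KEYWORDS` and `flag_topics` are hypothetical names, and matching is case-insensitive on whole words, which the article does not specify. Whole-word matching avoids false hits such as "modi" inside "commodity", but it would also miss run-together hashtags.

```python
import re

# Column name -> keyword searched for in the tweet (lowercased).
TOPIC_KEYWORDS = {
    "Modi": "modi",
    "Rahul Gandhi": "rahul gandhi",
    "BJP": "bjp",
    "Congress": "congress",
    "Amit Shah": "amit shah",
    "Arvind Kejriwal": "arvind kejriwal",
}


def flag_topics(tweet):
    """Return {column: True/False} telling whether each keyword occurs as a whole word."""
    text = tweet.lower()
    return {
        column: re.search(r"\b" + re.escape(keyword) + r"\b", text) is not None
        for column, keyword in TOPIC_KEYWORDS.items()
    }


flags = flag_topics("PM Modi meets Amit Shah ahead of the session")
no_flags = flag_topics("Commodity prices rise again")
```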
Analysis of Media Bias
In total, 20,000 tweets were collected (5,000 per channel). Let’s see to what extent the chosen topics are covered overall.
Number of relevant tweets: 3408
Total number of tweets: 20000
Percentage of relevant tweets: 17.04%
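A tweet presumably counts as relevant when at least one of the keyword columns is set; that reading is an assumption, as is the hypothetical helper `relevant_share` below. The last line simply re-checks the reported percentage from the two reported counts.

```python
def relevant_share(flag_rows):
    """Return (relevant, total, percentage) where a row is relevant if any flag is True."""
    total = len(flag_rows)
    relevant = sum(1 for flags in flag_rows if any(flags.values()))
    percentage = 100.0 * relevant / total if total else 0.0
    return relevant, total, percentage


# Made-up flags for four tweets: two mention at least one topic.
toy_flags = [
    {"Modi": True, "BJP": False},
    {"Modi": False, "BJP": False},
    {"Modi": False, "BJP": True},
    {"Modi": False, "BJP": False},
]
toy_result = relevant_share(toy_flags)

# Re-check of the figures reported above: 3408 relevant out of 20000.
reported_percentage = round(100 * 3408 / 20000, 2)
```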
Relative Topic Coverage
Relative topic coverage shows how important each topic is to each channel. We will plot the number of tweets about every topic for each channel.
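The counts behind such a plot can be computed as sketched here. The function name `topic_counts_by_channel` is hypothetical, the rows are made up, and each row is assumed to carry a `channel` field plus one boolean field per topic, as in the column list given earlier.

```python
from collections import Counter


def topic_counts_by_channel(rows, topics):
    """Count, for every (channel, topic) pair, the tweets flagged with that topic."""
    counts = Counter()
    for row in rows:
        for topic in topics:
            if row.get(topic):
                counts[(row["channel"], topic)] += 1
    return counts


# Made-up rows for illustration only.
toy_rows = [
    {"channel": "NDTV", "Modi": True, "Congress": False},
    {"channel": "NDTV", "Modi": True, "Congress": True},
    {"channel": "Times Now", "Modi": False, "Congress": True},
]
coverage = topic_counts_by_channel(toy_rows, ["Modi", "Congress"])
```

A tweet that mentions two topics is counted once under each, so the per-topic counts for a channel can add up to more than its number of relevant tweets.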
- Every channel except Times Now talks more about Narendra Modi than about Rahul Gandhi.
- The graph shows that Times Now has more tweets about BJP (and Modi) than any other channel.
- For the Congress party, the distribution is nearly equal among all channels, with CNN News-18 having the most tweets about Congress.
- NDTV talks about BJP’s political leaders quite often, and Arvind Kejriwal gets very few tweets on all four channels.
Sentiment towards Topics
We will show the sentiment of each news channel towards our topics by plotting the percentage of positive and negative tweets out of the channel’s total tweets about a topic.
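The percentages described above can be sketched as follows. It assumes each row holds the channel name, a boolean flag per topic, and the predicted sentiment as an integer (0 negative, 1 neutral, 2 positive, as encoded earlier). The name `sentiment_percentages` is hypothetical and the rows are made up.

```python
from collections import defaultdict

ID_TO_LABEL = {0: "negative", 1: "neutral", 2: "positive"}


def sentiment_percentages(rows, topic):
    """Return {channel: {label: percent}} over the tweets flagged with the given topic."""
    tallies = defaultdict(lambda: {"negative": 0, "neutral": 0, "positive": 0})
    for row in rows:
        if row.get(topic):
            tallies[row["channel"]][ID_TO_LABEL[row["sentiment"]]] += 1
    result = {}
    for channel, counts in tallies.items():
        total = sum(counts.values())  # at least 1, since a tally only exists after a hit
        result[channel] = {label: 100.0 * n / total for label, n in counts.items()}
    return result


# Made-up predictions for illustration only.
toy_predictions = [
    {"channel": "NDTV", "Modi": True, "sentiment": 2},
    {"channel": "NDTV", "Modi": True, "sentiment": 2},
    {"channel": "NDTV", "Modi": True, "sentiment": 0},
    {"channel": "NDTV", "Modi": True, "sentiment": 1},
    {"channel": "NDTV", "Modi": False, "sentiment": 0},
    {"channel": "Times Now", "Modi": True, "sentiment": 0},
]
modi_percentages = sentiment_percentages(toy_predictions, "Modi")
```

Because the denominator is the channel's tweets about that topic only, a channel with very few tweets on a topic can show extreme percentages.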
Percentage Positive Tweets
Modi vs Rahul Gandhi
- All four news channels are more positive about Narendra Modi than about Rahul Gandhi.
- NDTV has the highest percentage of positive tweets about Narendra Modi, at 28.38%.
- Rahul Gandhi gets very few positive tweets from any of the four channels.
- The sum of the four channels’ positive percentages for Rahul Gandhi (5.43 + 8.8 + 6.25 + 6.66 = 27.14%) is less than NDTV’s percentage alone (28.38%) for Narendra Modi.
BJP vs Congress
- The news channels appear more interested in individual leaders than in parties, because for both BJP and Congress the percentage of positive tweets is lower.
- Republic TV has more positive tweets about the BJP.
Percentage Negative Tweets
Modi vs Rahul Gandhi
- The figure above clearly shows that all the news channels are strongly negative about Rahul Gandhi.
- That is not the case for Narendra Modi: only Republic TV has a significant negative percentage, and the rest are all below 20%.
BJP vs Congress
- Here too, BJP has a very low percentage of negative tweets from all the channels.
- Congress is not so popular among these channels: every channel has about 25% negative tweets about it, which is much higher than for BJP.
- We can also see that for BJP most of the tweets are neutral (because both the positive and negative percentages are low).