Original article can be found here (source): Deep Learning on Medium
This article will set the scene for a series of articles on retrieving evidence for fake news detection using deep learning.
The series will look at how to detect fake news in an automated manner, which data science frameworks and tools to consider, and how to work with large datasets, including recommendations and precautions when using cloud infrastructure such as AWS.
I’ve always been fascinated by a rumour’s impact on a group of people. At its most basic, every rumour is just a subject of conversation or a folk tale, but at its most severe, a rumour can cause major societal impacts and dynamic shifts in favour of its beneficiaries.
“In a lawsuit the first to speak seems right, until someone comes forward and cross-examines.” — Book of Proverbs
With this frame of thought, every rumour seems true until it is supported or negated. Fake news is a specific type of rumour: false information presented as news. Six main concepts exist relating to fake news:
Each one has its own peculiarities in authenticity, intention and news category, as well as in its methods of detection.
During the 2016 US presidential elections, Donald Trump drove the surge in popularity of the term “Fake News”. In his first press conference as president, he called a CNN reporter “Fake News”. This caused:
- A dynamic shift in favour of Donald Trump.
- A surge of fake news stories on the web.
- An increased interest in automated processes that detect Fake News e.g. PolitiFact.
Automated ways to detect Fake News can be divided into 4 main streams:
- Source-based Detection
- Style-based Detection
- Propagation-based Detection
- Content-based Detection
Each detection strategy is better at finding certain types of rumour. Style-based detection identifies rumour by analysing its written style, and is good at finding disinformation. Propagation-based detection identifies rumour by analysing how it spreads. Source-based detection analyses the roles accounts play in disseminating rumour around a subject. Content-based detection covers analysis types that focus on the false knowledge in fake news. In this study, we focus on misinformation, and we use fact-checking with evidence retrieval to classify it. Fact-checking is a content-based detection strategy.
The type of data source is vital for a robust model that efficiently detects misinformation. There is a wide range of freely available datasets to consider when taking up this problem. In addition to the FEVER Shared Task used in this study, some examples are the FakeNewsChallenge (FNC) dataset and the FootballTransferRumours (FTR-18) dataset.
The FEVER Shared Task
In the shared task, 50 annotators put together 185k labelled claims under 5-way agreement, providing evidence for verifiable claims, with a substantial kappa agreement score of 0.69.
For the task, every verdict given on a claim must be verified by the evidence extracted for that claim. A claim can be classified as SUPPORTED, REFUTED or NOT ENOUGH INFO. Evidence is extracted from Wikipedia articles, which are treated as the source of truth.
Below are the problem and solution formulation details of the Shared task:
Our dataset sizes are:
- 5M+ introductory sections of Wiki pages — 7 GB
- 185k+ claims — 32 MB
The files are in the JSON Lines format, which saves space on disk and can be read straight into a pandas DataFrame for analysis. Because the data size multiplies when calculations run across every claim per Wiki page, we need extra firepower. For this we call in the big guns of the AWS suite to perform these calculations at scale.
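As a quick illustration, JSON Lines can be loaded straight into pandas with `read_json(lines=True)`. The records below are made up for the example and do not reflect the exact FEVER schema:

```python
import io

import pandas as pd

# Two illustrative records in a claims-file-like shape
# (field names are made up, not the exact FEVER schema).
jsonl = io.StringIO(
    '{"id": 1, "claim": "Paris is the capital of France.", "label": "SUPPORTED"}\n'
    '{"id": 2, "claim": "The Moon is made of cheese.", "label": "REFUTED"}\n'
)

# lines=True tells pandas to parse one JSON object per line.
df = pd.read_json(jsonl, lines=True)
print(df.shape)               # (2, 3)
print(df["label"].tolist())   # ['SUPPORTED', 'REFUTED']
```

One JSON object per line also means the large Wiki dump can be streamed and processed in chunks rather than loaded all at once.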
We followed the standard 3 stage process proposed in the FEVER Shared task to solve the problem.
Stage 1: Document Retrieval
The document retrieval step involves extracting the top N documents most relevant to a claim from a corpus. It is important that document retrieval has high recall, so that documents containing the required evidence are extracted early and that evidence gets a chance to filter through the rest of the process.
To improve recall, the Named Entity Recognition (NER) module provided in the Python spaCy package is used to acquire the named entities in each text. We then filter the documents and perform other transformations and data enrichment before applying TF-IDF to each Wiki page per claim, ranking documents by importance, and retrieving the top N documents.
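A minimal sketch of the TF-IDF ranking step using scikit-learn on a toy corpus; the spaCy NER filtering is assumed to have already narrowed down the candidate pages, and the documents and claim below are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for Wiki introductory sections (post NER filtering).
docs = [
    "Paris is the capital and most populous city of France.",
    "Berlin is the capital of Germany.",
    "The Eiffel Tower is a landmark in Paris, France.",
]
claim = "Paris is the capital of France."

# Fit TF-IDF on the candidate documents, then project the claim
# into the same vector space.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
claim_vector = vectorizer.transform([claim])

# Rank documents by cosine similarity to the claim and keep the top N.
scores = cosine_similarity(claim_vector, doc_vectors)[0]
top_n = 2
ranked = scores.argsort()[::-1][:top_n]
print([docs[i] for i in ranked])
```

Ranking by cosine similarity over TF-IDF vectors is what lets us keep only the N most important documents per claim instead of scanning all 5M+ pages downstream.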
Stage 2: Sentence Selection
In Sentence Selection, we extract the top L sentences most relevant to a claim from the top N documents, which are first split into sentences. Unlike Document Retrieval, Sentence Selection involves retrieving evidence for a claim, so we want to extract sentences that are highly interrelated with the claim. Because this goes far beyond the word importance provided by TF-IDF, we use Word2Vec embeddings for word vectorization, feed these into LSTM-variant architectures, classify and rank the scores for our claim–sentence pairs, and select the top L sentences.
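To illustrate only the ranking idea, here is a toy sketch that averages made-up word vectors and ranks sentences by cosine similarity to the claim. In the actual pipeline the vectors come from a trained Word2Vec model and the claim–sentence scoring is done by an LSTM-variant classifier, not plain cosine similarity:

```python
import numpy as np

# Tiny hand-made stand-in embeddings (a real pipeline would load
# trained Word2Vec vectors instead).
emb = {
    "paris":   np.array([1.0, 0.0, 0.0]),
    "capital": np.array([0.0, 1.0, 0.0]),
    "france":  np.array([0.0, 0.0, 1.0]),
    "berlin":  np.array([0.5, 0.5, 0.0]),
}

def embed(text):
    """Average the vectors of the known words in a text."""
    vecs = [emb[w] for w in text.lower().split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

claim = "paris capital france"
sentences = ["berlin capital", "paris capital france", "paris"]

# Score every claim-sentence pair, rank, and keep the top L sentences.
top_l = 2
scored = sorted(sentences, key=lambda s: cosine(embed(claim), embed(s)),
                reverse=True)
print(scored[:top_l])
```

The exact-match sentence ranks first, which is the behaviour we want the learned LSTM scorer to reproduce on real claim–sentence pairs.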
Stage 3: Claim Verification
In Claim Verification, we classify each claim as SUPPORTED, REFUTED or NOT ENOUGH INFO using the evidence retrieved. Each claim is concatenated with the top L sentences before being fed either into an LSTM-variant architecture or a machine learning classifier (e.g. Random Forest). The ML classifier requires that morphological, grammatical and lexical features are extracted from the text. The chosen classifier provides the verdict.
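A sketch of the classifier route using scikit-learn's Random Forest. The features below are invented toy stand-ins for the morphological, grammatical and lexical features mentioned above, and the tiny training set exists only to make the example runnable:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy features per claim+evidence pair (invented for illustration):
# [word-overlap ratio, evidence length, negation flag]
X_train = [
    [0.90, 12, 0],   # strong overlap, no negation -> SUPPORTED
    [0.80, 10, 1],   # strong overlap, negated     -> REFUTED
    [0.10,  3, 0],   # little overlap              -> NOT ENOUGH INFO
    [0.85, 11, 0],
    [0.75,  9, 1],
    [0.05,  4, 0],
]
y_train = [
    "SUPPORTED", "REFUTED", "NOT ENOUGH INFO",
    "SUPPORTED", "REFUTED", "NOT ENOUGH INFO",
]

# Fit the forest and ask for a verdict on a new claim+evidence pair.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
verdict = clf.predict([[0.88, 11, 0]])[0]
print(verdict)
```

The same fit/predict pattern applies whatever feature set is extracted; only the feature engineering and the label vocabulary are fixed by the task.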
In line with the Shared task, a verdict is only correct if the evidence required for verification is present.
Working with Large Datasets on AWS
AWS services like SageMaker and Redshift were used to make the data processing and solution formulation in these pipelines feasible. I will highlight how and where these services were used in the following articles. For now, I will leave you with some lessons, notes and takeaways from working with large datasets like the one encountered in this study.