Using NLP to Detect Fake News



Fake News Everywhere 📢

Introduction

Nowadays, information is easily accessible from anywhere 🌏. In this age of information, an individual can follow events happening around the world from the comfort of his/her own home. However, the same ease of publishing has also led to the spread of inaccurate and misleading information, commonly known as fake news 😨. Since a large proportion of the population relies on social media to keep up with the news, delivering accurate and trustworthy information to them is of utmost importance. Because anyone can publish news quickly on social media, its credibility is often compromised, which creates a need for fake news detection. Fake news detection has recently garnered much attention from researchers 👨‍🔬 and developers alike. This work proposes to detect fake news efficiently from the available modalities using Deep Learning algorithms such as the Convolutional Neural Network 🕸️ and Long Short-Term Memory.

METHODOLOGY

A. Fake News Classification with Deep Learning

  • Deep learning is a subset of machine learning that offers many useful and efficient algorithms compared with traditional learning algorithms. In deep learning, the performance of a model generally scales with the amount of data fed to it. A Convolutional Neural Network and a Long Short-Term Memory network were used to build 🔧 the fake-news detection model.
Fig. 1. Deep Learning Model Performance with Amount of Data

B. Convolutional Neural Network

  • A Convolutional Neural Network is a type of Artificial Neural Network that uses perceptrons for cognitive tasks such as image processing and language processing 👨🏻‍🚀. It is a class of deep neural networks also known as shift-invariant or space-invariant artificial neural networks. CNNs are a regularized version of the multilayer perceptron, which is a fully connected network in which each node in a layer is connected to all the nodes in the next layer.
  • A CNN requires little to no manual pre-processing of the data, because the filters that would otherwise have to be hand-engineered are learned by the network on its own (a minimal sketch follows Fig. 2).
Fig. 2. CNN Architecture
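To make the architecture concrete, below is a minimal sketch of a 1-D convolutional text classifier in Keras. The vocabulary size, embedding dimension and layer sizes are illustrative placeholders, not the values used in the article's notebook.

```python
# Minimal 1-D CNN text classifier (illustrative configuration, not the notebook's exact one)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dropout, Dense

VOCAB_SIZE = 20000   # assumed vocabulary size (placeholder)
EMBED_DIM = 128      # assumed embedding dimension (placeholder)

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM),              # word indices -> dense word vectors
    Conv1D(64, kernel_size=5, activation="relu"),  # filters act like learned n-gram detectors
    GlobalMaxPooling1D(),                          # keep the strongest response of each filter
    Dropout(0.5),                                  # regularization
    Dense(1, activation="sigmoid"),                # 2-way output: real vs fake
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```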

C. Long Short-Term Memory

  • Long Short-Term Memory 🧬 is an artificial neural network architecture used to process sequential data such as speech, audio and text. An LSTM unit consists of a cell and three gates, namely an input gate, a forget gate and an output gate. Unlike feed-forward architectures, the LSTM has feedback connections, and its gates regulate the flow of information into and out of the cell (the gate computations are sketched after Fig. 3).
  • The architecture is designed so that it can remember long-term dependencies in the data presented to it. It overcomes the vanishing gradient problem that arises when training standard Recurrent Neural Networks. These models can be trained in both a supervised and an unsupervised manner.
Fig. 3. LSTM Architecture
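As an illustration of the gating described above, the following NumPy sketch (not part of the original article) writes out a single LSTM time step; the parameter dictionaries W, U and b are assumed to hold one weight matrix or bias vector per gate.

```python
# One LSTM time step written out with NumPy to show the three gates (illustrative only)
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b are dicts with one weight matrix / bias per gate."""
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate: how much new info to write
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate: how much old state to keep
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate: how much of the cell to expose
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])  # candidate cell content
    c_t = f * c_prev + i * g                              # new cell state carries long-term memory
    h_t = o * np.tanh(c_t)                                # new hidden state is the gated output
    return h_t, c_t
```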

D. Word Vector

  • The text in the dataset is converted into word vectors ⚔️ using techniques such as a co-occurrence matrix, a count vectorizer or TF-IDF vectorizer, and the continuous bag-of-words or skip-gram models (Word2Vec). Each word in a sentence is converted to a vector using an embedding technique. Pre-trained word embedding models such as GloVe, ELMo, fastText and BERT can be used to obtain vector representations of words (a small tokenization sketch follows Fig. 4).
  • GloVe, for example, is based on the observation that ratios of word-word co-occurrence probabilities can encode meaning. It is trained on the non-zero entries of a word-word co-occurrence matrix, which records how frequently words co-occur with each other.
Fig. 4. Euclidean representation of words using Word Embedding
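As a concrete illustration, the sketch below uses the Keras Tokenizer to turn raw titles into padded integer sequences that an embedding layer can consume. The example sentences and the num_words/maxlen values are made up for illustration only.

```python
# Turning raw titles into padded integer sequences (made-up examples, placeholder settings)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

titles = [
    "scientists discover water on distant planet",   # made-up example titles
    "celebrity secretly replaced by robot double",
]

tokenizer = Tokenizer(num_words=20000, oov_token="<unk>")  # keep the 20k most frequent words
tokenizer.fit_on_texts(titles)

sequences = tokenizer.texts_to_sequences(titles)           # words -> integer indices
padded = pad_sequences(sequences, maxlen=100, padding="post")
print(padded.shape)  # (2, 100): one fixed-length row of word indices per title
```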

E. Fakeddit Dataset

  • The dataset is called Fakeddit because it is derived from Fake News + Reddit. Fakeddit is a novel dataset comprising around 800,000 examples drawn from different categories of fake news. Each example is labelled for 2-way, 3-way, and 5-way classification. The dataset contains features such as the text, clean title, number of upvotes, comments, score, and upvote ratio.
  • The text from the dataset 📅 is fed into the model, where the words and sentences are converted into vectors and passed through layers with a decreasing number of nodes, until the output layer finally classifies the example as real or fake. In this model, we use the feature “clean title” in the dataset as input; a loading sketch follows this list. The training dataset consists of 69,954 entries for each column. (Link for the dataset: https://github.com/entitize/Fakeddit)
Fig. 5. Word Cloud for the Fakeddit Dataset
  • The most frequent words in the real and fake classes have been plotted to visualize which terms indicate whether a given sentence is fake or real.
Fig. 6. Scatter Plot between Fake v/s Real Words in the dataset
  • Fakeddit provides a large quantity of text+image samples with multiple labels for various levels of fine-grained classification.
  • With such a large number of data points, the dataset supports more generalizable results and helps the model assess the credibility of news more reliably.
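Below is a hedged sketch of loading the text portion of Fakeddit with pandas. The file name fakeddit_train.tsv and the column names clean_title and 2_way_label are assumptions based on the features described above, so they may need adjusting to match the actual download.

```python
# Loading the text portion of Fakeddit (file and column names are assumptions)
import pandas as pd

df = pd.read_csv("fakeddit_train.tsv", sep="\t")  # assumed local copy of the training split

texts = df["clean_title"].astype(str)   # the "clean title" feature used as model input
labels = df["2_way_label"]              # assumed name of the 2-way (real/fake) label column

print(len(texts), "training examples")
```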

Code

The Jupyter notebook and the code 👨‍💻 can be found in my GitHub repository: https://github.com/Arjun009/Fake-News-Detection.

Following is a basic implementation of the LSTM model in Python.
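The code block embedded in the original post is not reproduced here; the following is a minimal Keras sketch of such an LSTM classifier, with placeholder hyperparameters (VOCAB_SIZE, EMBED_DIM, layer sizes) standing in for the exact values used in the linked notebook.

```python
# Minimal LSTM fake-news classifier in Keras (placeholder hyperparameters; see the linked
# notebook for the exact configuration used in the article)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

VOCAB_SIZE = 20000   # placeholder vocabulary size
EMBED_DIM = 128      # placeholder embedding dimension

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM),   # word indices -> dense word vectors
    LSTM(64),                           # gated recurrent layer that reads the title as a sequence
    Dropout(0.5),                       # regularization
    Dense(1, activation="sigmoid"),     # real vs fake
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Example usage, assuming padded_titles and labels prepared as in the earlier sketches:
# model.fit(padded_titles, labels, validation_split=0.1, epochs=5, batch_size=64)
```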

RESULTS

The fake-news detection model was successfully 💯 built, achieving an accuracy of more than 90% on the training set and more than 80% on the validation set.