Original article was published on Artificial Intelligence on Medium
1. Exploratory Data Analysis
Create a file named eda.ipynb or eda.py in your project directory.
We will first import all the required packages.
#Importing all the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
Now we will read the fake news dataset using pd.read_csv() and then explore it.
In cell 4 of the above notebook, we count the number of fake news samples in each subject. We also plot the distribution using a seaborn count plot.
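A minimal sketch of that counting step is shown here. The inline CSV is made-up sample data standing in for fake.csv, and the column names (title, text, subject, date) are assumed from the columns this tutorial drops later.

```python
import io
import pandas as pd

# Made-up rows standing in for fake.csv; only the column names follow the tutorial.
sample_csv = io.StringIO(
    "title,text,subject,date\n"
    "Headline A,Some text,News,2017-01-01\n"
    "Headline B,More text,politics,2017-01-02\n"
    "Headline C,Other text,News,2017-01-03\n"
)
fake = pd.read_csv(sample_csv)

# Number of articles per subject, most frequent first
subject_counts = fake["subject"].value_counts()
print(subject_counts)

# With seaborn imported as sns, the same distribution can be plotted with:
# sns.countplot(x="subject", data=fake)
```

On the real dataset you would pass the path of fake.csv to pd.read_csv() instead of the in-memory sample.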
We will now plot a word cloud by first concatenating all the news into a single string, then generating tokens and removing stopwords. A word cloud is a good way to visualize text data.
As you can see in the next cell, we now import true.csv as the real news dataset and perform the same steps as we did on fake.csv. One difference you'll notice in the real news dataset is that the text column starts with a publication tag such as WASHINGTON (Reuters), separated from the story by a hyphen (-).
It seems that the real news is credible as it comes from a publication house, so we will separate the publication from the news part to make the dataset uniform in the preprocessing part of this tutorial. For now, we’ll just explore the dataset.
If you are following along, you can see that the subject column is distributed differently in the real and fake news datasets, so we will drop this column later. That concludes our EDA.
Now we can get our hands dirty with the part you have been waiting for. I know EDA can feel frustrating, but EDA and preprocessing are among the most important steps in any data science lifecycle.
2. Preprocessing and Model Training
In this part we will perform some preprocessing steps on our data and train our model using insights obtained from the EDA we did previously.
To follow along with the code in this part, open the train.ipynb file. Without further ado, let's get started.
As usual, we start by importing all of the packages and reading the data. We will first remove Reuters from the text column of the real data. Since Reuters is absent from some rows, we first collect the indices of those rows.
Removing Reuters or Twitter Tweet information from the text
- Text can be split only once at " -", which is always present after the source of publication; this gives us the publication part and the text part
- If we do not get a text part, publication details weren't given for that record
- The Twitter tweets always have the same source, a long text of max 259 characters
#First creating a list of indices that do not have a publication part
unknown_publishers = []
for index, row in enumerate(real.text.values):
    try:
        record = row.split(" -", maxsplit=1)
        #if no text part is present, following will give error
        record[1]
        #if len of publication part is greater than 260
        #following will give error, ensuring no text having "-" in between is counted
        assert(len(record[0]) < 260)
    except (IndexError, AssertionError):
        unknown_publishers.append(index)
To summarize in one line: the code above collects the indices of the rows in the real dataset whose text has no publisher.
Now we will separate the Reuters from the text column.
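# separating publishers from the news text
publisher = []
tmp_text = []
for index, row in enumerate(real.text.values):
    if index in unknown_publishers:
        #add text to tmp_text and "Unknown" to publisher
        tmp_text.append(row)
        publisher.append("Unknown")
        continue
    record = row.split(" -", maxsplit=1)
    publisher.append(record[0])
    tmp_text.append(record[1])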
In the above code we iterate over the text column and check whether the index belongs to unknown_publishers. If it does, we append the text as it is to tmp_text and "Unknown" to the publisher list. Otherwise we split the text into publisher and news text and append each to its respective list.
#Replace existing text column with new text
#add separate column for publication info
real["publisher"] = publisher
real["text"] = tmp_text
The above code is pretty self-explanatory: we add a new publisher column and replace the text column with the news text, now without the Reuters prefix.
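The publisher-splitting logic above can also be folded into one small helper. This is only a sketch: the function name split_publisher and the sample strings are hypothetical, while the " -" separator and the 260-character limit come from the code above.

```python
def split_publisher(text, max_publisher_len=260):
    """Return (publisher, body); publisher is "Unknown" when none is found."""
    parts = text.split(" -", maxsplit=1)
    # No separator found, or the prefix is too long to be a publication tag
    if len(parts) < 2 or len(parts[0]) >= max_publisher_len:
        return "Unknown", text
    return parts[0], parts[1]

# Made-up sample records, one with a publication tag and one without
samples = [
    "WASHINGTON (Reuters) - The Senate voted on Tuesday.",
    "A record with no publication prefix at all.",
]
results = [split_publisher(s) for s in samples]
for publisher_part, body in results:
    print(repr(publisher_part), repr(body))
```

Note that the body keeps its leading space, exactly as in the loop above; call .strip() on it if you want it removed.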
We will now check whether there are any missing values in the text column of both the real and fake news datasets and drop those rows.
If we check the fake news dataset, we will see that many rows have missing text values and the whole news is present in the title column, so we will merge the title into the text.
real['text'] = real['text'] + " " + real['title']
fake['text'] = fake['text'] + " " + fake['title']
Next we will add class to our dataset, drop unnecessary columns and merge our data into one.
# Adding class info
real['class'] = 1
fake['class'] = 0
# Subject is different for real and fake, thus dropping it
# Also dropping date, title and publisher
real.drop(["subject", "date", "title", "publisher"], axis=1, inplace=True)
fake.drop(["subject", "date", "title"], axis=1, inplace=True)
# Combining both into a new dataframe
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
data = pd.concat([real, fake], ignore_index=True)
Removing StopWords, Punctuations, and single-character words. (very common and basic task in any NLP project).
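A minimal sketch of this cleaning step, using only the standard library. The stopword set below is a tiny illustrative stand-in (a real project would use a fuller list, for example the NLTK English stopwords), and the function name clean_tokens is hypothetical.

```python
import string

# Tiny illustrative stopword list, not the one used in the tutorial
STOP_WORDS = {"the", "a", "an", "is", "in", "on", "of", "and", "to"}

def clean_tokens(text):
    """Lowercase, strip punctuation, drop stopwords and one-character words."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]

tokens = clean_tokens("The senator, in a speech on Tuesday, said: 'I object!'")
print(tokens)
```

Applying such a function to every article gives a list of token lists, which is the shape gensim's Word2Vec expects for its sentences argument.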
Word2Vec is one of the most popular techniques for learning word embeddings using shallow neural networks. It was developed by Tomas Mikolov in 2013 at Google. Word embeddings are among the most popular representations of document vocabulary. They can capture the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.
Let’s create our Word2Vec model.
#install gensim if you haven't already
#!pip install gensim
import gensim
#Dimension of vectors we are generating
EMBEDDING_DIM = 100
#Creating Word Vectors by Word2Vec Method
#(gensim 4.x names this parameter vector_size; older releases called it size)
w2v_model = gensim.models.Word2Vec(sentences=X, vector_size=EMBEDDING_DIM, window=5, min_count=1)
#vocab size
len(w2v_model.wv.key_to_index)
#We have now represented each of 122248 words by a 100dim vector.
These Vectors will be passed to LSTM/GRU instead of words. 1D-CNN can further be used to extract features from the vectors.
Keras has an implementation called “Embedding Layer” which would create word embeddings(vectors). Since we did that with gensim’s word2vec, we will load these vectors into the embedding layer and make the layer non-trainable.
We cannot pass string words to the embedding layer, so we need some way to represent each word by a number.
Tokenizer can represent each word by a number.
# Tokenizing Text -> Representing each word by a number
# Mapping of original word to number is preserved in word_index property of tokenizer
# Tokenizer applies basic processing such as lowercasing by default (pass lower=False to disable it)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
X = tokenizer.texts_to_sequences(X)
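To make the word-to-number mapping concrete, here is a simplified pure-Python sketch of the idea. It is not the Keras implementation: the function names are hypothetical, and it only mirrors the general behaviour of giving more frequent words smaller indices, starting at 1 so that 0 stays free.

```python
from collections import Counter

def fit_word_index(tokenized_docs):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(tokenized_docs, word_index):
    """Replace each known word by its index; words not in the index are skipped."""
    return [[word_index[w] for w in doc if w in word_index] for doc in tokenized_docs]

# Made-up tokenized documents
docs = [["senate", "passes", "bill"], ["senate", "rejects", "bill", "again"]]
word_index = fit_word_index(docs)
sequences = texts_to_sequences(docs, word_index)
print(word_index)
print(sequences)
```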
We create a matrix that maps each word index to its vector and use it as the weights of the embedding layer. The embedding layer accepts the numerical token of a word and outputs the corresponding vector to the inner layer. It sends a vector of zeros to the next layer for unknown words, which would be tokenized to 0. The input length of the embedding layer is the length of each news item (700 now, due to padding and truncating).
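The two ideas in this paragraph, fixed-length sequences and an index-to-vector weight matrix, can be sketched as follows. Everything here is illustrative: the helper names and toy vectors are hypothetical, the sizes are shrunk for readability, and the left-side padding and truncation is an assumption chosen to match what I believe are the Keras pad_sequences defaults.

```python
import numpy as np

EMBEDDING_DIM = 4   # the tutorial uses 100; kept small here for readability
MAXLEN = 5          # the tutorial uses 700

def pad_or_truncate(seq, maxlen=MAXLEN):
    """Left-pad with zeros, or keep only the last maxlen tokens."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

def build_weight_matrix(word_vectors, word_index, dim=EMBEDDING_DIM):
    """Row i holds the vector of the word with index i; row 0 stays all zeros."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, i in word_index.items():
        matrix[i] = word_vectors[word]
    return matrix

# Hypothetical toy vectors standing in for the trained Word2Vec vectors
word_index = {"senate": 1, "bill": 2}
word_vectors = {"senate": np.ones(EMBEDDING_DIM), "bill": np.full(EMBEDDING_DIM, 2.0)}

embedding_vectors = build_weight_matrix(word_vectors, word_index)
print(embedding_vectors.shape)
print(pad_or_truncate([1, 2]))
```

Because row 0 is never filled, the padding token 0 always maps to a zero vector, which is the behaviour described above.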
Now we will create a sequential Neural Network model and add the weights generated from w2v in the embedding layer and also add an LSTM layer.
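#Defining Neural Network
model = Sequential()
#Non-trainable embedding layer
model.add(Embedding(vocab_size, output_dim=EMBEDDING_DIM, weights=[embedding_vectors], input_length=maxlen, trainable=False))
#LSTM layer (the unit count of 128 is an assumption; the text does not state it)
model.add(LSTM(units=128))
#Single sigmoid output to match the binary_crossentropy loss
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])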
Let's now split the dataset into a train set and a test set using the sklearn train_test_split method.
Let's train the model using model.fit(X_train, y_train, validation_split=0.3, epochs=6). It will take some time; on my machine it took around 40 minutes, so sit back, have some coffee, and relax.
After training is done, we will evaluate the model on the test dataset and generate a report of its accuracy, precision, and recall.
Wow, we got 99% accuracy with good precision and recall, so our model looks good. Now let's save it to disk so we can use it in our web application.
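For readers who want to see what these metrics mean, here is a small pure-Python sketch that derives them from sigmoid outputs. The 0.5 threshold, the function name binary_report, and the sample numbers are assumptions for illustration; they are not results from the model trained above.

```python
def binary_report(y_true, y_prob, threshold=0.5):
    """Accuracy, precision and recall for the positive (real news) class."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": correct / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up labels and sigmoid outputs, not results from the tutorial's model
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6]
report = binary_report(y_true, y_prob)
print(report)
```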
3. Building and deploying a web app
I am not going into much detail in this part; I'd recommend going through my code, which is easy to understand. If you've followed along so far, you should have the same directory structure; if not, just change the path variables accordingly.
Now upload the whole directory into a GitHub repository.
We will host our web app on Heroku. So if you haven’t already, create a free account on Heroku and then:
- Click on create new app
- Then select a name
- Select GitHub and select the repository you want to deploy from
- Click on deploy.
And BOOM it’s done, your fake news classifier is now live.
If you've made it to the end, congratulations: you can now build and deploy a complex machine learning application.
I know it was a lot to take in, but kudos to you for making it this far.
Note: The app works on most news; just remember to paste the whole paragraph of the news, preferably US news, because the dataset was constrained to US news.
In case we haven't met already, I am Eish Kumar. You can follow me on LinkedIn: https://www.linkedin.com/in/eish-kumar/.
Follow me for more such articles.