Fake News Classification Using LSTM And Word Embedding layers in Keras

Original article was published on Deep Learning on Medium

Before we start the model implementation, we need to understand a few concepts behind word embeddings and LSTMs. Let’s begin with those.

Word Embedding

Word embedding is a technique for representing text in which words with similar meanings have similar representations. This approach to representing words and documents may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.
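
To make "similar representation" concrete, here is a minimal sketch using hand-made 3-dimensional vectors. The vectors and the `cosine_similarity` helper are assumptions for illustration only; real embeddings are learned during training and usually have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, not taken from any trained model
toy_embeddings = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "car":   [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_unrelated = cosine_similarity(toy_embeddings["good"], toy_embeddings["car"])
print(sim_related > sim_unrelated)  # True
```

Words with related meanings point in nearly the same direction, so their cosine similarity is close to 1, while unrelated words score close to 0.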

For more detail, you can refer to this article

LSTM (Long Short-Term Memory)

LSTM is a type of recurrent neural network (RNN) designed by Hochreiter & Schmidhuber. It addresses the long-term dependency problem of plain RNNs: an RNN can make accurate predictions from recent information, but it struggles to use information from far back in the sequence, and its performance degrades as the gap grows. An LSTM can retain information for a long period of time. It is used for processing, predicting, and classifying on the basis of time-series data.
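
The gating idea can be sketched with a single LSTM cell whose input and state are plain numbers. This is a simplified illustration, not the Keras implementation: the `lstm_step` function and the hand-picked weights in `params` are assumptions chosen so that the forget gate stays near 1.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of an LSTM cell with scalar input and scalar state."""
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)       # how much of the old cell state to keep
    i = gate("input", sigmoid)        # how much new information to write
    g = gate("candidate", math.tanh)  # the new information itself
    o = gate("output", sigmoid)       # how much of the cell state to expose
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Hand-picked (hypothetical) weights: (w_x, w_h, bias) per gate
params = {
    "forget":    (0.0, 0.0, 6.0),    # forget gate stays close to 1
    "input":     (10.0, 0.0, -5.0),  # opens only when x is large
    "candidate": (5.0, 0.0, 0.0),
    "output":    (0.0, 0.0, 6.0),    # output gate stays close to 1
}

h, c = 0.0, 0.0
sequence = [1.0] + [0.0] * 20  # one signal followed by 20 empty steps
for x in sequence:
    h, c = lstm_step(x, h, c, params)
print(round(c, 3))
```

With these weights the cell state still holds most of the first input after 20 further steps, which is the long-term memory behavior described above. In a trained model the weights are learned rather than set by hand.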

Structure of LSTM

Implementation

Importing and Loading the datasets

import pandas as pd
df=pd.read_csv('train.csv')
df.head()

This dataset has 5 features, but we consider only 4 of them here, because the id column is not meaningfully correlated with the dependent variable.

The dataset is available on Kaggle.

Let’s see the descriptive statistics

df.describe()

Now, drop the rows that contain null values

df = df.dropna()  # drop rows with NaN values

Let’s cross-check for null values

df.isnull().sum()

Get the independent and dependent variables

# Get the independent features
X = df.drop('label', axis=1)
# Get the dependent feature
y = df['label']

Here we are using TensorFlow 2.0. Let’s import the modules.

from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense

Before building the embedding layer, we have to set the vocabulary size

# Vocabulary size
vo_size = 500

Now, copy X and reset its index, since dropping rows left gaps in the original index

messages=X.copy()
messages.reset_index(inplace=True)

Now, import the stopwords corpus

import nltk
import re
nltk.download('stopwords')  # download the stopword list if it is not installed yet
from nltk.corpus import stopwords  # a corpus is a collection of text

Data Preprocessing

# Dataset preprocessing
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
corpus = []
for i in range(len(messages)):
    # Keep letters only, lowercase, and split into words
    review = re.sub('[^a-zA-Z]', ' ', messages['title'][i])
    review = review.lower()
    review = review.split()
    # Stem each word and drop stopwords
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
corpus

Looking at some examples from the corpus, we can see that the words are now clean.
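
To see what the cleaning loop does to a single title, here is a small standalone sketch. It assumes a hypothetical mini stopword list and skips stemming so that it runs without NLTK; the `clean_title` helper and the sample title are made up for illustration.

```python
import re

# Hypothetical mini stopword list; the article uses NLTK's full English list
stop_words = {"the", "a", "an", "of", "is", "in", "s"}

def clean_title(title):
    """Keep letters only, lowercase, split, and drop stopwords (no stemming)."""
    letters_only = re.sub('[^a-zA-Z]', ' ', title)
    words = letters_only.lower().split()
    return ' '.join(word for word in words if word not in stop_words)

cleaned = clean_title("Breaking: The President's 2nd Speech!")
print(cleaned)  # breaking president nd speech
```

Note that the regular expression also removes digits, which is why "2nd" is reduced to "nd".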

One-hot Representation

onehot_rep = [one_hot(words, vo_size) for words in corpus]
onehot_rep

We can see how each sentence is converted into a list of integer word indices.
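
Despite its name, Keras `one_hot` returns integer indices obtained by hashing each word, so two different words can occasionally share an index. The sketch below illustrates that idea only: the `hashed_index` and `encode` helpers and the use of MD5 are assumptions, so the numbers will not match what Keras produces.

```python
import hashlib

def hashed_index(word, vocab_size):
    """Map a word to an integer in [1, vocab_size - 1]; 0 is left free for padding."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (vocab_size - 1) + 1

def encode(sentence, vocab_size):
    """Encode a whitespace-separated sentence as a list of hashed word indices."""
    return [hashed_index(word, vocab_size) for word in sentence.split()]

vo_size = 500
encoded = encode("breaking president speech breaking", vo_size)
print(encoded)
```

The same word always maps to the same index, and every index stays below the vocabulary size, which is what the Embedding layer needs.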

Now, pad the sequences to a common length so that they form a rectangular matrix for the Embedding layer.

sent_length = 20
embedded_doc=pad_sequences(onehot_rep, padding='pre', maxlen=sent_length)
print(embedded_doc)

For example, you can see how zeros are added at the front of a short sequence.

embedded_doc[2]
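
The effect of padding='pre' on one sequence can be sketched in plain Python. The `pad_pre` helper below is a hypothetical stand-in for illustration; it mirrors the `pad_sequences` defaults in that sequences longer than maxlen lose items from the front.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

short = pad_pre([12, 7, 431], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 12, 7, 431]
print(long)   # [3, 4, 5, 6, 7]
```

Padding at the front keeps the real words at the end of the sequence, closest to the final LSTM step.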

Model Building

embedding_vector_feature = 10  # dimension of each word vector
model = Sequential()
model.add(Embedding(vo_size, embedding_vector_feature, input_length=sent_length))
model.add(LSTM(100))  # 100 LSTM units
model.add(Dense(1, activation='sigmoid'))  # single probability output for the binary label
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()  # summary() prints the table itself

We can see the model summary. Now let’s train the model.
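
The parameter counts in a summary like this can be checked by hand. The arithmetic below is derived from the code in this article (vocabulary size 500, embedding dimension 10, 100 LSTM units), assuming the standard LSTM layout with four gates and a bias per unit; the variable names are only for illustration.

```python
vo_size = 500
embedding_dim = 10
lstm_units = 100

# One vector of length embedding_dim per vocabulary index
embedding_params = vo_size * embedding_dim
# Four gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)
# One weight per LSTM unit plus one bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 5000 44400 101 49501
```

Most of the parameters sit in the LSTM layer, so changing the number of units has the largest effect on model size.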

Splitting data into train and test

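import numpy as np
from sklearn.model_selection import train_test_split
X_final, y_final = np.array(embedded_doc), np.array(y)  # padded sequences and labels as arrays
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.30, random_state=41)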

Here, 30% of the data is held out for testing the model.

model.fit(X_train,y_train, validation_data=(X_test,y_test), epochs=10, batch_size=64)

Now that the model is trained, let’s check its performance using a confusion matrix and accuracy.

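# predict_classes was removed in newer TensorFlow releases, so threshold the sigmoid output instead
y_pred = (model.predict(X_test) > 0.5).astype('int32')
from sklearn.metrics import confusion_matrix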

Let’s check the confusion matrix

confusion_matrix(y_test,y_pred)
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)
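
To show how these two metrics relate, here is a small pure-Python sketch. The labels and sigmoid outputs are made up for illustration, and `confusion_counts` is a hypothetical helper that follows the same layout as scikit-learn (rows are true labels, columns are predicted labels).

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels (rows: true, columns: predicted)."""
    matrix = [[0, 0], [0, 0]]
    for truth, pred in zip(y_true, y_pred):
        matrix[truth][pred] += 1
    return matrix

# Hypothetical labels and sigmoid outputs, for illustration only
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
probabilities = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2, 0.7, 0.3]
y_hat = [1 if p > 0.5 else 0 for p in probabilities]  # threshold at 0.5

matrix = confusion_counts(y_true, y_hat)
accuracy = (matrix[0][0] + matrix[1][1]) / len(y_true)
print(matrix)    # [[3, 1], [1, 3]]
print(accuracy)  # 0.75
```

Accuracy is the sum of the diagonal of the confusion matrix divided by the number of samples, so the matrix also shows which kind of error the model makes more often.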

Here we got 90 percent accuracy, which is really good for predicting fake news. You can change some parameters or try the dropout technique. This is how we used LSTM and word embeddings here.