Original article was published on Deep Learning on Medium
Before we start the model implementation, we need to understand some concepts of word embeddings and LSTMs, so let's begin with those first.
Word Embedding is the technique of representation for text where words that have the same meaning have a similar representation. It is this approach to representing words and documents that may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.
For more detail, you can refer to this article.
LSTM (Long Short-Term Memory)
LSTM is a type of recurrent neural network designed by Hochreiter & Schmidhuber. It addresses the long-term dependency problem of plain RNNs: an RNN gives accurate predictions from recent information, but it cannot make good use of words stored far back in the sequence, and its performance degrades as that gap grows. An LSTM, by contrast, can retain information over long periods of time. It is used for processing, predicting, and classifying on the basis of time-series and sequence data.
Structure of LSTM
Importing and Loading the Dataset
import pandas as pd
This dataset has 5 features, but we will use only 4 of them, because the id column has no relationship to the dependent variable.
Dataset is available on Kaggle.
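Loading the data can be sketched as follows. This is a minimal stand-in: the article confirms the id, title, and label columns, while the author and text columns and the train.csv file name are assumptions based on the usual Kaggle fake-news layout.

```python
import pandas as pd
from io import StringIO

# Tiny inline stand-in for the Kaggle CSV; in practice you would run:
# df = pd.read_csv('train.csv')
sample = StringIO(
    "id,title,author,text,label\n"
    "0,First headline,Author A,Body text one,1\n"
    "1,Second headline,Author B,Body text two,0\n"
)
df = pd.read_csv(sample)
print(df.shape)  # (2, 5) -> the 5 features mentioned above
```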
Let's look at the descriptive statistics with df.describe().
Now, drop the rows that contain null values.
df = df.dropna()  # drop the NaN values
Let's cross-check for null values.
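The cross-check can be done with isnull().sum(). A small self-contained sketch, using a toy frame in place of the real dataset:

```python
import pandas as pd

# toy frame standing in for df after dropna()
df = pd.DataFrame({'title': ['a', 'b'], 'label': [1, 0]})
print(df.isnull().sum())  # every per-column count should now be 0
```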
Get the Independent and dependent variables
# Get the dependent feature
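The split into independent and dependent variables can be sketched as below; 'label' as the target column name is an assumption based on the Kaggle fake-news dataset, and the toy frame stands in for the loaded data.

```python
import pandas as pd

# toy frame standing in for the loaded dataset
df = pd.DataFrame({'id': [0, 1], 'title': ['a', 'b'], 'label': [1, 0]})

# Independent features: drop the target column
X = df.drop('label', axis=1)
# Dependent feature
y = df['label']
print(X.shape, y.shape)
```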
Here we are using TensorFlow 2.0; let's import the modules.
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
Before building the embedding layer, we have to set the vocabulary size.
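For example (5000 is an assumed value; the article does not state the exact size used):

```python
# Vocabulary size used by one_hot() below to index words
vo_size = 5000
```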
Now, take a working copy of the X and y values.
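The copying code is not shown in the text; I assume it copies X into a messages frame that the preprocessing loop below iterates over, with the index reset so positional lookups stay valid after rows were dropped. A toy X stands in for the real features:

```python
import pandas as pd

# toy X standing in for the independent features
X = pd.DataFrame({'title': ['First headline', 'Second headline']})

# Work on a copy so the original frame is untouched; resetting the
# index keeps messages['title'][i] valid after dropna() removed rows
messages = X.copy()
messages.reset_index(inplace=True)
print(len(messages))  # 2
```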
Now, import the corpus for stopwords
import re
import nltk
from nltk.corpus import stopwords  # a corpus is a collection of text
from nltk.stem.porter import PorterStemmer
nltk.download('stopwords')
ps = PorterStemmer()
corpus = []
for i in range(0, len(messages)):
    review = re.sub('[^a-zA-Z]', ' ', messages['title'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stopwords.words('english')]
    review = ' '.join(review)
    corpus.append(review)
Some examples from the corpus:
Now, we can see the words are clean.
onehot_rep = [one_hot(words, vo_size) for words in corpus]
We can see how sentences are converted into numbers.
Now let's build the embedding layer, with some padding so that every sequence has the same length and the matrix is balanced.
sent_length = 20
embedded_doc=pad_sequences(onehot_rep, padding='pre', maxlen=sent_length)
For example, you can see how zeros are added in the matrix.
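As a quick self-contained check of how pre-padding works on two short sequences:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

demo = [[5, 3], [7, 2, 9]]
padded = pad_sequences(demo, padding='pre', maxlen=4)
print(padded)
# [[0 0 5 3]
#  [0 7 2 9]]
```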
embedding_vector_feature = 10
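The model definition itself is not shown in the text; below is a sketch of a plausible architecture for this setup. The 100 LSTM units are an assumption, and vo_size repeats the assumed vocabulary size from earlier so the block is self-contained.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vo_size = 5000                 # assumed vocabulary size set earlier
embedding_vector_feature = 10  # embedding dimension

model = Sequential()
model.add(Embedding(vo_size, embedding_vector_feature))  # word-embedding layer
model.add(LSTM(100))                                     # 100 units is an assumption
model.add(Dense(1, activation='sigmoid'))                # binary real/fake output
```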
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
We can see the model summary with model.summary(); now let's train the model.
Splitting data into train and test
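X_final and y_final are not defined in the snippets above; I assume they are the padded sequences and the labels converted to NumPy arrays, as sketched here with toy stand-ins:

```python
import numpy as np

# toy stand-ins for embedded_doc (padded sequences) and y (labels)
embedded_doc = [[0, 0, 5, 3], [0, 7, 2, 9]]
y = [1, 0]

X_final = np.array(embedded_doc)
y_final = np.array(y)
print(X_final.shape, y_final.shape)  # (2, 4) (2,)
```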
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.30, random_state=41)
Here I set the test size to 30% for model testing.
model.fit(X_train,y_train, validation_data=(X_test,y_test), epochs=10, batch_size=64)
Now that the model is trained, let's check its accuracy using performance metrics.
from sklearn.metrics import confusion_matrix
Let’s check the confusion matrix
from sklearn.metrics import accuracy_score
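The sigmoid output has to be thresholded before computing these metrics; a 0.5 cutoff is an assumption. A toy example of the two metrics (the commented line shows how the real predictions would be produced):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# In the article's pipeline this would be:
# y_pred = (model.predict(X_test) > 0.5).astype('int32')
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))
# [[2 0]
#  [1 2]]
print(accuracy_score(y_true, y_pred))  # 0.8
```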
Here we got about 90 percent accuracy, which is really good for predicting fake news. You can change some parameters or apply the dropout technique to improve it further. And that is how we used the LSTM and word-embedding techniques here.