Quora Insincere Questions Classification (Attention Model with LSTM)


Introduction & Business View:

An existential problem for any major website today is how to handle toxic and divisive content. Quora is a platform to gain and share knowledge, where you can ask any question and get answers from people with unique insights. At the same time, it is important to handle toxic content so that users feel safe sharing their knowledge.

Quora has launched a Kaggle challenge to tackle the problem of toxic content by removing insincere questions: those founded upon false premises, or that intend to make a statement rather than look for helpful answers.

In this blog, I will develop models that identify and flag insincere questions, helping Quora uphold its policy of “Be Nice, Be Respectful” and continue to be a place for sharing and growing the world’s knowledge.

Problem statement:

Build a model for predicting whether a question asked on Quora is sincere or not.

Evaluation Metrics:

  • The metric is the F1 score between the predicted and the observed targets. There are just two classes, but the positive class makes up just over 6% of the total, so the target is highly imbalanced. A metric such as F1 is appropriate for this kind of problem because it considers both precision and recall (see the formula after this list).

Fig-1: F1 score

  • The ROC curve is not a good visual illustration for highly imbalanced data, because the False Positive Rate (False Positives / Total Real Negatives) does not drop drastically when the number of Total Real Negatives is huge.
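For reference, the F1 score shown in Fig-1 is the harmonic mean of precision and recall:

F1 = 2 · (precision · recall) / (precision + recall)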

Exploratory Data Analysis:

Overview of the data:

Quora provided a good amount of training and test data to identify insincere questions. The train data consists of about 1.3 million rows and 3 columns.

Data fields:-

  • qid — unique question identifier
  • question_text — Quora question text
  • target — a question labeled “insincere” has a value of 1, otherwise 0

First, let us import the necessary libraries.

import os
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

  • Next, we load the data from CSV files into a pandas dataframe:
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
print(train.shape,test.shape)
train.head()

Let’s check whether the data is imbalanced or not:

train["target"].value_counts()
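To quantify the imbalance as a percentage (a quick check on the train dataframe loaded above), which should come out to roughly the 6% mentioned earlier:

print("Insincere share: {:.2%}".format(train["target"].mean()))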

Exploring the Questions:

  • Let’s print the total number of null values in each column
train.isnull().sum()

We can see there are no null values in the train data.

  • Let’s check for duplicate rows
train.duplicated(subset=["question_text", "qid", "target"]).value_counts()

There are no duplicate rows in the train data.

WordCloud representation:
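A minimal sketch of how such a word cloud can be generated with the wordcloud package (the package choice and styling parameters are illustrative, not necessarily what the original notebook used):

from wordcloud import WordCloud

# Word cloud of the insincere questions (target == 1)
insincere_text = " ".join(train[train["target"] == 1]["question_text"])
wc = WordCloud(width=800, height=400, background_color="white").generate(insincere_text)
plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()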

Data Pre-processing:

To improve the scores, the text data needs to be cleaned using the pre-processing steps below (a sketch for the tokenization and lemmatization steps follows the code):

  1. Html tag removal
  2. Punctuation removal
  3. Tokenization
  4. Lemmatization
  5. Contraction mapping
import re
import string

# Contraction replacement patterns
cont_patterns = [
    (b'(W|w)on\'t', b'will not'),
    (b'(C|c)an\'t', b'can not'),
    (b'(I|i)\'m', b'i am'),
    (b'(A|a)in\'t', b'is not'),
    (b'(\w+)\'ll', b'\g<1> will'),
    (b'(\w+)n\'t', b'\g<1> not'),
    (b'(\w+)\'ve', b'\g<1> have'),
    (b'(\w+)\'s', b'\g<1> is'),
    (b'(\w+)\'re', b'\g<1> are'),
    (b'(\w+)\'d', b'\g<1> would'),
]
patterns = [(re.compile(regex), repl) for (regex, repl) in cont_patterns]

def cleanHtml(sentence):
    # Strip HTML tags
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', str(sentence))
    return cleantext

def prepare_for_char_n_gram(text):
    # Lower-case, expand contractions, drop punctuation/digits and wrap tokens in '#' markers
    clean = bytes(text.lower(), encoding="utf-8")
    clean = clean.replace(b"\n", b" ")
    clean = clean.replace(b"\t", b" ")
    clean = clean.replace(b"\b", b" ")
    clean = clean.replace(b"\r", b" ")
    # Expand contractions
    for (pattern, repl) in patterns:
        clean = re.sub(pattern, repl, clean)
    # Remove punctuation and digits
    exclude = re.compile(b'[%s]' % re.escape(bytes(string.punctuation, encoding='utf-8')))
    clean = b" ".join([exclude.sub(b'', token) for token in clean.split()])
    clean = re.sub(b"\d+", b" ", clean)
    # Collapse whitespace
    clean = re.sub(b'\s+', b' ', clean)
    clean = re.sub(b'\s+$', b'', clean)
    # Wrap every token (and space) in '#' markers for character n-grams
    clean = re.sub(b"([a-z]+)", b"#\g<1>#", clean)
    clean = re.sub(b" ", b"# #", clean)  # replace space
    clean = b"#" + clean + b"#"          # add leading and trailing #
    return str(clean, 'utf-8')

# Apply the HTML cleaning to the question text
train['question_text'] = train['question_text'].apply(cleanHtml)
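Steps 3 and 4 above (tokenization and lemmatization) are not covered by the snippet; a minimal sketch using NLTK (the library choice is mine, and the downloads are only needed once):

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def tokenize_and_lemmatize(sentence):
    # Tokenize, lower-case and lemmatize each token
    tokens = word_tokenize(str(sentence).lower())
    return " ".join(lemmatizer.lemmatize(tok) for tok in tokens)

# Hypothetical usage on the cleaned question text
train['question_text'] = train['question_text'].apply(tokenize_and_lemmatize)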

Logistic regression as a baseline model:

The most popular algorithms often work on a simple principle, and logistic regression is good proof of that.

The logistic model is used to model the probability of a certain class or event, such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc.

While working on natural language processing problems, logistic regression works well for fast, accurate and reliable solutions. Logistic regression is a supervised classification algorithm used to predict the probability of a target variable. The target or dependent variable is dichotomous, which means there are only two possible classes. Mathematically, a logistic regression model predicts P(Y=1) as a function of X.

Logistic regression:

Logistic regression calculates the probability of a particular data point belonging to either of those classes, given the values of x and w. The use of the exponential in the sigmoid function is justified because a probability is always greater than zero, and the property of exponents takes care of this aspect.
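Concretely, the model passes a weighted sum of the inputs through the sigmoid function, which squashes any real number into the (0, 1) range:

P(Y=1 | x) = sigmoid(w·x + b) = 1 / (1 + e^-(w·x + b))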

Train-CV-Test splitting:

from sklearn.model_selection import train_test_split

# Assuming X holds the question text and y the target column of the train dataframe
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train.shape, y_train.shape, X_val.shape, y_val.shape

We can’t apply the logistic regression algorithm directly to the train data, because the algorithm cannot understand raw text. So we convert the text to vectors before applying any machine learning algorithm.

from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_vect = TfidfVectorizer()
# Fit TF-IDF on the train split, then only transform the CV and test sets
reviews_tfidf = tf_idf_vect.fit_transform(x_train['question_text'].values)
reviews_tfidf1 = tf_idf_vect.transform(x_cv['question_text'].values)
reviews_tfidf2 = tf_idf_vect.transform(test['question_text'].values)

Baseline model:-

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn import model_selection, preprocessing, metrics, ensemble, naive_bayes, linear_model

# TF-IDF vectorizer + logistic regression in a single pipeline
lr = linear_model.LogisticRegression()
tfvec = TfidfVectorizer(stop_words='english', lowercase=False)
pipe_lr = Pipeline([
    ('vectorizer', tfvec),
    ('lr', lr)
])
pipe_lr.fit(X_train, y_train)
y_pred_lr = pipe_lr.predict(X_val)

# Confusion matrix and classification report on the validation split
cm_lr = metrics.confusion_matrix(y_val, y_pred_lr)
ax = plt.gca()
sns.heatmap(cm_lr, cmap='Blues', cbar=False, annot=True, xticklabels=y_val.unique(), yticklabels=y_val.unique(), ax=ax)
ax.set_xlabel('y_pred')
ax.set_ylabel('y_true')
ax.set_title('Confusion Matrix')

cr = metrics.classification_report(y_val, y_pred_lr)
print(cr)

Feature engineering:

It might be obvious that the model learns patterns and interactions while it trains. So, where does feature engineering come into the picture? There is also a line beyond which we should leave some interactions for the model to find out, rather than hand-crafting every feature. The need for feature engineering:

  1. Helps the model to catch those exceptions when you engineer a significant interaction
  2. The model converges faster if you happen to find a good set of features
  3. When you have a new source of information there is a chance to make a better model.

Adding new features:

Adding new features that help the model separate the two classes plays a major role in improving the score of machine learning models.
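As an illustration, a few meta-features that are commonly added for this competition (the exact set below is my choice, not necessarily the one used in the original notebook; string and np are the modules imported earlier):

# Simple hand-crafted statistics derived from the question text
for df in [train, test]:
    df['num_words'] = df['question_text'].apply(lambda x: len(str(x).split()))
    df['num_unique_words'] = df['question_text'].apply(lambda x: len(set(str(x).split())))
    df['num_chars'] = df['question_text'].apply(lambda x: len(str(x)))
    df['num_capitals'] = df['question_text'].apply(lambda x: sum(c.isupper() for c in str(x)))
    df['num_punctuation'] = df['question_text'].apply(lambda x: sum(c in string.punctuation for c in str(x)))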

Bi-directional LSTM model:-

Bi-LSTM (Bidirectional Long Short-Term Memory):
Bidirectional recurrent neural networks (RNNs) are really just two independent RNNs put together. This structure allows the network to have both backward and forward information about the sequence at every time step.

A bidirectional layer runs your inputs in two ways, one from past to future and one from future to past. What differentiates this approach from a unidirectional one is that the LSTM running backwards preserves information from the future, so by combining the two hidden states you are able, at any point in time, to preserve information from both past and future.
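The Keras model below assumes a few objects prepared beforehand (maxlen, max_features, embed_size and embedding_matrix). A minimal sketch of that preparation, assuming pretrained GloVe vectors (the sizes and file path are illustrative):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

max_features = 50000   # vocabulary size (illustrative)
maxlen = 100           # maximum question length in tokens (illustrative)
embed_size = 300       # embedding dimension of the pretrained vectors

tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(train['question_text'].values)
train_X = pad_sequences(tokenizer.texts_to_sequences(train['question_text'].values), maxlen=maxlen)

# Build the embedding matrix from the pretrained vectors (file path is illustrative)
embeddings_index = {}
with open('glove.840B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.rstrip().split(' ')
        word = ' '.join(values[:-embed_size])
        embeddings_index[word] = np.asarray(values[-embed_size:], dtype='float32')

embedding_matrix = np.zeros((max_features, embed_size))
for word, i in tokenizer.word_index.items():
    if i < max_features and word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]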

from keras.models import Model
from keras.layers import Input, Embedding, Bidirectional, CuDNNGRU, Dense, Dropout
from keras.layers import GlobalAveragePooling1D, GlobalMaxPooling1D, concatenate

inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
avg_pool = GlobalAveragePooling1D()(x)
max_pool = GlobalMaxPooling1D()(x)
conc = concatenate([avg_pool, max_pool])
conc = Dense(64, activation="relu")(conc)
conc = Dropout(0.1)(conc)
outp = Dense(1, activation="sigmoid")(conc)
model = Model(inputs=inp, outputs=outp)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Train on 1175509 samples, validate on 130613 samples
Epoch 1/3
1175509/1175509 [==============================] - 562s 478us/step - loss: 0.1335 - acc: 0.9510 - val_loss: 0.1027 - val_acc: 0.9593
Epoch 2/3
1175509/1175509 [==============================] - 560s 476us/step - loss: 0.1023 - acc: 0.9594 - val_loss: 0.0977 - val_acc: 0.9612
Epoch 3/3
1175509/1175509 [==============================] - 561s 477us/step - loss: 0.0951 - acc: 0.9619 - val_loss: 0.0958 - val_acc: 0.9617

Attention Layer:

“Attention” is defined as the “active direction of the mind to an object.” Attention is about choice — it is choice. Neural networks make choices about which features they pay attention to. Attention is both the currency of Silicon Valley, and the currency of AI.

At any given moment, our minds concentrate on a subset of the total information available to them. It is in a sense, the mind’s capital, the chief resource it can allocate and spend. Algorithms can also allocate attention, and they can learn how to do it better, by adjusting the weights they assign to various inputs. Attention is used for machine translation, speech recognition, reasoning, image captioning, summarization, and the visual identification of objects.

Translation often requires arbitrary input and output lengths. To deal with this, the encoder-decoder model is adopted, the basic RNN cell is changed to a GRU or LSTM cell, and the hyperbolic tangent activation is replaced by ReLU. We use GRU and LSTM layers here.

In neural networks, attention is a memory-access mechanism.

Fig-14: Attention Mechanism

Let’s say you are trying to generate a caption from an image. Each input could be part of an image fed into the attention model. The memory layer would feed in the words already generated, the context for future word predictions. The attention model would help the algorithm decide which parts of the image to focus on as it generated each new word (it would decide on the thickness of the lines), and those assignments of importance would be fed into a final decision layer that would generate a new word.

import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, feature_dim, step_dim, bias=True, **kwargs):
        super(Attention, self).__init__(**kwargs)

        self.supports_masking = True
        self.bias = bias
        self.feature_dim = feature_dim
        self.step_dim = step_dim
        self.features_dim = 0

        # Learnable weight vector that scores every feature dimension
        weight = torch.zeros(feature_dim, 1)
        nn.init.xavier_uniform_(weight)
        self.weight = nn.Parameter(weight)

        if bias:
            self.b = nn.Parameter(torch.zeros(step_dim))
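The constructor above only sets up the parameters; a minimal sketch of the forward pass that typically accompanies this layer (my reconstruction of the missing part, not necessarily the author's exact code), to be indented as a method of the Attention class:

def forward(self, x, mask=None):
    # x has shape (batch, step_dim, feature_dim)
    eij = torch.mm(
        x.contiguous().view(-1, self.feature_dim), self.weight
    ).view(-1, self.step_dim)
    if self.bias:
        eij = eij + self.b
    eij = torch.tanh(eij)
    # Softmax-style normalization of the attention scores over the time steps
    a = torch.exp(eij)
    if mask is not None:
        a = a * mask
    a = a / (torch.sum(a, 1, keepdim=True) + 1e-10)
    # Weighted sum over the time dimension
    weighted_input = x * torch.unsqueeze(a, -1)
    return torch.sum(weighted_input, 1)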

Training the model:
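A minimal sketch of a PyTorch training loop for the LSTM + Attention network, assuming model is such a network and train_loader is a DataLoader yielding (batch_x, batch_y) tensors (both names are hypothetical):

import torch.optim as optim

loss_fn = nn.BCEWithLogitsLoss()          # expects raw logits from the final layer
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(3):
    model.train()
    total_loss = 0.0
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        preds = model(batch_x)
        loss = loss_fn(preds.view(-1), batch_y.float())
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print("Epoch {}: train loss {:.4f}".format(epoch + 1, total_loss / len(train_loader)))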

Let's try a CNN model:-

CNN is also computationally efficient. It uses special convolution and pooling operations and performs parameter sharing. This enables CNN models to run on any device, making them universally attractive.

All in all, this sounds like pure magic: we are dealing with a very powerful and efficient model which performs automatic feature extraction. A CNN is trained the same way as an ANN, with backpropagation and gradient descent. Due to the convolution operation it is more mathematically involved, and it is out of the scope of this article. If you’re interested in the details, refer here.

from keras.models import Model
from keras.layers import Input, Embedding, Dense, Conv2D, MaxPool2D
from keras.layers import Reshape, Flatten, Concatenate, Dropout, SpatialDropout1D

S_DROPOUT = 0.4   # spatial dropout rate (assumed value)
DROPOUT = 0.1     # dense-layer dropout rate (assumed value)

filter_sizes = [1, 2, 3, 5]
num_filters = 36
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size * 4, weights=[embedding_matrix])(inp)
x = SpatialDropout1D(S_DROPOUT)(x)
# Treat the padded, embedded question as a 2D "image" and apply convolutions of several heights
x = Reshape((maxlen, embed_size * 4, 1))(x)
maxpool_pool = []
for i in range(len(filter_sizes)):
    conv = Conv2D(num_filters, kernel_size=(filter_sizes[i], embed_size * 4),
                  kernel_initializer='he_normal', activation='elu')(x)
    maxpool_pool.append(MaxPool2D(pool_size=(maxlen - filter_sizes[i] + 1, 1))(conv))
z = Concatenate(axis=1)(maxpool_pool)
z = Flatten()(z)
z = Dropout(DROPOUT)(z)
outp = Dense(1, activation="sigmoid")(z)
model = Model(inputs=inp, outputs=outp)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Predicted probabilities from the CNN model on the validation and test sets
pred_val_y = pred_val_cnn_y
pred_test_y = pred_test_cnn_y

# Search for the probability threshold that maximises the F1 score on the validation set
thresholds = []
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    res = metrics.f1_score(val_y, (pred_val_y > thresh).astype(int))
    thresholds.append([thresh, res])
    print("F1 score at threshold {0} is {1}".format(thresh, res))

thresholds.sort(key=lambda x: x[1], reverse=True)
best_thresh = thresholds[0][0]
print("Best threshold: ", best_thresh)

As you can see, among the models tried (logistic regression, CNN and the recurrent networks), the deep learning model using an Attention layer with LSTM layers gave the highest F1 score of 0.683.

References:

https://www.appliedaicourse.com