Correct/Incorrect Answer of Question with Machine Learning

Source: Deep Learning on Medium

The figures above show that almost none of the hand-crafted features separates the classes well, because the distributions of the two classes overlap almost completely. They are therefore unlikely to play an important role in classification, but I’m going to keep them when training my models.

Multi-Variate Analysis

Visualizing high-dimensional data is very hard. One way to do it is to embed the high-dimensional data into a low-dimensional space, where it can be plotted easily. I’m going to embed these 13-dimensional features into 2 dimensions using t-SNE.

NOTE: t-SNE is a probabilistic algorithm and its result depends on its hyperparameters, so I’ll run it for various values of perplexity and number of iterations (the ‘perplexity’ and ‘n_iter’ parameters of the algorithm).

t-SNE is very expensive to run, so I’ll pick 5,000 data points and look at the result.

Code to embed and visualize the results:

The t-SNE plot of the hand-crafted features

I’ve taken only 5k data points to visualize the hand-crafted features, but the plots above suggest that these features are not going to play a very important role in classification.

Data Splitting

I’m going to split the data into train, CV, and test sets in an 80%–10%–10% ratio, using stratified sampling.

from sklearn.model_selection import train_test_split
def split_data(df):
    """Split the data into 80% train, 10% CV, and 10% test (stratified on the label)."""
    X = df[[col for col in df.columns if col != "label"]]
    Y = df[["label"]]
    # first split: 80% for training, 20% held out
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, Y, train_size=0.8, stratify=Y, random_state=42)
    # second split: halve the held-out 20% into test and CV
    X_test, X_cv, y_test, y_cv = train_test_split(
        X_rest, y_rest, train_size=0.5, stratify=y_rest, random_state=42)
    return X_train, y_train, X_cv, y_cv, X_test, y_test
X_train, y_train, X_cv, y_cv, X_test, y_test = split_data(df)
print("Number of datapoints in train data: {:,}\n"
      "Number of datapoints in CV data: {:,}\n"
      "Number of datapoints in test data: {:,}".format(
          X_train.shape[0], X_cv.shape[0], X_test.shape[0]))
--------------------------------------------------------------
Number of datapoints in train data: 4,136,297
Number of datapoints in CV data: 517,038
Number of datapoints in test data: 517,037
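As a quick sanity check, the three counts above can be reproduced with plain arithmetic. This is only a sketch: the helper name split_sizes is hypothetical, and it assumes that each fractional size is rounded down and the remainder goes to the other part, which is how I understand train_test_split to behave when only train_size is given.

```python
import math

def split_sizes(n_total, train_frac=0.8, test_frac_of_rest=0.5):
    """Return (n_train, n_cv, n_test) for a two-step 80-10-10 split."""
    # Assumption: the requested fraction is rounded down, the rest is left over.
    n_train = math.floor(train_frac * n_total)
    n_rest = n_total - n_train
    # The held-out part is halved; the first half is test, the remainder is CV.
    n_test = math.floor(test_frac_of_rest * n_rest)
    n_cv = n_rest - n_test
    return n_train, n_cv, n_test

# Total number of rows, taken from the three counts printed above.
n_total = 4_136_297 + 517_038 + 517_037
print(split_sizes(n_total))
```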

Stability of question and answer in train/CV/test dataset after splitting

In this section, let’s find out whether the vocabulary of the questions and answers is stable across the train, CV, and test datasets. To check this, I’ll compute what percentage of the words in the CV/test questions and answers is also present in the training dataset.

The code below creates the sets of words in the questions and answers of the train, CV, and test datasets.

def return_set_words(data_frame, col_name):
    """Return the set of unique whitespace-separated words in a column."""
    words = set()
    for sen in data_frame[col_name].values:
        # split row by row, so that the last word of one row is not
        # glued to the first word of the next one
        words.update(str(sen).split())
    return words
# for questions
train_question_words_set = return_set_words(X_train, "question")
cv_question_words_set = return_set_words(X_cv, "question")
test_question_words_set = return_set_words(X_test, "question")
# for answers
train_answer_words_set = return_set_words(X_train, "answer")
cv_answer_words_set = return_set_words(X_cv, "answer")
test_answer_words_set = return_set_words(X_test, "answer")

51.73% of the words in the CV questions and 51.72% of the words in the test questions are present in the training dataset. For answers, the figures are 48.95% for the CV dataset and 48.86% for the test dataset. Let’s visualize this overlap using Venn diagrams.
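A percentage like the ones above can be computed directly from two word sets. The sketch below is an illustration, not the exact code used for the reported numbers: the helper name percent_in_train and the tiny demo sets are made up.

```python
def percent_in_train(train_words, other_words):
    """Percentage of the words in `other_words` that also occur in `train_words`."""
    if not other_words:
        return 0.0
    shared = other_words & train_words  # set intersection
    return 100.0 * len(shared) / len(other_words)

# Toy vocabularies standing in for the real train/CV word sets.
train_demo = {"what", "is", "the", "capital", "of", "france"}
cv_demo = {"what", "is", "the", "largest", "planet"}

print("{:.2f}% of CV words are in train".format(percent_in_train(train_demo, cv_demo)))
```

With the real data, the same call would take train_question_words_set and cv_question_words_set as arguments.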

For questions

# https://pypi.org/project/matplotlib-venn/
import matplotlib.pyplot as plt
from matplotlib_venn import venn3, venn2
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 7))
out1 = venn2([train_question_words_set, cv_question_words_set], set_labels=('Train', 'CV'), ax=ax1)
out2 = venn2([train_question_words_set, test_question_words_set], set_labels=('Train', 'Test'), ax=ax2)
out3 = venn3([train_question_words_set, cv_question_words_set, test_question_words_set], set_labels=('Train', 'CV', 'Test'), ax=ax3)
# to change the font size: https://stackoverflow.com/a/29426251/12005970
for out in [out1, out2, out3]:
    for text in out.set_labels:
        text.set_fontsize(14)
    for text in out.subset_labels:
        if text is not None:  # a region may have no label
            text.set_fontsize(12)
plt.show()

For answers

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 7))
out1 = venn2([train_answer_words_set, cv_answer_words_set], set_labels=('Train', 'CV'), ax=ax1)
out2 = venn2([train_answer_words_set, test_answer_words_set], set_labels=('Train', 'Test'), ax=ax2)
out3 = venn3([train_answer_words_set, cv_answer_words_set, test_answer_words_set], set_labels=('Train', 'CV', 'Test'), ax=ax3)
for out in [out1, out2, out3]:
    for text in out.set_labels:
        text.set_fontsize(14)
    for text in out.subset_labels:
        if text is not None:  # a region may have no label
            text.set_fontsize(12)
plt.show()

Machine Learning Models

Vectorization

After some experiments, I decided to vectorize the text data using BoW with bi-grams, because it performed much better than the TF-IDF vectorizer. After vectorizing the text data, I concatenated it with the hand-crafted features and fed the result into the models. I did not use the pre-trained W2V model to vectorize the text for the classical ML models, but I did use it for the embedding layer in the deep learning models.
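To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words representation with bi-grams, followed by the concatenation with extra features. It is not the vectorizer used in the notebook: the function names, the toy texts, the toy feature values, and the choice to count uni-grams together with bi-grams are all assumptions made for illustration.

```python
from collections import Counter

def ngram_counts(text, n_max=2):
    """Count all n-grams of length 1..n_max in a whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def bow_vectorize(texts, n_max=2):
    """Return (vocabulary, count rows) with one row of counts per text."""
    per_text = [ngram_counts(t, n_max) for t in texts]
    vocab = sorted(set().union(*per_text))
    rows = [[counts[term] for term in vocab] for counts in per_text]
    return vocab, rows

texts = ["the sky is blue", "the sky is not green"]
hand_crafted = [[4.0, 0.25], [5.0, 0.5]]  # toy stand-ins for the designed features

vocab, rows = bow_vectorize(texts)
# Concatenate the BoW counts with the hand-crafted features, row by row.
combined = [bow + extra for bow, extra in zip(rows, hand_crafted)]
print(len(vocab), len(combined[0]))
```

On the real dataset a sparse representation is needed, since the bi-gram vocabulary becomes far too large for dense lists like these.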

Classical ML Models

This dataset is quite big, so I randomly sampled 1 million data points and then did the data splitting. Using this sample, I tuned and trained logistic regression, linear SVM, decision tree, random forest, and GBDT models, but none of them gave me good performance. Of all these models, GBDT gave the best AUC on the test data: 0.6152. If you want to know how I tuned and trained these models, go to my notebook.

NOTE: Only a sample of the whole dataset is used to train the models.

Classical ML did not give me good performance, so I trained various deep learning models.

Deep Learning Models

In deep learning, too, I trained various architectures. I’ll briefly describe the DL models that did not work very well. First of all, let’s define the auc function, which uses sklearn’s roc_auc_score to calculate the AUC value. (Note that tf.py_func is a TensorFlow 1.x API.)

import tensorflow as tf
from sklearn.metrics import roc_auc_score
def auc(y_true, y_pred):
    return tf.py_func(roc_auc_score, (y_true, y_pred), tf.double)
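For intuition about what this metric measures: the ROC AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it that way in pure Python. It is an illustration only (the function name roc_auc is made up, and the quadratic loop is far too slow for the real dataset), not a replacement for roc_auc_score.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))

# Three of the four (positive, negative) pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking, which puts the values reported in this post (around 0.6 to 0.7) into perspective.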

MLP

In an MLP, each neuron (unit) is connected to every output of the previous layer. I used dropout and batch normalization for regularization, but the performance of this model is the same as that of the best classical ML model. The model has approximately 632k trainable parameters.

So far, I have used neither the sequence information of the text data nor word embeddings to vectorize the words.

Word Embedding

For word embedding, I’m using pre-trained word vectors that were trained on Wikipedia data. Note that the file loaded below (glove.6B.200d.txt) contains GloVe vectors rather than word2vec ones, although I refer to it as the W2V model throughout this post. It represents each word as a 200-dimensional vector. There are more W2V models on this site and the model that I used can be downloaded from here. After downloading, let’s load the model into a dictionary whose keys are the words and whose values are the 200-dimensional vectors. This model has 400k tokens (words).

file_name = "glove.6B.200d.txt"
print("Loading the W2V model...")
w2v_loaded_dict = {}
# each line holds a word followed by the components of its vector
with open(file_name, 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = [float(i) for i in values[1:]]
        w2v_loaded_dict[word] = vector
# to get all words of W2V model
glove_words = w2v_loaded_dict.keys()

The above code is inspired by machinelearningmastery.com.

Before training an embedding-based model, we need to convert the text data into sequences of integers. Each integer in a sequence corresponds to a word, and we’ll use this word index to create an embedding matrix. I’ll use the Keras Tokenizer to convert the text into sequences.

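from keras.preprocessing.sequence import pad_sequences
def text_to_seq(texts, keras_tokenizer, max_len):
    """Return the texts as integer sequences, padded/truncated to max_len."""
    # map each word to its integer index, then pad or cut at the end ('post')
    x = pad_sequences(keras_tokenizer.texts_to_sequences(texts),
                      maxlen=max_len, padding='post', truncating='post')
    return x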

The maximum lengths of the questions and answers in the training data are 27 and 223 respectively, but I’ll use 50 and 250 as the maximum lengths of the question and answer sequences.
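To illustrate what 'post' padding and truncation do to a single sequence, here is a small pure-Python sketch. The helper pad_post is hypothetical and only mimics the behaviour described above for one sequence; the actual pipeline relies on pad_sequences.

```python
def pad_post(seq, max_len, pad_value=0):
    """Cut a sequence at the end, or fill it at the end, to exactly max_len items."""
    seq = list(seq)[:max_len]                        # 'post' truncation drops the tail
    return seq + [pad_value] * (max_len - len(seq))  # 'post' padding appends zeros

print(pad_post([5, 3, 8], 5))           # [5, 3, 8, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6], 4))  # [1, 2, 3, 4]
```

With maximum lengths of 50 and 250, no training question or answer is truncated, since the longest ones have 27 and 223 words.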

Code to get the embedded matrix and text converted into sequence:
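As a rough idea of how such an embedding matrix can be built, here is a minimal sketch. It is not the notebook's code: build_embedding_matrix, the toy word index, and the toy 2-dimensional vectors are assumptions for illustration. It relies on the fact that the Keras Tokenizer numbers words starting from 1, which leaves index 0 free for padding.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Return a (len(word_index) + 1) x dim matrix as a list of rows.

    Row i holds the pre-trained vector of the word whose index is i.
    Row 0 (padding) and the rows of unknown words stay all-zero.
    """
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[idx] = list(vec)
    return matrix

# Toy stand-ins for tokenizer.word_index and the loaded word vectors.
word_index = {"the": 1, "cat": 2, "zzz": 3}
vectors = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}

matrix = build_embedding_matrix(word_index, vectors, dim=2)
print(matrix)
```

In practice the rows would be 200-dimensional and the result would be converted to a NumPy array before being passed as the initial weights of an embedding layer.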

I’ll also use my hand-crafted features to train the model. The code below returns the standardized hand-crafted features.

import pickle
from sklearn.preprocessing import StandardScaler
def return_hand_crafted():
    """
    Return the standardized hand-crafted features for train, CV, and test,
    and save the fitted scaler to disk.
    """
    # skip the first two columns and keep the remaining (hand-crafted) ones
    train_hand_crafted = X_train[X_train.columns[2:]]
    cv_hand_crafted = X_cv[X_cv.columns[2:]]
    test_hand_crafted = X_test[X_test.columns[2:]]
    std = StandardScaler()
    # fit on the training data only, then apply the same scaling to CV and test
    train_hand_crafted = std.fit_transform(train_hand_crafted)
    cv_hand_crafted = std.transform(cv_hand_crafted)
    test_hand_crafted = std.transform(test_hand_crafted)
    with open("Models/std.pkl", "wb") as f:
        pickle.dump(std, f)
    return train_hand_crafted, cv_hand_crafted, test_hand_crafted
# let's create the vectors
train_hand_crafted, cv_hand_crafted, test_hand_crafted = return_hand_crafted()
encoded_que_train, encoded_que_cv, encoded_que_test, encoded_ans_train, encoded_ans_cv, encoded_ans_test, embedding_matrix1 = return_sequnece_embed_matrix(tokens)
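The important detail in the code above is that the scaler is fitted on the training data only and then reused for CV and test. The sketch below shows the same idea for a single feature column in pure Python. The helper names and the toy numbers are made up, and it assumes the population standard deviation, which is what StandardScaler uses as far as I know.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and (population) standard deviation from training values."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    return mean, (std if std > 0 else 1.0)  # avoid dividing by zero for constant columns

def apply_scaler(column, mean, std):
    """Standardize values with statistics learned elsewhere (the training data)."""
    return [(x - mean) / std for x in column]

train_col = [2.0, 4.0, 6.0]
cv_col = [4.0, 8.0]

mean, std = fit_scaler(train_col)             # statistics come from train only
train_scaled = apply_scaler(train_col, mean, std)
cv_scaled = apply_scaler(cv_col, mean, std)   # CV reuses the train statistics
print(train_scaled, cv_scaled)
```

Fitting a separate scaler on the CV or test data would leak information about those sets into the preprocessing and make the evaluation less trustworthy.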

Now we’re ready to train embedding-based architectures.

Conv1D

Conv1D takes a three-dimensional input tensor of shape (batch_size, steps, channels). In our case, for the first convolution layer, the number of channels is the embedding dimension of each word, which is 200. This model performs well compared with the earlier ones: it gave an AUC of 0.66513 on the CV dataset and 0.7176 on the training dataset. This architecture has 184k trainable parameters.

LSTM Based Architecture

So far, I have not used the long-range sequence information (long-term dependencies) in our dataset. To exploit long-term dependencies, we need to use LSTM/GRU units in our architecture. This model performs better than all the models used above. It has 834k trainable parameters, and it gave an AUC of 0.68482 on the CV dataset and 0.7329 on the training dataset. After training for more epochs, however, the model starts to overfit. So I used the EarlyStopping and ModelCheckpoint callbacks of Keras to save the best model (the one that performs best on the CV dataset) and to stop the training if the loss on the CV dataset does not improve.
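The logic behind this combination of callbacks can be sketched without Keras. The function below is a simplified illustration, not the Keras implementation: the name simulate_early_stopping, the patience value, and the list of CV losses are made-up assumptions. It remembers the epoch with the lowest CV loss (what a checkpoint would save) and stops once the loss has failed to improve for a given number of epochs.

```python
def simulate_early_stopping(val_losses, patience=2):
    """Return (best_epoch, stopped_epoch) for a sequence of per-epoch CV losses."""
    best_loss = float("inf")
    best_epoch = None
    stopped_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        stopped_epoch = epoch
        if loss < best_loss:
            # Improvement: this is where the best model would be checkpointed.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # no improvement for `patience` epochs: stop training
    return best_epoch, stopped_epoch

# Hypothetical CV losses: the model improves, then starts to overfit.
losses = [0.70, 0.66, 0.64, 0.65, 0.67, 0.63]
print(simulate_early_stopping(losses))  # (2, 4)
```

In this toy run the training stops at epoch 4 and the model from epoch 2 is kept, so the later epoch with loss 0.63 is never reached. That is the usual trade-off when choosing the patience.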

NOTE: All the above models have been trained using 1 million data points only.

Bi-directional GRU model

In the above architecture, I used LSTM units and the model was overfitting. Now I want to use a bi-directional model. I’m going to use GRU units in place of LSTM units because a GRU has fewer parameters than an LSTM and is cheaper to train.
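The difference in parameter count follows from the number of gates: an LSTM layer has four weight blocks (three gates plus the cell candidate), while a GRU has three. The sketch below uses the textbook formulas; the function names and the layer size of 64 units are assumptions for illustration, and some implementations add extra bias terms, so the counts reported by a framework can differ slightly.

```python
def lstm_params(input_dim, units):
    """Textbook LSTM: 4 blocks, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units):
    """Textbook GRU: 3 blocks, each with input weights, recurrent weights, and a bias."""
    return 3 * (units * (input_dim + units) + units)

input_dim = 200  # embedding dimension used in this post
units = 64       # hypothetical layer size, not taken from the notebook

print("LSTM:", lstm_params(input_dim, units))              # 67840
print("GRU: ", gru_params(input_dim, units))               # 50880
print("Bi-directional GRU:", 2 * gru_params(input_dim, units))  # 101760
```

A GRU layer therefore needs three quarters of the parameters of an LSTM layer of the same size, while wrapping it in a bi-directional layer doubles the count because one copy reads the sequence forwards and another backwards.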

model = create_model(max_lenght_que, max_length_ans)