Google Quest Question and Answer

Original article was published by Varun Saproo on Deep Learning on Medium



In this article, we'll work hands-on through a Kaggle challenge, Google QUEST Q&A. Since the challenge is an NLP task, the solution blends a CNN-LSTM model with an XLNet transformer. This solution scores in the top 7% of Kaggle submissions.

Business Problem

Computers are good at answering questions with single, verifiable answers. For example, querying "Who is the Prime Minister of India?" on Google returns a perfect answer. When it comes to the subjective aspects of a question, however, humans do a much better job than computers. A few such subjective aspects include:

  1. Is the question understandable?
  2. Is the question conversational?
  3. Is the answer to the question understandable?
  4. and many more…….

The CrowdSource team at Google Research has collected data on a number of these subjective aspects for each question-answer pair. Crowdsource gathers feedback from you and from others around the world, which helps machines learn from accurate examples and improves Google services such as Maps and Translate.

The question-answer pairs were gathered from nearly 70 different websites. The raters received minimal guidance and training and relied largely on their own judgment; accordingly, each prompt was simplified so that raters could complete the task using common sense.

The task here is to build a predictive algorithm which would quantify these subjective aspects given a question-answer pair.

Evaluation Metric — The evaluation metric for this competition is the Spearman rank correlation coefficient. Spearman's rank correlation is computed for each target column, and the mean of these values gives the submission score.
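The metric above can be sketched in plain Python. This is a minimal version that assumes no tied values (in practice one would use scipy.stats.spearmanr, which handles ties):

```python
def spearman(x, y):
    """Spearman rank correlation of two sequences, assuming no tied values."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    # Spearman's rho is the Pearson correlation of the ranks
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

def submission_score(pred_columns, target_columns):
    """Competition score: mean Spearman over the 30 target columns."""
    pairs = list(zip(pred_columns, target_columns))
    return sum(spearman(p, t) for p, t in pairs) / len(pairs)

print(spearman([1, 2, 3, 4], [1, 3, 2, 4]))  # 0.8
```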

Use of ML/DL Models

This is an NLP-based problem, where question_title, question_body and answer are used as inputs to predict the subjective properties of a question-answer pair as outputs.

Recently, DL approaches have achieved state-of-the-art performance comparable to humans on tasks such as language translation and sentiment analysis.

Dataset

The Dataset is provided by the crowdsource team at Google and hosted at Kaggle. Train Dataset has 6079 data points while Public Test Dataset has 476 data points. The Private Test Data is not disclosed. The dataset contains three text based columns namely question_title, question_body and answer.

The data has 30 target variables. The values of these variables are to be predicted for a given question-answer pair. The scores for the model will be evaluated based on the Public and Private test data.

Following are the target variables in the dataset -:

A view on the values of the targets:

Each target is discrete in nature, with values lying in the interval [0, 1].

URL — Each question-answer pair has an associated URL. The dataset contains duplicate URLs, which means the same question can appear in several rows, each time paired with a different answer.
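This duplicate-URL structure matters when splitting the data: rows sharing a URL should stay in the same fold to avoid leakage. A small sketch with hypothetical rows (the URLs and texts below are made up for illustration):

```python
from collections import defaultdict

# Hypothetical rows: (url, question_title, answer)
rows = [
    ("https://stackoverflow.com/q/1", "How to sort a list?", "Use sorted()."),
    ("https://stackoverflow.com/q/1", "How to sort a list?", "Call .sort()."),
    ("https://stackoverflow.com/q/2", "What is a tuple?", "An immutable sequence."),
]

# Group answers under their question's URL
groups = defaultdict(list)
for url, title, answer in rows:
    groups[url].append(answer)

# 3 question-answer pairs, but only 2 unique questions; a train/validation
# split should therefore be made over URLs, not over raw rows.
print(len(rows), len(groups))  # 3 2
```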

Exploratory Data Analysis

EDA is used to understand and summarise the main characteristics of a dataset. Here, I have used EDA chiefly to check whether the train and test data distributions are similar.

Domain -:

Observations:

1). The distribution of domains differs between the train and test data.

2). A large number of examples come from the stackoverflow domain.

Categories -:

Total number of categories in the dataset : 5


Observation:

1). The distribution of categories differs between the train and test data.

Text Preprocessing — Before analysing text or fitting ML models on it, it is essential to preprocess the text data, because unprocessed text hides a lot of information. For example, if you simply tokenize the sentence "However, there were many challenges." on whitespace, the tokens will be ['However,', 'there', 'were', 'many', 'challenges.']. ['there', 'were', 'many'] are valid English words and will be in the vocabulary, but ['However,', 'challenges.'] will not be, even though the underlying words are valid English, because of the attached punctuation. Hence, we need to preprocess the text to minimize such situations and recover that information. All further EDA is done on the preprocessed text.

Below is the code for preprocessing text :

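The original snippet was embedded as an image, so here is a minimal sketch of the kind of cleaning described above (the contraction list and exact steps are assumptions, not the author's code):

```python
import re

# Assumed subset of contraction fixes; a real pipeline would use a fuller map
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "n't": " not"}

def preprocess(text):
    text = text.lower()
    for pattern, repl in CONTRACTIONS.items():
        text = text.replace(pattern, repl)
    # Separate punctuation from words so 'however,' tokenizes as ['however', ',']
    text = re.sub(r"([?.!,;:])", r" \1 ", text)
    # Collapse repeated whitespace left behind by the substitutions
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(preprocess("However, there were many challenges."))
# → "however , there were many challenges ."
```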

Sentiment Polarity

The plots above show the distributions of the polarity scores; the legend distinguishes the train and test data, and the name beneath each plot identifies the column whose polarity it shows.

Observations

1). Title polarity for both train and test data has high density near 0, meaning that most of the titles are neither positive nor negative.

2). The distributions for both train and test data are similar
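The article does not show how the polarity scores were computed; lexicon-based tools such as TextBlob are a common choice. A toy scorer with a made-up word list, purely to illustrate the idea of a score in [-1, 1]:

```python
# Toy lexicon-based polarity scorer (illustrative only; a real analysis
# would use a library such as TextBlob or VADER)
POSITIVE = {"good", "great", "helpful", "clear"}
NEGATIVE = {"bad", "wrong", "confusing", "broken"}

def polarity(text):
    """Fraction of positive minus negative tokens; result lies in [-1, 1]."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return score / len(tokens)

print(polarity("great clear answer"))   # positive, 2/3
print(polarity("how to sort a list"))   # neutral, 0.0
```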

Observations — The plots above show the word-count distributions for each column.

1). The distributions for both train and test data are similar.

2). Both question_body_count and answer_count resemble a power-law distribution.

Bi-Grams

Observation — The plots above show that the intersection of the top-N frequent bigrams in the train text with the top-N frequent bigrams in the test text is nearly as large as the union of the two sets; in other words, the most frequent bigrams overlap heavily between train and test.
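The overlap check above can be sketched as follows: collect the top-N bigrams of each corpus and compare intersection to union (a Jaccard ratio near 1 means the sets nearly coincide). The sample sentences are made up for illustration:

```python
from collections import Counter

def top_bigrams(texts, n):
    """Return the set of the n most frequent bigrams across the texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return {bigram for bigram, _ in counts.most_common(n)}

def overlap_ratio(train_texts, test_texts, n=100):
    """Jaccard ratio of top-n bigram sets; close to 1 means heavy overlap."""
    a, b = top_bigrams(train_texts, n), top_bigrams(test_texts, n)
    return len(a & b) / len(a | b) if a | b else 0.0

train = ["how to sort a list in python", "sort a list of tuples"]
test = ["how to sort a dict", "sort a list quickly"]
print(overlap_ratio(train, test, n=5))
```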

Feature Engineering

It is the process of developing features using domain knowledge, which helps in building better models. Below are the handcrafted features.

Token — Components of a text which you obtain by splitting up text on spaces.

Stopwords — A stopword is a commonly used word such as ‘a’, ‘an’, ‘the’ etc. Such words are typically removed from the text during preprocessing.

Word — A token which is not a stopword.

  • cwc_min = common_word_count / min(# of T1 words, # of T2 words)
  • cwc_max = common_word_count / max(# of T1 words, # of T2 words)
  • csc_min = common_stopwords_count / min(# of T1 stopwords, # of T2 stopwords)
  • csc_max = common_stopwords_count / max(# of T1 stopwords, # of T2 stopwords)
  • ctc_min = common_token_count / min(# of tokens in T1, # of tokens in T2)
  • ctc_max = common_token_count / max(# of tokens in T1, # of tokens in T2)
  • word_t1_to_t2_ratio = (# of words in T1) / (# of words in T2)
  • token_t1_to_t2_ratio = (# of tokens in T1) / (# of tokens in T2)
  • last_word_eq = boolean(T1[-1] == T2[-1])
  • first_word_eq = boolean(T1[0] == T2[0])
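The features above can be sketched directly from the definitions. The stopword set here is a small stand-in (a real pipeline would use e.g. NLTK's English stopword list), and the epsilon guard against empty texts is an assumption:

```python
# Stand-in stopword list; a real pipeline would use NLTK's English stopwords
STOPWORDS = {"a", "an", "the", "is", "in", "of", "to"}
EPS = 1e-6  # guard against division by zero on empty texts

def pair_features(t1, t2):
    tok1, tok2 = t1.lower().split(), t2.lower().split()
    w1 = {t for t in tok1 if t not in STOPWORDS}   # words = non-stopword tokens
    w2 = {t for t in tok2 if t not in STOPWORDS}
    s1 = {t for t in tok1 if t in STOPWORDS}
    s2 = {t for t in tok2 if t in STOPWORDS}
    common_words = len(w1 & w2)
    common_stops = len(s1 & s2)
    common_tokens = len(set(tok1) & set(tok2))
    return {
        "cwc_min": common_words / (min(len(w1), len(w2)) + EPS),
        "cwc_max": common_words / (max(len(w1), len(w2)) + EPS),
        "csc_min": common_stops / (min(len(s1), len(s2)) + EPS),
        "csc_max": common_stops / (max(len(s1), len(s2)) + EPS),
        "ctc_min": common_tokens / (min(len(tok1), len(tok2)) + EPS),
        "ctc_max": common_tokens / (max(len(tok1), len(tok2)) + EPS),
        "word_t1_to_t2_ratio": (len(w1) + EPS) / (len(w2) + EPS),
        "token_t1_to_t2_ratio": (len(tok1) + EPS) / (len(tok2) + EPS),
        "last_word_eq": int(tok1[-1] == tok2[-1]),
        "first_word_eq": int(tok1[0] == tok2[0]),
    }
```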

Fuzzywuzzy — These features take a pair of texts as input and return a score out of 100 as output. Internally the library uses the SequenceMatcher class to compute the percentage similarity between pairs of texts.

  • Fuzz ratio — It simply uses SequenceMatcher to compute the similarity between a pair of texts. SequenceMatcher outputs a similarity between 0 and 1; the fuzz ratio converts this decimal to a percentage, i.e. 0 to 100.
  • Partial Fuzz Ratio — If the pair of texts have different lengths (a smaller string of length m and a larger string of length n), this feature finds the best-matching m-length substring of the larger string and uses SequenceMatcher to score it against the smaller string.
  • Token Sort Ratio — In this approach, each text is tokenized, the tokens are sorted alphabetically and rejoined, and SequenceMatcher is then used to compute the ratio.
  • Token Set Ratio — The approach is as follows:
1) S1 = the set of tokens common to T1 and T2 (their intersection).
2) S2 = S1 + the tokens present in T1 but not in T2.
3) S3 = S1 + the tokens present in T2 but not in T1.
4) Return max(score(S1, S2), score(S1, S3), score(S2, S3)), where the scores are computed using SequenceMatcher.
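Since fuzzywuzzy's scores are built on the standard library's difflib.SequenceMatcher, the plain and token-sort variants can be sketched without the library itself:

```python
from difflib import SequenceMatcher

def fuzz_ratio(a, b):
    # SequenceMatcher returns similarity in [0, 1]; scale it to 0-100
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_sort_ratio(a, b):
    # Sort tokens alphabetically before comparing, so word order is ignored
    sort = lambda s: " ".join(sorted(s.lower().split()))
    return fuzz_ratio(sort(a), sort(b))

print(fuzz_ratio("this is a test", "this is a test!"))     # 97
print(token_sort_ratio("mets new york", "new york mets"))  # 100: same tokens, different order
```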

Embeddings — The embeddings used are Facebook's fastText 300-dimensional word vectors.
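fastText's pre-trained .vec files are plain text: a header line with the vocabulary size and dimension, then one word per line followed by its vector. A sketch parser (the tiny in-memory sample stands in for the real multi-gigabyte file):

```python
import io

def load_vectors(fileobj, limit=None):
    """Parse fastText .vec format: header 'n dim', then 'word v1 ... v300' lines."""
    n, dim = map(int, fileobj.readline().split())
    vectors = {}
    for i, line in enumerate(fileobj):
        if limit is not None and i >= limit:
            break  # optionally keep only the most frequent words
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Tiny 3-dimensional stand-in for the real 300-d fastText file
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
vecs = load_vectors(sample)
print(vecs["hello"])  # [0.1, 0.2, 0.3]
```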

Final Solution

I tried various classical ML models but could not get a good score, so I switched to deep learning techniques.

This solution involves blending of two deep learning models namely CNN-LSTM and XLNet.

In both models, the loss function used is binary cross-entropy.
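Because the targets are continuous values in [0, 1], binary cross-entropy is applied per target and averaged. A minimal sketch of the loss (frameworks like Keras and PyTorch provide this built in; the clipping epsilon is an assumption for numerical stability):

```python
import math

def bce(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; y_true may be any value in [0, 1]."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip predictions away from 0 and 1
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

print(bce([1.0], [0.5]))  # ln 2 ≈ 0.693
```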

CNN-LSTM

In this model, a 1-D CNN is used before the LSTM. Conv1D layers can extract information from data that has an additional time axis. For example, let the maximum number of tokens be 512. Replacing each token with its word embedding gives an input of shape (batch_size, 512, word_embedding_dim), where 512 is the number of time steps. For such inputs, we can use Conv1D layers alongside the LSTM. Below is the diagram for the model:
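The shape bookkeeping above can be checked with a small helper, assuming Keras-style padding conventions (with "same" padding the sequence length is preserved for stride 1):

```python
def conv1d_out_len(n_steps, kernel, stride=1, padding="valid"):
    """Output time steps of a 1-D convolution, Keras-style padding rules."""
    if padding == "same":
        return -(-n_steps // stride)  # ceil(n_steps / stride)
    return (n_steps - kernel) // stride + 1

# Input (batch_size, 512, embedding_dim): 512 time steps into the Conv1D
print(conv1d_out_len(512, kernel=3, padding="same"))   # 512: length preserved
print(conv1d_out_len(512, kernel=3, padding="valid"))  # 510: edges trimmed
```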