Using transfer learning with a BERT model to predict the correct descriptive answer to open-ended questions

An open-ended question is a question that cannot be answered with a “yes” or “no” response, or with a static response. Open-ended questions are phrased as a statement which requires a response. The response can be compared to information that is already known to the questioner.

Examples of open-ended questions:

  • Tell me about your relationship with your supervisor.
  • How do you see your future?
  • Tell me about the children in this photograph.
  • What is the purpose of government?
  • Why did you choose that answer?

Problem

The problem is a simple question-and-answer task: we have to predict whether an answer is correct for a given question. The catch here is that the answers are descriptive. For each question there are 10 candidate answers, of which 9 are incorrect and 1 is correct. An example of a question-and-answers set is:

1) symptoms of air pollution exposure ?

Answer 1: Long-term exposure to polluted air can have permanent health effects such as: 1 Accelerated ageing of the lungs. 2 Loss of lung capacity and decreased lung function. 3 Development of diseases such as asthma, bronchitis, emphysema, and possibly cancer. 4 Shortened life span.

2) symptoms of air pollution exposure ?

Answer 2: Long-term exposure to particulate pollution can result in significant health problems including: 1 Increased respiratory symptoms, such as irritation of the airways, coughing or difficulty breathing. 2 Decreased lung function. 3 Aggravated asthma. 4 Development of chronic respiratory disease in children.

3) symptoms of air pollution exposure ?

Answer 3: Short-term effects of air pollution on health. It is possible that very sensitive individuals may experience health effects even on Low air pollution days. Use the Daily Air Quality Index to understand air pollution levels and find out about recommended actions and health advice. The advice here applies to anyone experiencing symptoms. Short-term effects. Air pollution has a range of effects on health. However, air pollution in the UK on a day-to-day basis is not expected to rise to levels at which people need to make major changes to their habits to avoid exposure; Nobody need fear going outdoors, but they may experience some noticeable symptoms depending on which of the following population groups they are in:

4) symptoms of air pollution exposure ?

Answer 4: Long-term exposure to particulate pollution can result in significant health problems including: Increased respiratory symptoms, such as irritation of the airways, coughing or difficulty breathing Decreased lung function

5) symptoms of air pollution exposure ?

Answer 5: Long-term exposure to polluted air can have permanent health effects such as: Accelerated ageing of the lungs; Loss of lung capacity and decreased lung function; Development of diseases such as asthma, bronchitis, emphysema, and possibly cancer; Shortened life span

6) symptoms of air pollution exposure ?

Answer 6: Short-term exposure to particulate pollution can: Aggravate lung disease causing asthma attacks and acute bronchitis; Increase susceptibility to respiratory infections; Cause heart attacks and arrhythmias in people with heart disease; Even if you are healthy, you may experience temporary symptoms, such as: Irritation of the eyes, nose and throat; Coughing

7) symptoms of air pollution exposure ?

Answer 7: Children with asthma may notice that they need to increase their use of reliever medication on days when levels of air pollution are higher than average. Adults and children with heart or lung problems are at greater risk of symptoms. Follow your doctor’s usual advice about exercising and managing your condition. It is possible that very sensitive individuals may experience health effects even on Low air pollution days. Anyone experiencing symptoms should follow the guidance provided.

8) symptoms of air pollution exposure ?

Answer 8: Long-term exposure to polluted air can have permanent health effects such as: Accelerated ageing of the lungs. Loss of lung capacity and decreased lung function. Development of diseases such as asthma, bronchitis, emphysema, and possibly cancer.

9) symptoms of air pollution exposure ?

Answer 9: Your actual risk of adverse effects depends on your current health status, the pollutant type and concentration, and the length of your exposure to the polluted air. High air pollution levels can cause immediate health problems including: 1 Aggravated cardiovascular and respiratory illness. 2 Added stress to heart and lungs, which must work harder to supply the body with oxygen. 3 Damaged cells in the respiratory system.

10) symptoms of air pollution exposure ?

Answer 10: Exposure to such particles can affect both your lungs and your heart. Long-term exposure to particulate pollution can result in significant health problems including: Increased respiratory symptoms, such as irritation of the airways, coughing or difficulty breathing; Decreased lung function; Aggravated asthma

Out of all these answers, the 9th one is the correct answer. For every correct answer the predicted label must be 1, and for every wrong answer the predicted label must be 0. Thus it is a binary classification problem.

Data link: https://drive.google.com/file/d/1zoWkF_R71JaCKBKf1jk0fVMV6ADEpmf/view?usp=sharing

Challenge

Some of the challenges in solving this problem are:

  • Highly imbalanced data. For every correct answer there are 9 wrong answers, so the imbalance ratio is 9:1 (a quick check is sketched after this list).
    [Plot: distribution of the correct and wrong answer classes in the dataset]
  • No scope for feature engineering. Since each question is distinct and unique, no meaningful hand-crafted features can be derived.
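Before modeling, the imbalance can be quantified with a quick check like the one below (a minimal sketch; the CSV path and the label column name are assumptions about how the dataset is laid out):

import pandas as pd

# Hypothetical file and column names: label = 1 for the correct answer, 0 otherwise
data = pd.read_csv("qa_data.csv")
counts = data["label"].value_counts()
print(counts)
print("Imbalance ratio (wrong : correct) = {:.0f} : 1".format(counts[0] / counts[1]))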

These challenges have to be overcome while solving the problem.

Approach

1) Data Cleaning

Simple data-processing steps are applied, such as:

  • Removal of HTML tags and entities such as <p>, <a>, &nbsp;

import re

# function to clean the sentence of any html tags and entities and make it lower case
def cleanhtml(sentence):
    cleanr = re.compile('<.*?>')
    cleanentity = re.compile('&.*?;')  # non-greedy, so only individual entities are matched
    cleantext = re.sub(cleanr, ' ', sentence)
    cleantext = re.sub(cleanentity, ' ', cleantext)
    return cleantext.lower()

Since URLs have a special structure, normal text processing would make them lose their meaning; they also don't carry any significant meaning for this task, so removing them keeps the data clean.

url_regex = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

for i in range(preprocessed_questions.shape[0]):
    preprocessed_questions[i] = re.sub(url_regex, '', preprocessed_questions[i])
  • Removal of punctuation and special characters

Punctuation marks such as ;, %, and & should be removed as they don't carry significant meaning. Contractions such as isn't and haven't are also expanded to is not, have not, and so on (a small sketch of this expansion follows the snippet below).

# function to clean the sentence of any punctuation or special characters
def cleanpunc(sentence):
    cleaned = re.sub(r'[?!"#:=+_{}\[\]\-$%^&]', r'', sentence)
    cleaned = re.sub(r'[.,()\\/\-~`><*$@;→]', r'', cleaned)
    return cleaned
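The contraction expansion mentioned above (isn't → is not, haven't → have not, etc.) is not shown in the snippets; a minimal sketch, with an illustrative and deliberately incomplete contraction map, could look like this:

# Illustrative contraction map -- the real project may use a much longer list
contractions = {
    "isn't": "is not",
    "haven't": "have not",
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
}

def expand_contractions(sentence):
    # Replace each known contraction with its expanded form
    for short, full in contractions.items():
        sentence = sentence.replace(short, full)
    return sentence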

Here, we have deliberately skipped stop-word removal, for reasons that will be revealed in a later section.

2) Metrics Used

Since the data is highly imbalanced, the metric should be one that accommodates this fact. Hence the metrics chosen are:

  • F1-score, as it is high only when both precision and recall are high.
  • Confusion matrix, as we must know how many data points in each class are predicted correctly and wrongly.

Accuracy is not a preferred metric here because of the high imbalance in the data.
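The training loop later in this post calls two helpers, flat_accuracy and f1, that are not defined in the post itself. A plausible sketch of them, built on NumPy and scikit-learn, is shown below; the exact definitions in the original notebook may differ:

import numpy as np
from sklearn.metrics import f1_score, confusion_matrix

def flat_accuracy(logits, labels):
    # Turn the two-column logits into hard 0/1 predictions and compare with the true labels
    preds = np.argmax(logits, axis=1).flatten()
    labels = labels.flatten()
    return np.sum(preds == labels) / len(labels)

def f1(labels, logits):
    # F1-score of the positive class (label 1 = correct answer)
    preds = np.argmax(logits, axis=1).flatten()
    return f1_score(labels.flatten(), preds)

def show_confusion_matrix(labels, logits):
    preds = np.argmax(logits, axis=1).flatten()
    print(confusion_matrix(labels.flatten(), preds))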

3) Technique Used

The machine learning technique used to solve this problem is deep learning.

Why use Deep Learning ?

The main reasons to choose deep learning for this problem are:

  • The problem does not impose interpretability or latency constraints. Had those constraints existed, deep learning would not have been the right option: a deep model passes the input through all its layers, which makes label prediction slow, and such complex models offer little interpretability.
  • No feature engineering seems possible for this problem, so a deep learning technique that automatically learns features from the data is very helpful.

Why BERT ?

Several neural network architectures were tried, such as the ones below:

  • Neural network with 5 hidden layers
  • Neural network with four 1-dimensional convolutional layers
  • Neural network with both 1-dim convolutional and LSTM layers
  • Neural network with LSTM layers and time-distributed dense layers

The first four models are trained with two types of embeddings of the data:

  1. GloVe vectors
  2. TF-IDF weighted GloVe vectors (a minimal sketch of this weighting follows this list)
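As an illustration of the second representation, a TF-IDF weighted GloVe sentence vector can be built roughly as below. This is only a sketch: glove_vectors is assumed to be a word-to-vector dictionary loaded from a GloVe file, tfidf a scikit-learn TfidfVectorizer already fitted on the training text, and the IDF weight is used as a simple stand-in for the full TF-IDF weight.

import numpy as np

def tfidf_weighted_glove(sentence, glove_vectors, tfidf, dim=300):
    # Average the GloVe vectors of the words, each weighted by its IDF weight
    idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))
    vec, total = np.zeros(dim), 0.0
    for word in sentence.split():
        if word in glove_vectors and word in idf:
            vec += idf[word] * glove_vectors[word]
            total += idf[word]
    return vec / total if total > 0 else vec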

A fifth model instead has an embedding layer that learns the embeddings from the data as part of training. But none of them worked. Even hyper-parameter tuning and increasing the data size did not improve the results. The predictions for class 1 remained very poor, and all the trained models are effectively dumb models, because the problem is much harder than what these simple architectures can solve.

[Plot: confusion matrix obtained with these baseline models]

We could keep experimenting with different architectures until one works for this problem, but that would be computationally expensive: it takes a lot of computational power to train and evaluate each candidate.

Here is where transfer learning saves the day. Instead of trying multiple architectures, which is computationally intensive, it is best to use transfer learning and fine-tune a pre-trained NLP model on this problem. BERT, the state-of-the-art NLP model from Google, is chosen for transfer learning.

4) BERT

BERT (Bidirectional Encoder Representations from Transformers) is a recent model published by Google and is currently the state-of-the-art technique for many NLP tasks. Its major innovation is its masked-language-model objective: a few of the words in the input are masked, and the model tries to predict them from the surrounding context. In contrast to previous NLP models that process text only from left to right, BERT processes the input in both directions.

A PyTorch implementation provides a pre-trained BERT model that we fine-tune on our data. The model is trained on Google Colab, using its GPU.

To train any deep learning model with the Google Colab GPU, open your Jupyter notebook in Colab, click EDIT → NOTEBOOK SETTINGS, select GPU as the hardware accelerator in the ensuing dialog box, and click SAVE.
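Once the GPU runtime is attached, it is worth confirming that PyTorch can actually see it and fixing the device object used by the rest of the code (a small sanity-check sketch):

import torch

# Use the Colab GPU if one is attached, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))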

from pytorch_pretrained_bert import BertForSequenceClassification  # assumed package, matching the BertAdam optimizer used below

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.cuda()

Input Format:

The input is fed to BERT in the form "question ? answer". This worked better than a plain concatenation of question and answer when training the BERT model, and hence it is used.
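The post does not show how these "question ? answer" strings are turned into the input ids and attention masks consumed by the training loop below. A plausible preparation step, following the usual BERT fine-tuning recipe ([CLS]/[SEP] tokens, padding, attention masks), is sketched here; train_texts, train_labels, MAX_LEN and the batch size are assumptions, and validation_dataloader is built the same way from the held-out split.

import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from pytorch_pretrained_bert import BertTokenizer

MAX_LEN = 128  # maximum sequence length fed to BERT (assumption)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

def encode(texts, labels):
    input_ids, masks = [], []
    for text in texts:
        # "[CLS] question ? answer [SEP]", tokenized and mapped to vocabulary ids
        tokens = ["[CLS]"] + tokenizer.tokenize(text)[:MAX_LEN - 2] + ["[SEP]"]
        ids = tokenizer.convert_tokens_to_ids(tokens)
        padding = [0] * (MAX_LEN - len(ids))
        # attention mask: 1 for real tokens, 0 for padding positions
        masks.append([1] * len(ids) + padding)
        input_ids.append(ids + padding)
    return torch.tensor(input_ids), torch.tensor(masks), torch.tensor(labels)

train_data = TensorDataset(*encode(train_texts, train_labels))
train_dataloader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=32)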

from pytorch_pretrained_bert import BertAdam

param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.0}
]
optimizer = BertAdam(optimizer_grouped_parameters,
                     lr=2e-5,
                     warmup=.1)

Training the BERT model

from tqdm import trange
import torch

# Store our loss and accuracy for plotting
train_loss_set = []
# Number of training epochs (authors recommend between 2 and 4)
epochs = 4
# trange is a tqdm wrapper around the normal python range
for _ in trange(epochs, desc="Epoch"):

    # Training

    # Set our model to training mode (as opposed to evaluation mode)
    model.train()

    # Tracking variables
    tr_loss, tr_accuracy, tr_f1_score = 0, 0, 0
    nb_tr_examples, nb_tr_steps = 0, 0
    # Train the data for one epoch
    for step, batch in enumerate(train_dataloader):
        if nb_tr_steps % 1000 == 0:
            print("Step", nb_tr_steps)
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        # Clear out the gradients (by default they accumulate)
        optimizer.zero_grad()
        # Forward pass
        loss = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        train_loss_set.append(loss.item())
        # Backward pass
        loss.backward()
        # Update parameters and take a step using the computed gradient
        optimizer.step()

        with torch.no_grad():
            # Forward pass, calculate logit predictions
            logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        tmp_tr_accuracy = flat_accuracy(logits, label_ids)
        tmp_f1_score = f1(label_ids, logits)

        tr_accuracy += tmp_tr_accuracy
        tr_f1_score += tmp_f1_score

        # Update tracking variables
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1
    print("Train loss: {}".format(tr_loss/nb_tr_steps))
    print("Train Accuracy: {}".format(tr_accuracy/nb_tr_steps))
    print("Train f1-score: {}".format(tr_f1_score/nb_tr_steps))

    # Validation

    # Put model in evaluation mode to evaluate loss on the validation set
    model.eval()
    # Tracking variables
    eval_loss, eval_accuracy, eval_f1_score = 0, 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    # Evaluate data for one epoch
    for batch in validation_dataloader:
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        # Telling the model not to compute or store gradients, saving memory and speeding up validation
        with torch.no_grad():
            # Forward pass, calculate logit predictions
            logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        tmp_f1_score = f1(label_ids, logits)

        eval_accuracy += tmp_eval_accuracy
        eval_f1_score += tmp_f1_score
        nb_eval_steps += 1
    print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))
    print("Validation f1-score: {}".format(eval_f1_score/nb_eval_steps))

5) Results

The result was actually magical. Unlike the simple neural networks I trained earlier, which were completely dumb models, BERT worked like a charm.

[Plot: confusion matrix, precision and recall of the BERT model]

The BERT model gives very good predictions for class 1. Class 0 has a very good recall of 0.95, while class 1 has a comparatively lower recall of 0.755. The F1-score obtained is 0.89, and the accuracy of 0.89 indicates that the model performed pretty well.
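The per-class numbers above come from evaluating the fine-tuned model on held-out data. A sketch of how such a report can be produced, assuming a test_dataloader built the same way as the training one, is:

import numpy as np
import torch
from sklearn.metrics import classification_report, confusion_matrix

model.eval()
all_preds, all_labels = [], []
for batch in test_dataloader:  # hypothetical held-out dataloader
    b_input_ids, b_input_mask, b_labels = tuple(t.to(device) for t in batch)
    with torch.no_grad():
        logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    all_preds.extend(np.argmax(logits.detach().cpu().numpy(), axis=1))
    all_labels.extend(b_labels.cpu().numpy())

print(confusion_matrix(all_labels, all_preds))
print(classification_report(all_labels, all_preds, digits=3))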

The results would improve further if we trained BERT on a larger dataset and for a somewhat higher number of epochs. Due to memory and session-timeout constraints while training the BERT model, it was very difficult to train for more epochs with a larger amount of data, but with more computational and memory power the results could be improved.

The above case study is a nice example showing that, when solving a deep learning problem without sufficient infrastructure and computational resources, transfer learning is the technique that works like a charm 🙂

References:

  1. https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
  2. https://mccormickml.com/2019/07/22/BERT-fine-tuning/

My github link to the entire work: https://github.com/gayathriabhi/BERT-model-to-predict-the-correct-descriptive-answer

LinkedIn Profile: https://www.linkedin.com/in/gayathri-s-a90322126/