Question Answering with PyTorch Transformers: Part 3

Source: Deep Learning on Medium

Let’s adapt this example to evaluate the contexts we fetched in Part 2. There, we combined the questions and contexts into a dataframe and cached it to disk. Let’s open it up in a new notebook and work with it from there.

import pandas as pd
import seaborn as sns
import torch
from transformers import BertTokenizer, BertForQuestionAnswering

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering \
    .from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad') \
    .to(device)
question_df = pd.read_feather("question_context.feather")
question_df

Output:

<Sorry, too ugly for a blog. Go check out the notebook...>

I’m opting to use GPU acceleration when it’s available, but this isn’t strictly necessary.

question, context = question_df[["question", "context"]].iloc[1]
question, context

Output:

('When did the last country to adopt the Gregorian calendar start using it?',
'During the period between 1582, when the first countries adopted the Gregorian calendar, and 1923, when the last European country adopted it, it was often necessary to indicate the date of some event in both the Julian calendar and in the Gregorian calendar, for example, "10/21 Febru...

Combine the question and context into a single string and encode it:

input_text = "[CLS] " + question + " [SEP] " + context + " [SEP]"
input_ids = tokenizer.encode(input_text, add_special_tokens=False)
# 102 is the id of [SEP]; tokens up to and including the first [SEP]
# (the question) get segment id 0, the rest (the context) get 1
token_type_ids = [0 if i <= input_ids.index(102) else 1 for i in range(len(input_ids))]
input_ids[:10], token_type_ids[:20]

I’m passing add_special_tokens=False to the tokenizer; otherwise it would wrap the text in a second set of [CLS] and [SEP] tokens on top of the ones I added by hand.
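You can verify this by looking at the first couple of tokens with and without the flag (BertTokenizer treats the literal [CLS] and [SEP] strings as special tokens, so they survive tokenization intact):

tokenizer.convert_ids_to_tokens(tokenizer.encode(input_text)[:2])
# default add_special_tokens=True prepends its own [CLS]: ['[CLS]', '[CLS]']
tokenizer.convert_ids_to_tokens(tokenizer.encode(input_text, add_special_tokens=False)[:2])
# no duplication: ['[CLS]', 'when']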

with torch.no_grad():
    start_scores, end_scores = model(
        torch.tensor([input_ids], device=device),
        token_type_ids=torch.tensor([token_type_ids], device=device))
all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
print(' '.join(all_tokens[torch.argmax(start_scores):
                          torch.argmax(end_scores) + 1]))
print(f'score: {torch.max(start_scores)}')

Output:

1923
score: 8.123082160949707

Well, this is different. It’s the same answer we got from the pipeline API, but back then the score was a number between 0.0 and 1.0; what we have here are raw logits.
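The pipeline applies a softmax to these logits before reporting its score, and we can do roughly the same thing ourselves (a simplification: the real pipeline also masks out the question tokens and only scores valid start/end combinations):

# softmax turns the raw logits into probabilities over token positions
start_probs = torch.softmax(start_scores, dim=-1)
end_probs = torch.softmax(end_scores, dim=-1)
# probability of the argmax span, roughly comparable to the pipeline's 0-1 score
(start_probs.max() * end_probs.max()).item()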

sns.distplot(start_scores.cpu())

In this histogram, we can see that the bulk of the start_scores lie between -10 and -5. This is a result of the loss function used during training, which applies the softmax function and calculates the cross entropy against the true start position.

The softmax squishes the range of values to be between 0 and 1 and normalizes the outputs so that they sum to 1. Put another way, it turns the activation values into probability values: large negative activations end up close to 0, while strong positive values soak up most of the probability mass. The cross entropy is then just -log of the probability assigned to the true position, so the loss penalizes the model far more when it is overconfident and wrong than when it is wrong and unsure: assigning near-zero probability to the correct token sends -log(p) towards infinity, while spreading the probability around keeps the loss moderate.
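Here’s a tiny illustration of the squashing, using made-up logits (the printed probabilities are approximate):

logits = torch.tensor([-8.0, -5.0, 0.0, 3.0, 8.1])
torch.softmax(logits, dim=0)
# ≈ [1.0e-07, 2.0e-06, 3.0e-04, 6.1e-03, 9.9e-01]
# the 8.1 logit takes nearly all of the probability mass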

Let’s compare this to another example.

question, context = question_df[["question", "context"]].iloc[0]
question, context

Output:

('When did the last country to adopt the Gregorian calendar start using it?',
'"Old Style" (OS) and "New Style" (NS) are sometimes added to dates to identify which system is used in the British Empire and other countries that did not immediately change. Because the Calendar Act of 1750 altered the start of the year, and also aligned the British calendar with the Gregorian cale...

This passage is also about the adoption of the Gregorian calendar, but there is no mention of which country was last. The model detects that the question is asking about a date or year, so it focuses on “1750”.

# re-encode before running the model on the new question/context pair
input_text = "[CLS] " + question + " [SEP] " + context + " [SEP]"
input_ids = tokenizer.encode(input_text, add_special_tokens=False)
token_type_ids = [0 if i <= input_ids.index(102) else 1 for i in range(len(input_ids))]
with torch.no_grad():
    start_scores, end_scores = model(
        torch.tensor([input_ids], device=device),
        token_type_ids=torch.tensor([token_type_ids], device=device))
all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
sns.distplot(start_scores.cpu(), kde=False, rug=True)

However, the scores show it’s much less confident that this answers the question.
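We can check by comparing the peak logit against the 8.12 we got for the passage that actually contained the answer; it should come out noticeably lower:

torch.max(start_scores)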