Jigsaw Unintended Bias in Toxicity Classification

In this blog, I’ll explain how to create a model that classifies comments by toxicity (toxic vs. non-toxic) while minimizing unintended bias. I’ll explain the meaning of unintended bias and toxicity in later sections of this blog.

Business Problem

Problem Description

The Conversation AI team (a research initiative by Jigsaw and Google) built a toxicity model and found that it incorrectly learned to associate the names of frequently attacked identities with toxicity. So the model predicted high toxicity for comments containing words like gay, black, Muslim, white, lesbian, etc., even when the comments were not actually toxic (e.g. “I am a gay woman.”). This happened because the dataset was collected from sources where such words (or identities) are considered highly offensive. A model needs to be built that can find the toxicity in comments while minimizing the unintended bias with respect to these identities.

  • A toxic comment is a comment that is offensive and can make some people leave the discussion (on public forums).
  • Unintended bias is unplanned bias that arises because the data was collected from sources where certain words (or identities) are considered very offensive.

Problem Statement

The model built by the Conversation AI team has the problem of unintended bias, and this has to be removed (or minimized): because of it, comments that are not actually toxic get predicted as toxic, and this is not good for our business.

Objective and Constraint

Objective

  • Predict whether a comment is toxic or not.
  • Minimize unintended bias.

Constraint

  • No strict latency requirements.

Data

Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.

Data Overview

The dataset for this study is downloaded from Kaggle. There is one training file named train.csv and one test file named test.csv. The train data has 45 columns, while the test data has only two columns, id and comment_text, so only the text data will be used to train the models. The train data contains many identity columns, but only a few are required: male, female, homosexual_gay_or_lesbian, christian, jewish, muslim, black, white, psychiatric_or_mental_illness. These identities will be used to calculate the final metric.
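A minimal sketch of loading the data with pandas and keeping the identity columns for the metric. The file and column names come from the competition data; binarizing the target column at 0.5 (the raw column is the fraction of raters who marked the comment toxic) is my assumption here, not code from the original post:

```python
import pandas as pd

# Identity columns needed later for the bias AUCs (from the competition data)
IDENTITY_COLS = [
    'male', 'female', 'homosexual_gay_or_lesbian', 'christian',
    'jewish', 'muslim', 'black', 'white', 'psychiatric_or_mental_illness'
]

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')   # only has 'id' and 'comment_text'

# Only the comment text is used as model input; the identity columns are
# kept so the bias AUCs can be computed on a validation split.
X_text = train['comment_text']
y = (train['target'] >= 0.5).astype(int)  # assumption: binarize toxicity score at 0.5
```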

Mapping the real-world problem to a Machine Learning Problem

Type of Machine Learning Problem

This is a binary classification task. Target label 0 means non-toxic comments and target label 1 means toxic comments.

Performance metric

The basic performance metric is AUC (area under the ROC curve), but we’ll also use a metric that combines the overall AUC with the bias AUCs. This combined score is our primary metric, and by maximizing it we can reduce the unintended bias in our predictions. As a secondary metric I’ll use a confusion matrix for the whole data and a confusion matrix for each identity; this helps me understand for which identities my model is doing a good or bad job.

Let’s see what our final score (metric) will be.

Overall AUC

This is the ROC-AUC for the whole dataset.

Bias AUCs

Most metrics for unintended bias divide the test data into identity or demographic subgroups and then calculate the metric for each subgroup individually. For our bias AUC metrics, we’ll also divide the data by subgroup (or identity). However, instead of calculating the metrics on each subgroup in isolation, our metrics compare the subgroup to the rest of the data, which we call the “background” data.

Let’s define the four sets we need for our three bias AUC metrics:

  • D⁻ : non-toxic (negative) examples in the background data
  • D⁺ : toxic (positive) examples in the background data
  • D⁻g : non-toxic examples that mention the identity subgroup g
  • D⁺g : toxic examples that mention the identity subgroup g

All four sets are subsets of the whole dataset.

NOTE: Here background data means the subset of the whole data that does not mention the given subgroup/identity. When calculating the bias AUCs we’ll use the union of this background data and the subgroup.

Subgroup AUC: This calculates the AUC only on the given subgroup. Make two subsets of the dataset: one containing the examples of the subgroup whose target label is 1, and a second containing the examples of the subgroup whose target label is 0. Take the union of these two subsets and calculate the AUC on it, i.e. Subgroup AUC = AUC(D⁺g ∪ D⁻g). In simpler words, get all examples (positive and negative) that mention the subgroup and calculate the AUC on them.

BPSN (Background Positive, Subgroup Negative) AUC: This calculates the AUC on the positive examples in the background and the negative examples in the subgroup. Make two subsets of the dataset: the first contains the data points where the target label is 1 and the given subgroup is not mentioned, the second contains the data points of the given subgroup/identity where the target label is 0. Take the union of these two subsets and calculate the AUC, i.e. BPSN AUC = AUC(D⁺ ∪ D⁻g).

BNSP (Background Negative, Subgroup Positive) AUC: This calculates the AUC on the negative examples in the background and the positive examples in the subgroup/identity. Make two subsets of the dataset: the first contains the data points where the target label is 0 and the given subgroup is not mentioned, the second contains the data points of the given subgroup/identity where the target label is 1. Take the union of these two subsets and calculate the AUC, i.e. BNSP AUC = AUC(D⁻ ∪ D⁺g).
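To make the three bias AUCs concrete, here is a minimal sketch using scikit-learn’s roc_auc_score. It assumes a validation DataFrame df that has the identity columns, a binarized target column, and a prediction column with the model’s scores; the column names and the 0.5 threshold for “mentions the identity” are my assumptions, not the competition’s exact benchmark code:

```python
from sklearn.metrics import roc_auc_score

def subgroup_auc(df, subgroup, label='target', pred='prediction'):
    # AUC restricted to examples that mention the identity subgroup
    mask = df[subgroup] >= 0.5
    return roc_auc_score(df.loc[mask, label], df.loc[mask, pred])

def bpsn_auc(df, subgroup, label='target', pred='prediction'):
    # Background Positive, Subgroup Negative:
    # toxic background examples + non-toxic subgroup examples
    mask = ((df[subgroup] < 0.5) & (df[label] == 1)) | \
           ((df[subgroup] >= 0.5) & (df[label] == 0))
    return roc_auc_score(df.loc[mask, label], df.loc[mask, pred])

def bnsp_auc(df, subgroup, label='target', pred='prediction'):
    # Background Negative, Subgroup Positive:
    # non-toxic background examples + toxic subgroup examples
    mask = ((df[subgroup] < 0.5) & (df[label] == 0)) | \
           ((df[subgroup] >= 0.5) & (df[label] == 1))
    return roc_auc_score(df.loc[mask, label], df.loc[mask, pred])
```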

Generalized Mean of Bias AUCs: To combine the per-identity bias AUCs into one overall measure, we calculate their generalized mean, defined as

M_p(m_s) = ( (1/N) * Σ_{s=1}^{N} m_s^p )^(1/p)

where M_p is the p-th power mean function, m_s is the bias metric m calculated for identity subgroup s, and N is the number of identity subgroups.

Note: Kaggle has set the value of p to -5.

Final Metric/Score/AUC

The overall AUC and the generalized means of the three bias AUCs are combined into a single score:

score = w₀ * AUC_overall + Σ_{a=1}^{A} w_a * M_p(m_{s,a})

where A = 3 is the number of bias sub-metrics (Subgroup AUC, BPSN AUC, BNSP AUC), m_{s,a} is the bias metric for identity subgroup s using sub-metric a, and the weights w₀ and w_a are all set to 0.25.
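Continuing the sketch above (it reuses the bias AUC helpers and the same column-name assumptions), the generalized mean and the final score can be computed like this; p = -5 and the equal 0.25 weights follow the competition metric:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def power_mean(values, p=-5.0):
    # Generalized (power) mean M_p of the per-identity bias AUCs
    values = np.asarray(values, dtype=np.float64)
    return np.mean(values ** p) ** (1.0 / p)

def final_score(df, identities, label='target', pred='prediction', p=-5.0, w=0.25):
    overall = roc_auc_score(df[label], df[pred])
    sub = power_mean([subgroup_auc(df, g, label, pred) for g in identities], p)
    bpsn = power_mean([bpsn_auc(df, g, label, pred) for g in identities], p)
    bnsp = power_mean([bnsp_auc(df, g, label, pred) for g in identities], p)
    # Equal weights (0.25 each) for the overall AUC and the three bias AUCs
    return w * overall + w * (sub + bpsn + bnsp)
```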

Exploratory Data Analysis

After some Google searching and going through multiple articles, I designed 14 new features (a sketch of how a few of them can be computed follows the list below).

  1. word_count: total number of words in the sentence
  2. char_count: total number of characters in the sentence (word_count <= char_count)
  3. word_density: density of words in the sentence
  4. total_length: total length of the sentence (it includes extra spaces, special characters, etc.)
  5. capitals: number of capital characters in the sentence
  6. caps_vs_length: the ratio of the number of capital characters to the total length of the sentence, caps_vs_length = capitals / total_length
  7. punc_count: number of punctuation characters in the sentence
  8. num_exclamation_marks: number of exclamation marks (!)
  9. exlamation_vs_punc_count: the ratio of the number of exclamation marks to the total number of punctuation characters, exlamation_vs_punc_count = num_exclamation_marks / punc_count
  10. num_question_marks: number of question marks (?)
  11. question_vs_punc_count: the ratio of the number of question marks to the total number of punctuation characters
  12. num_symbols: number of symbols (@, #, $, %, ^, &, *, ~)
  13. num_unique_words: number of unique words in the sentence
  14. words_vs_unique: ratio of unique words to total words, words_vs_unique = num_unique_words / word_count
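Here is a minimal pandas sketch of how a few of these features can be computed; the exact definitions in the original notebook may differ slightly (e.g. whether spaces count as characters, or how empty comments are handled):

```python
import string

def add_handcrafted_features(df):
    text = df['comment_text']
    df['word_count'] = text.str.split().str.len()
    df['char_count'] = text.str.replace(' ', '').str.len()
    df['total_length'] = text.str.len()
    df['capitals'] = text.apply(lambda s: sum(c.isupper() for c in s))
    # Note: ratios can divide by zero for empty comments in this simple sketch
    df['caps_vs_length'] = df['capitals'] / df['total_length']
    df['punc_count'] = text.apply(lambda s: sum(c in string.punctuation for c in s))
    df['num_exclamation_marks'] = text.str.count('!')
    df['num_question_marks'] = text.str.count(r'\?')
    df['num_unique_words'] = text.apply(lambda s: len(set(s.split())))
    df['words_vs_unique'] = df['num_unique_words'] / df['word_count']
    return df

train = add_handcrafted_features(train)
```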

Text data contains symbols, special characters, HTML tags, etc., so I’m going to remove them using the code below.
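The original post’s cleaning code isn’t reproduced here; a minimal regex-based sketch of this kind of cleanup (the exact patterns may differ, e.g. whether apostrophes or numbers are kept) could look like this:

```python
import re

def clean_text(text):
    text = re.sub(r'<.*?>', ' ', text)             # drop HTML tags
    text = re.sub(r'http\S+|www\.\S+', ' ', text)  # drop URLs
    text = re.sub(r"[^a-zA-Z0-9'\s]", ' ', text)   # drop symbols / special characters
    text = re.sub(r'\s+', ' ', text).strip()       # collapse repeated whitespace
    return text.lower()

train['comment_text'] = train['comment_text'].apply(clean_text)
test['comment_text'] = test['comment_text'].apply(clean_text)
```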

Univariate Analysis of Hand-Crafted Features

There are 14 features; let’s look at the box plot, violin plot, and distribution plot for a few of them.
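A small seaborn sketch of how such plots might be produced, assuming train already has the hand-crafted features and a binarized target column (the feature name passed in is just illustrative):

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_feature(df, feature, target='target'):
    # Box plot, violin plot, and distribution plot of one feature,
    # split by toxic (1) vs. non-toxic (0) comments.
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    sns.boxplot(x=target, y=feature, data=df, ax=axes[0])
    sns.violinplot(x=target, y=feature, data=df, ax=axes[1])
    sns.kdeplot(df.loc[df[target] == 0, feature], label='non-toxic', ax=axes[2])
    sns.kdeplot(df.loc[df[target] == 1, feature], label='toxic', ax=axes[2])
    axes[2].legend()
    fig.suptitle(feature)
    plt.show()

plot_feature(train, 'word_count')
```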