Measuring Unintended Bias in text classification

Source: Deep Learning on Medium

Understanding the metrics behind the Jigsaw Unintended Bias in Toxicity Classification challenge

Originally posted on Jash Data Sciences Blog

What do you mean by ‘Unintended Bias’?

Kaggle held a competition that aimed at classifying online comments on the basis of their toxicity scores. However, the models often ended up classifying non-toxic comments as toxic. Comments mentioning frequently targeted minority communities or identities, such as ‘blacks’, ‘gays’, or ‘muslim’, were classified as toxic even when they were not toxic in nature.

For instance, the comment “A muslim, is a muslim, is a muslim.” is not toxic, yet it is classified as toxic. All the comments that mention a given identity are grouped together as a subgroup. This gives us a set of subgroups, one per identity. Now, let us learn more about these Identity Subgroups.

‘Identity Subgroups’ refer to the frequently targeted words/groups (e.g. words like “black”, “muslim”, “feminist”, “woman”, “gay” etc).

Why does the bias exist?

Many of the comments that mention frequently targeted identities are toxic. Hence, Deep Learning models learn to associate these identity words with toxicity, and end up classifying comments that merely mention them as toxic.

The table attached below is an example posted by Jessamyn West on her Twitter account. It shows that sentences containing the identity terms man, woman, lesbian, gay, dyke, black and white have been interpreted as toxic. The sentence that combines the three subgroups ‘woman’, ‘gay’, and ‘black’ has the highest toxicity score.

The table reflects the error of the models that classified the subgroups as toxic, even when they were not.

Examples of mis-classified comments

Measuring the Unintended Bias

An Identity group can be defined as the set of comments that mention a particular ‘identity’. Everything that doesn’t belong to the Identity group goes to the Background group.

To measure the bias, and ultimately reduce it, the dataset can be divided into two major groups: the Background group and the Identity group. Each group can be further divided into positive (toxic) and negative (non-toxic) examples. Therefore, there are 4 subsets.
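The four-way split can be sketched in a few lines of Python. This is only an illustration: the toy rows and the field names `mentions_identity` and `toxic` are assumptions made for the example, not columns from the competition data.

```python
# Hypothetical toy comments; ids and flags are made up for illustration.
comments = [
    {"id": 1, "mentions_identity": True, "toxic": True},
    {"id": 2, "mentions_identity": True, "toxic": False},
    {"id": 3, "mentions_identity": False, "toxic": True},
    {"id": 4, "mentions_identity": False, "toxic": False},
    {"id": 5, "mentions_identity": True, "toxic": False},
]


def split_four_ways(rows):
    """Partition rows into the 4 subsets: {subgroup, background} x {positive, negative}."""
    subsets = {
        "subgroup_positive": [],
        "subgroup_negative": [],
        "background_positive": [],
        "background_negative": [],
    }
    for row in rows:
        group = "subgroup" if row["mentions_identity"] else "background"
        label = "positive" if row["toxic"] else "negative"
        subsets[f"{group}_{label}"].append(row["id"])
    return subsets


subsets = split_four_ways(comments)
for name, ids in subsets.items():
    print(name, ids)
```

Every comment lands in exactly one of the four subsets, which is what lets the bias AUCs described next mix and match them.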

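The next step is to calculate the Area Under the Receiver Operating Characteristic curve (ROC-AUC). The ROC-AUC is a performance measurement for classification problems across all threshold settings. Equivalently, it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example.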

Three AUCs to measure the negative/positive mis-orderings between the subsets are defined as follows:

a. Subgroup AUC: This calculates AUC on only the examples from the subgroup, i.e., the comments that mention the identity. It represents model understanding and performance within the group itself.

A low value in this metric means the model does a poor job of distinguishing between toxic and non-toxic comments that mention the identity.

b. BNSP (Background Negative, Subgroup Positive) AUC: This calculates AUC on the negative examples from the background and the positive examples from the subgroup.

A low value here means that the model confuses toxic examples that mention the identity with non-toxic examples that do not.

c. BPSN (Background Positive, Subgroup Negative) AUC: This calculates AUC on the positive examples from the background and the negative examples from the subgroup.

A low value in this metric means that the model confuses non-toxic examples that mention the identity with toxic examples that do not.

NOTE: Looking at these three metrics together for any identity subgroup will reveal how the model fails to correctly order examples in the test data, and whether these mis-orderings are likely to result in false positives or false negatives when a threshold is selected.
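The three AUCs can be sketched on toy data as follows. This is a minimal illustration, not the competition's reference code: the arrays are made up, and `pairwise_auc` is a small helper assumed here so that the example needs only NumPy (in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used). The subset definitions follow the acronyms: BPSN is Background Positive, Subgroup Negative, and BNSP is Background Negative, Subgroup Positive.

```python
import numpy as np

# Hypothetical toy data: 1 = toxic, 0 = non-toxic.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
in_subgroup = np.array([True, True, True, True, False, False, False, False])
y_score = np.array([0.9, 0.7, 0.6, 0.8, 0.1, 0.65, 0.2, 0.3])


def pairwise_auc(y, s):
    """AUC as the fraction of (positive, negative) pairs ordered correctly."""
    pos, neg = s[y == 1], s[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def subgroup_auc(y, s, g):
    # Only the examples that mention the identity.
    return pairwise_auc(y[g], s[g])


def bpsn_auc(y, s, g):
    # Background Positive (toxic, no identity) + Subgroup Negative (non-toxic, identity).
    mask = (~g & (y == 1)) | (g & (y == 0))
    return pairwise_auc(y[mask], s[mask])


def bnsp_auc(y, s, g):
    # Background Negative (non-toxic, no identity) + Subgroup Positive (toxic, identity).
    mask = (~g & (y == 0)) | (g & (y == 1))
    return pairwise_auc(y[mask], s[mask])


sub = subgroup_auc(y_true, y_score, in_subgroup)
bpsn = bpsn_auc(y_true, y_score, in_subgroup)
bnsp = bnsp_auc(y_true, y_score, in_subgroup)
print(sub, bpsn, bnsp)
```

In this toy data the non-toxic identity comments receive fairly high scores (0.7 and 0.6), so only the BPSN AUC drops: the pattern that tends to produce false positives on comments mentioning the identity.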

With this understanding, we can now calculate the final metric that measures the bias.

Final Metric Calculation

These three AUC scores need to be combined in order to arrive at a final metric, to measure the bias. The final metric is calculated in the following manner:

final = (x * overall_auc) + ((1 - x) * bias_score)
where x = Overall Model Weight (which is taken as 0.25 here)

Mathematically, the final metric is calculated with the below formula:
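Written out with the power mean made explicit, and consistent with the expression above, it can be sketched as follows. The symbols $M_p$, $m_{s,a}$ and $N$ are notation assumed here: $m_{s,a}$ is submetric $a$ (Subgroup, BPSN or BNSP AUC) for identity subgroup $s$, and $N$ is the number of identity subgroups.

```latex
M_p(m_{\cdot,a}) = \left( \frac{1}{N} \sum_{s=1}^{N} m_{s,a}^{\,p} \right)^{1/p}

\text{final} = x \cdot \text{AUC}_{\text{overall}}
             + (1 - x) \cdot \frac{1}{3} \sum_{a=1}^{3} M_p(m_{\cdot,a}),
\qquad x = 0.25
```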

Following are the AUC scores calculated for each identity subgroup:

The final metric calculation has 2 components, namely:

a. Overall AUC: It is calculated by taking the ROC-AUC for the full evaluation set.

b. Bias Score: It is calculated by taking the average of the power means of all 3 submetrics (Subgroup AUC, BNSP AUC and BPSN AUC), where each power mean is taken across the identity subgroups.

Following is the code for calculating the power mean of each submetric:

import numpy as np

def power_mean(series, p):
    # Generalized (power) mean: raise to p, average, then take the p-th root.
    total = sum(np.power(series, p))
    return np.power(total / len(series), 1 / p)

With p = -5 (the value of the constant POWER below), the power means of the three submetrics are:

print(power_mean(bias_df[SUBGROUP_AUC], POWER))
>> 0.8402076852862012
print(power_mean(bias_df[BNSP_AUC], POWER))
>> 0.9440122934808436
print(power_mean(bias_df[BPSN_AUC], POWER))
>> 0.8228592084505357
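Combining these numbers into the final score can be sketched as below. The three power means are the values printed above; the overall AUC of 0.95 is a hypothetical placeholder, since the overall AUC is not reported here, and `final_metric` is a helper name assumed for illustration.

```python
import numpy as np

OVERALL_MODEL_WEIGHT = 0.25  # x in the formula for the final metric


def final_metric(overall_auc, submetric_power_means, x=OVERALL_MODEL_WEIGHT):
    """final = x * overall_auc + (1 - x) * bias_score."""
    # Bias score: plain average of the three submetric power means.
    bias_score = float(np.mean(submetric_power_means))
    return x * overall_auc + (1 - x) * bias_score


# Power means of Subgroup AUC, BNSP AUC and BPSN AUC, as printed above.
power_means = [0.8402076852862012, 0.9440122934808436, 0.8228592084505357]

overall_auc = 0.95  # hypothetical value, for illustration only
score = final_metric(overall_auc, power_means)
print(round(score, 4))
```

Because the bias score carries 75% of the weight, a model with a strong overall AUC still scores poorly if any submetric is weak.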

What is the difference between simple mean and power mean?

As seen above, the power mean is being calculated for all three submetrics. What if, instead of taking the power mean, we calculated the simple mean?

The simple mean is the power mean with p = 1. Following is the table that contains 4 different data sets, their means and their standard deviations.

The aim is to get a high AUC for all the identity subgroups. Raising the scores of a few identity subgroups at the cost of others is not the goal. With a simple mean, such a trade-off can go unnoticed: the mean might remain the same while the scores of a few identity subgroups get lower.

Taking the power mean with a negative exponent exposes this, because it is pulled towards the lowest scores. The more negative the value of p, the more severely the low-scoring subgroups are punished. Hence, taking -5 as the power value penalizes the lowest-scoring subgroup until it becomes better.
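A small numeric sketch makes the difference concrete. The two score lists below are invented for illustration; both have the same simple mean, but one hides a badly served subgroup.

```python
import numpy as np


def power_mean(series, p):
    """Generalized mean; p = 1 gives the simple (arithmetic) mean."""
    series = np.asarray(series, dtype=float)
    return float(np.power(np.mean(np.power(series, p)), 1 / p))


# Hypothetical per-subgroup AUCs with the same simple mean of 0.90.
even_scores = [0.90, 0.90, 0.90, 0.90]
uneven_scores = [0.99, 0.99, 0.99, 0.63]

for name, scores in [("even", even_scores), ("uneven", uneven_scores)]:
    simple = power_mean(scores, 1)
    penalized = power_mean(scores, -5)
    print(f"{name}: simple mean = {simple:.3f}, power mean (p=-5) = {penalized:.3f}")
```

The simple mean cannot tell the two lists apart, while the power mean with p = -5 stays at 0.90 for the even list and falls well below it for the uneven one, pulled towards the weakest subgroup.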


What if we look at the above toxicity classification problem in a different domain, i.e., a different country or region? The identity subgroups that are frequently targeted will change. For instance, in India the identity subgroups may relate to caste, religion, financial status, etc. In the U.S., on the other hand, they may relate to race.

The overall ROC-AUC alone does not reveal this bias: a model with a high overall score can still produce ‘false positives’ and ‘false negatives’ for specific identity subgroups. Therefore, we need the three submetrics given above, computed for each identity subgroup. Measuring the unintended bias in this way is the first step towards reducing it.

Hope this article enriched your knowledge about the unintended bias!