Original article was published by Andreas Messalas on Artificial Intelligence on Medium
This tweet went viral with (currently) 81K retweets and almost 200K likes, and was covered in articles by the BBC and CNN. It also got the attention of many users, who posted different configurations of images of black and white people to verify for themselves whether there truly is bias in Twitter’s preview selection model. Some even tried posting images of white and black dogs, as well as cartoon characters.
Twitter’s official reply was: “We tested for bias before shipping the model & didn’t find evidence of racial or gender bias in our testing. But it’s clear that we’ve got more analysis to do. We’ll continue to share what we learn, what actions we take, & will open source it so others can review and replicate.”
At Code4Thought, we are deeply concerned with bias and discrimination in algorithmic systems, especially when these systems can crucially affect real people. So we ran our own tests with Pythia, our fairness and transparency service, to investigate whether Twitter’s model is truly bias-free, as Twitter’s official reply suggested.
For a more systematic approach, we used a specialized dataset of face images that is balanced across racial groups. To keep things simple, we used only adult black and white males in our experiments. From these we constructed a new dataset of combined photos (collages), each with an adult black male at the top, an adult white male at the bottom and a white background between them. The new dataset contained 4,009 collages, which we uploaded to a Twitter account called @bias_tester. Finally, we manually labeled the preview photo of each tweet as ‘Black male’ if Twitter’s underlying model selected the black male, and as ‘White male’ otherwise.
Using Code4Thought’s Pythia service on the new labeled dataset containing the 4,009 collage-photos and the preview label (‘Black male’, ’White male’), we examined Twitter’s model on fairness and transparency.
The metric we chose to measure fairness is the Disparate Impact Ratio (DIR), which measures how differently the model behaves across groups of people, in our case black and white adult males. More specifically, it is the ratio between the proportions of individuals that receive a positive outcome in each of the two groups. Here, the positive outcome is being selected for the preview.
If there is a large disparity in the model’s outcomes between the groups, we can claim that there might be bias in the model. According to the “4/5ths rule” of the U.S. Equal Employment Opportunity Commission (EEOC), any value of DIR below 80% can be regarded as evidence of adverse impact. Since DIR is a ratio and either group can end up with the lower selection rate, it can also exceed 100%, so we consider the acceptable range of DIR to be 80% to 120%.
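As a minimal sketch of the metric described above, the snippet below computes DIR from selection counts and checks it against the 80% to 120% range. The function names and the counts are hypothetical, chosen only for illustration; they are not the experiment's data and not Pythia's API.

```python
def disparate_impact_ratio(selected_a, total_a, selected_b, total_b):
    """Ratio of the positive-outcome rate of group A to that of group B."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("group sizes must be positive")
    rate_a = selected_a / total_a
    rate_b = selected_b / total_b
    if rate_b == 0:
        raise ValueError("group B has no positive outcomes; DIR is undefined")
    return rate_a / rate_b


def is_compliant(dir_value, lower=0.8, upper=1.2):
    """True if DIR lies in the acceptable range used in this article."""
    return lower <= dir_value <= upper


# Hypothetical counts: 100 collages, each containing one face from each group,
# so both groups have 100 members and exactly one of them is selected per collage.
dir_value = disparate_impact_ratio(selected_a=40, total_a=100,
                                   selected_b=60, total_b=100)
print(f"DIR = {dir_value:.2f}, compliant = {is_compliant(dir_value)}")
```

Because every collage contains exactly one face from each group, the two group sizes are equal and DIR reduces to the ratio of the two selection counts.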
Twitter limits the number of tweets a user can post per day, so we sent our collages in batches of 300 tweets every 3 hours. After each batch was uploaded, we manually counted how many times the black male and the white male appeared in Twitter’s preview photo and sent this data to our Pythia platform.
After all batches were sent, the total DIR was 0.61, well below the lower bound of 0.8. This analysis suggests that Twitter’s preview photo selection model is more likely to choose a white male than a black male. We can observe that, while some batches of data (blue line) were compliant, the total DIR (orange line) was consistently non-compliant, which is an indication of bias against black males.
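The per-batch and total ratios mentioned above can be tracked with a short loop. This is only a sketch: the batch counts are invented for illustration, and it assumes each collage contributes one face per group, so DIR is simply the black-male selection count divided by the white-male selection count.

```python
def running_dir(batches):
    """Return (batch_dir, total_dir) for each batch.

    batches: list of (black_selected, white_selected) counts per batch.
    """
    results = []
    total_black = total_white = 0
    for black, white in batches:
        total_black += black
        total_white += white
        # A batch with no white-male selections has no finite ratio.
        batch_dir = black / white if white else float("inf")
        total_dir = total_black / total_white if total_white else float("inf")
        results.append((batch_dir, total_dir))
    return results


# Hypothetical batches of 300 tweets each (not the experiment's real counts).
batches = [(110, 190), (140, 160), (105, 195)]
results = running_dir(batches)
for number, (batch_dir, total_dir) in enumerate(results, start=1):
    print(f"batch {number}: batch DIR = {batch_dir:.2f}, total DIR = {total_dir:.2f}")
```

In this made-up example the second batch falls inside the 0.8 to 1.2 range on its own, while the cumulative ratio stays below 0.8 throughout, which mirrors the pattern described for the two lines in the plot.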
Using explanations to verify bias
We would like to get a sense of how the underlying model “thinks” and to understand its decision process, in order to find the reasoning behind the bias discovered in the fairness evaluation. We modified our existing dataset by using the individual images of the same black and white adult males from the collages, labeling each image with 1 if it was selected for the preview and with 0 otherwise. We used Pythia’s model-agnostic explainer, which utilizes surrogate models (i.e. models that try to mimic the original model) and Shapley values, to explain the predictions of Twitter’s preview selection model, even without having direct access to it (more info about our method can be found in this corresponding paper).
Clearly, the size of our dataset (8,018 pictures) is not enough to tell us with confidence how Twitter’s internal model makes its predictions. However, it can already give us some insights.
Below are some examples of explanations from our dataset. The green and red pixels in the grayscale image show positive and negative contributions, respectively, towards selecting the image as the preview photo.
Since we do not know what the goal of Twitter’s preview selection model is (e.g. facial recognition), it is difficult to interpret the explanations. At first glance, we might observe that the model tries to identify facial characteristics such as the eyes, nose and cheeks.
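To give a feel for what Shapley values measure, here is a toy sketch that computes them exactly for a tiny hand-written stand-in model with three features. It is not Pythia's explainer and not the method from the paper: the `surrogate` function, its coefficients and the all-zero baseline are assumptions made purely for illustration, and exact enumeration like this is only feasible for a handful of features, not for image pixels.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of each feature for one prediction.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(x)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                phi += weight * (value(coalition | {i}) - value(coalition))
        contributions.append(phi)
    return contributions


def surrogate(x):
    """Hypothetical stand-in model: a score built from three made-up features."""
    return 0.5 * x[0] + 0.3 * x[1] - 0.2 * x[2] + 0.1 * x[0] * x[1]


instance = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phis = shapley_values(surrogate, instance, baseline)
print(phis)
```

Positive values push the score up relative to the baseline and negative values push it down, which is the same reading as the green and red pixels in the image explanations. The contributions always sum to the difference between the prediction for the instance and the prediction for the baseline.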