TinyImages Dataset and Call for Ethics in AI

A recent submission on OpenReview examined the problems with large image datasets [1]. One of the critical observations in the paper concerned the popular TinyImages dataset from MIT: the study revealed that the dataset contains offensive and disturbing content. On June 29th, 2020, MIT permanently withdrew the dataset from its page [2]. In its statement, MIT noted:

“The dataset is too large (80 million images) and the images are so small (32 x 32 pixels) that it can be difficult for people to visually recognize its content. Therefore, manual inspection, even if feasible, will not guarantee that offensive images can be completely removed.”

The researchers and MIT revealed that the dataset was created from 53,464 different nouns copied from WordNet. These terms were used as search queries to download images from the internet, but the selected words were never vetted or filtered for potential bias or extremely offensive language. The original images were never stored; only 32 x 32-pixel versions were saved, and that tiny size makes verification challenging.
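For illustration, here is a minimal sketch of how such a collection pipeline works. This is not the authors' original TinyImages code; it assumes NLTK (with the WordNet corpus downloaded via nltk.download("wordnet")) and Pillow are installed.

```python
# A rough sketch of the collection process described above -- NOT the
# original TinyImages pipeline.
from nltk.corpus import wordnet as wn
from PIL import Image

# Enumerate noun lemmas from WordNet -- roughly how the ~53,000
# unfiltered query terms were obtained.
nouns = sorted({lemma for synset in wn.all_synsets(pos="n")
                for lemma in synset.lemma_names()})
print(len(nouns), "candidate query terms")

def to_tiny(path):
    """Downscale a downloaded image to 32 x 32 pixels. Only this thumbnail
    is kept; the full-resolution original is discarded, which is what makes
    later manual verification so hard."""
    return Image.open(path).convert("RGB").resize((32, 32))
```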

MIT’s move to withdraw the dataset, prompted by the paper, has been welcomed by the community. The researchers behind this work are Vinay Uday Prabhu (https://unify.id/) and Abeba Birhane from University College Dublin. The paper is available on arXiv [3], and the researchers have also released the source code used to audit the image dataset on GitHub [4]. The report is significant at a time when the community is actively debating ethics in AI.

The far-reaching consequences of unaudited datasets for fairness and bias are many. Data scientists are highly motivated to use open-domain datasets to overcome the cold-start problem, yet while adopting such data they may not audit it for potential future threats. A second common practice is transfer learning from models trained on such open-domain datasets; the impact of transfer learning from models built on unaudited data is worth investigating. Software systems that rely on public images may damage an organization's reputation if proper precautions are not taken in time.

How do we prevent such issues from appearing in the future? Irrespective of the nature of the organization, academic, non-profit, and industrial bodies alike should consider forming an AI ethics committee. The committee should review each stage of data acquisition. In the case of TinyImages, the damage could have been prevented at the data-collection stage: words and phrases may carry very different senses on the internet than those found in thesauri or dictionaries. That was the reason Google CEO Sundar Pichai was once summoned before the US Congress. The researchers could have filtered out highly offensive words and words leaning towards or representing racial slurs, for instance with a simple blocklist applied to the candidate search terms, as sketched below.
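As a concrete but hypothetical illustration, a vetting step over the candidate search terms might look like the following sketch. The file name "offensive_terms.txt" and its contents are assumptions, and exact string matching alone would not be sufficient in practice; a curated lexicon reviewed by humans would be needed.

```python
# A minimal sketch of a vetting step at data-collection time. The blocklist
# file "offensive_terms.txt" is hypothetical; real vetting would also need
# human review of edge cases, not just exact string matching.
def load_blocklist(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def filter_query_terms(terms, blocklist):
    """Keep only candidate search terms that are not on the blocklist."""
    return [t for t in terms if t.lower() not in blocklist]

# Toy usage: in the TinyImages setting, `terms` would be the WordNet nouns.
blocklist = load_blocklist("offensive_terms.txt")
safe_terms = filter_query_terms(["apple", "bicycle", "truck"], blocklist)
```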

It is better late than never. The whole AI/ML community needs to stand up and perform similar audits on virtually all open datasets, image or not. At the same time, it is our responsibility as researchers to check, remove, and revoke any systems that are already leveraging TinyImages in their solutions. Let's work towards better AI for the generations to come.

[1] https://openreview.net/pdf?id=s-e2zaAlG3I. Accessed on 07/04/2020.

[2] https://groups.csail.mit.edu/vision/TinyImages/. Accessed on 07/04/2020.

[3] Vinay Uday Prabhu and Abeba Birhane, "Large image datasets: A pyrrhic win for computer vision?" https://arxiv.org/abs/2006.16923

[4] https://github.com/vinayprabhu/Dataset_audits