A UCD student’s research has resulted in the withdrawal of an 80-million-image library used to train artificial intelligence systems.
The research by PhD student Abeba Birhane found that hundreds of millions of images in academic datasets used to develop AI systems and applications are partly based on racist and misogynistic labels and slurs, according to the Irish Software Research Centre (Lero) and University College Dublin’s Complex Software Lab.
“Already, MIT has deleted its much-cited ‘80 Million Tiny Images’ dataset, asking researchers and developers to cease using the library to train AI and ML systems,” said the software research centre in a statement.
“MIT’s decision came as a direct result of the research carried out by University College Dublin-based Lero researcher Abeba Birhane and Vinay Prabhu, chief scientist at UnifyID, a privacy start-up in Silicon Valley.”
In the course of the work, the Lero statement says, Ms Birhane found the MIT database contained thousands of images labelled with racist and misogynistic insults and derogatory terms.
This “contaminates” the AI databases, Ms Birhane said.
“Face recognition systems built on such datasets embed harmful stereotypes and prejudices,” she said.
“Not only is it unacceptable to label people’s images with offensive terms without their awareness and consent, training and validating AI systems with such datasets raises grave problems in the age of ubiquitous AI. When such systems are deployed in the real world, in security, hiring, or policing systems, the consequences are dire, resulting in individuals being denied opportunities or labelled as criminals. More fundamentally, the practice of labelling a person based on their appearance risks reviving the long-discredited pseudoscientific practice of physiognomy.”
There are many datasets around the world that might be affected, she said.
“Lack of scrutiny has played a role in the creation of monstrous and secretive datasets without much resistance, prompting further questions such as what other secretive datasets currently exist hidden and guarded under the guise of proprietary assets?”
The researchers also found that all of the images used to populate the datasets examined were “non-consensual” images, including those of children, scraped from seven image search engines, including Google.
“From the questionable ways images were sourced, to the troublesome labelling of people in images, to the downstream effects of training AI models using such images, large-scale vision datasets may do more harm than good,” said Ms Birhane.
“We believe radical ethics that challenge deeply ingrained traditions need to be incentivised and rewarded in order to bring about a shift in culture that centres justice and the welfare of disproportionately impacted communities. I would urge the machine learning community to pay close attention to the direct and indirect impact of our work on society, especially on vulnerable groups,” Ms Birhane concluded.