With plenty of time on my hands recently, I decided to really get into deep learning and find out for myself just where the magic is, by doing the fast.ai course.
It’s a good course, has the right balance of forest and trees, and really delivers on the promise. By the end of lesson 2, I was re-running and testing code. When the time came for lesson 3, I started jumping around, and found a little section in Part II on data ethics.
In it, Jeremy talks about issues like bias in algorithms and makes the point that data scientists can’t ignore data ethics, whether at the professional or personal level, since all this modelling eventually ends up in some kind of product or service for people.
This got me thinking about a conceptual gap. What a data scientist actually does when she trains and validates learning models (informed by my very fulfilling first two lessons, which I’m continuing btw) sits in a very different mental realm from what it takes to think about data ethics.
When a data scientist practices deep learning, she prepares the data, tweaks the parameters that go into the algorithm, trains the model, and validates the outcome. This roughly involves making some clever mechanical tweaks to, say, the input images, such as flipping, shifting, or cropping them, to create variability (and so prevent overfitting), and setting a learning rate (a number that controls how large a step the algorithm takes as it searches for a good answer) along with other numerical inputs. The validation step then involves evaluating numbers that represent how well the model performs the task, classifying pictures of cats and dogs for example. You would also look at the images that were wrongly classified, to get a feel for why the model got them wrong.
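To make those mechanical steps concrete, here is a minimal sketch in plain Python. It is not the fast.ai API: the function names and the tiny 3x3 "image" are made up for illustration, and a real workflow would use a library's own transforms and metrics. It only shows the kind of operation involved in flipping, shifting, and cropping, and in computing accuracy and pulling out the misclassified examples.

```python
def flip_horizontal(image):
    """Mirror each row of a 2D image (a list of rows)."""
    return [row[::-1] for row in image]


def shift_right(image, pixels, fill=0):
    """Shift the image right by `pixels`, padding the left edge with `fill`."""
    width = len(image[0])
    pixels = min(pixels, width)
    return [[fill] * pixels + row[:width - pixels] for row in image]


def crop(image, top, left, height, width):
    """Cut out a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def evaluate(predictions, labels):
    """Return accuracy and the indices of the wrongly classified examples."""
    wrong = [i for i, (p, y) in enumerate(zip(predictions, labels)) if p != y]
    accuracy = 1 - len(wrong) / len(labels)
    return accuracy, wrong


# Hypothetical toy image; the numbers stand in for pixel values.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

flipped = flip_horizontal(image)
shifted = shift_right(image, 1)
cropped = crop(image, top=0, left=1, height=2, width=2)

# Hypothetical predictions and true labels for four pictures.
accuracy, wrong = evaluate(["cat", "dog", "dog", "cat"],
                           ["cat", "dog", "cat", "cat"])
print(accuracy, wrong)
```

Notice that nothing here knows or cares what the pictures are of, which is the point of the next paragraph.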
In all of this, there are a lot of heuristics-driven, domain-neutral decisions that have nothing to do with why, say, hate speech should not be promoted. Anyone comfortable with code and with some persistence could learn the practice of deep learning, which is the basic premise of fast.ai. In addition, you could apply the same steps to horses and cows, that is, without needing much domain knowledge. And here is where the conceptual gap arises.
If the practice of data science in the most successful branch of AI, deep learning, can in large part be done through heuristics that require minimal domain knowledge, it’s difficult to see how you could reason about data ethics in a robust way. Yet you would instinctively hope that practitioners are able to reason about data ethics and build that into the models, to prevent the post-apocalyptic Skynet scenarios we’ve been asked to entertain.
The problem doesn’t resolve itself if we assume that some ethics expert takes it on. You might build in checks and balances based on a review of bad outcomes, but that’s a hacky workaround. The challenge is twofold: 1) can you reason cohesively about biases and similar issues in deep learning models, and 2) can you translate that reasoning into algorithms that systematically address those biases?
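For a sense of what a review of bad outcomes could look like in practice, here is a rough sketch that compares error rates across groups. Everything in it is an assumption for illustration: the group labels "a" and "b", the toy predictions, and the helper names are hypothetical, and comparing error rates is only one of several ways to look for bias. It is also exactly the after-the-fact kind of check described above, not an answer to either of the two challenges.

```python
from collections import defaultdict


def error_rate_by_group(predictions, labels, groups):
    """Return the fraction of wrong predictions for each group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for p, y, g in zip(predictions, labels, groups):
        totals[g] += 1
        if p != y:
            errors[g] += 1
    return {g: errors[g] / totals[g] for g in totals}


def largest_gap(rates):
    """Difference between the worst and best per-group error rates."""
    return max(rates.values()) - min(rates.values())


# Hypothetical model outputs, true labels, and a group tag per example.
predictions = [1, 0, 1, 1, 0, 0]
labels = [1, 0, 0, 1, 1, 1]
groups = ["a", "a", "a", "b", "b", "b"]

rates = error_rate_by_group(predictions, labels, groups)
gap = largest_gap(rates)
print(rates, gap)
```

A large gap tells you that something is off for one group, but not why, and not how to change the model so it stops happening.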
The success of fast.ai means an accessible world of machine learning practice, a vibrant bazaar of diverse practitioners. AI will find more applications and embed itself deeper into our lives. Skynet may or may not become a reality, but algorithmic bias and its brethren are very real and need to be addressed for AI to truly thrive.
Source: Deep Learning on Medium