Original article was published on Deep Learning on Medium
Motivation: erring is telling
So you want to train a neural network to distinguish puppies from people. Maybe you’d like to train a system that opens the door when your little puppy arrives but keeps strangers out, or you are the owner of an animal farm where you want to make sure that only people can get into the house.
In any case, you take your favourite CNNs (say, ResNet-152 for performance and AlexNet for good old times’ sake) and train them on a puppies-vs-people dataset scraped from the web. You are relieved to see that both of them reach about 96–98% accuracy. Lovely. But does similar accuracy imply similar strategy?
Well, not necessarily: even very different strategies can lead to very similar accuracies. However, those 2–4% that the networks got wrong carry a lot of information about their respective strategies. Suppose AlexNet and ResNet both made an error on the following image by predicting “person” instead of “puppy”:
Now this error would be just as sad as this dog looks, especially since many other puppies were recognised perfectly well, like these here:
Looking at some more images that the networks failed to recognise, you start forming a suspicion: Could it be the case that both networks implemented the classification strategy “whatever wears clothes is a person”? You’ve always suspected AlexNet to be a bit of a cheat — but what about you, ResNet? Is it too much to ask for a bit more depth of character from someone with 152 layers?
Closer inspection confirms the suspicion: conversely, the networks also misclassified a few people as “puppies”. We leave it to the reader’s imagination how those images may have looked if the models’ decision strategy indeed relies on the degree of clothing.
Erring is telling, and we can exploit this property: if two systems (e.g. two different CNNs) implement a similar strategy, they should make errors on the very same individual inputs, not just a similar number of errors (as measured by accuracy). Similar strategies make similar errors, and this is exactly what we can measure using error consistency.
Introduced in our recent paper, trial-by-trial error consistency assesses whether two systems systematically make errors on the same inputs (or trials, as this would be called in psychological experiments). Call it an analysis based on trial and error if you like.
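To make the idea concrete, here is a minimal sketch of how such a trial-by-trial measure can be computed as a Cohen’s-kappa-style score: the observed fraction of trials where two systems agree (both correct or both wrong) is compared against the agreement expected by chance from their accuracies alone. The function name and the boolean-array convention are ours, not from the paper:

```python
import numpy as np

def error_consistency(correct_a, correct_b):
    """Kappa-style error consistency between two decision makers.

    correct_a, correct_b: boolean arrays over the same trials,
    True where the respective system classified the input correctly.
    """
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    p_a, p_b = a.mean(), b.mean()  # accuracies of the two systems
    # agreement expected by chance if errors were independent,
    # given only the two accuracies
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)
    # observed agreement: fraction of trials with the same outcome
    c_obs = (a == b).mean()
    # 1 = perfectly consistent errors, 0 = chance-level overlap,
    # negative = systematically disagreeing (undefined if c_exp == 1)
    return (c_obs - c_exp) / (1 - c_exp)
```

Two systems with identical error patterns score 1, while chance-level overlap (errors falling on unrelated inputs, given each system’s accuracy) scores around 0; this is what lets the measure separate “same strategy” from “same accuracy”.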
Leaving the hypothetical toy dataset (puppies vs. people) aside, how similar are the strategies of different CNNs trained on a big dataset (ImageNet)? And are they similar to human errors on the same data? We simply went to the animal, pardon, model farm and evaluated all ImageNet-trained PyTorch models to obtain their classification decisions (correct responses vs. errors) on a dataset where we also have human decisions for comparison. Here’s what we’ve discovered.