Deep Learning has been praised as the solution to everything from self-driving cars to the world's climate. And yet, deep neural networks (the workhorse of Deep Learning) fail to satisfactorily solve even the most mundane of tasks: robust handwritten digit recognition. Consider the following examples:

The number below each digit shows the network’s prediction. It classifies all of these samples correctly. So what’s the problem? Well, consider the following images:

We modified the images just slightly, but now the neural network misclassifies all of them. These kinds of “adversarial” inputs have been known for many years. They affect basically every Deep Learning application, from object recognition and semantic image segmentation to speech recognition and spam filtering. Pretty much every single neural network currently deployed is affected and could be attacked (including e.g. Siri or Amazon Echo).
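To get a feeling for why a slight modification can be enough, here is a minimal sketch on a toy linear classifier. It is not the network or the attack behind the images above: the names `predict` and `sign_perturb`, the random weights and the random "image" are all made up for illustration. The perturbation is a gradient-sign step (for a linear score, the gradient with respect to the input is simply the weight vector), so no pixel changes by more than `eps`.

```python
import numpy as np

def predict(x, w, b):
    """Binary decision of a toy linear classifier: 1 if the score is positive."""
    return int(w @ x + b > 0)

def sign_perturb(x, w, b, eps):
    """Move every pixel by eps in the direction that pushes the score
    towards the other class (pixel range is not enforced in this toy)."""
    score = w @ x + b
    step = -np.sign(w) if score > 0 else np.sign(w)
    return x + eps * step

rng = np.random.default_rng(0)
w = rng.normal(size=784)              # hypothetical weights, one per pixel
b = 0.0
x = rng.uniform(0.2, 0.8, size=784)   # hypothetical 28x28 input, flattened

score = w @ x + b
# Smallest per-pixel change that flips a linear score is |score| / ||w||_1;
# we go 5% beyond it to land on the other side of the decision boundary.
eps = 1.05 * abs(score) / np.abs(w).sum()
x_adv = sign_perturb(x, w, b, eps)

print(predict(x, w, b), predict(x_adv, w, b), eps)
```

Because the change is spread over all 784 pixels, each pixel only needs a small nudge to flip the decision. Real attacks on deep networks are more involved, but the intuition carries over.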

But it gets even worse: consider the following set of images:

Do you recognise even the hint of a handwritten digit? No? The neural network is extremely certain that these are all zeros. These so-called unrecognisable images highlight yet another problem with today’s neural networks: they behave completely erratically if the inputs are too far away from the “normal” data (in this case *noise* instead of *digits*).

This robustness problem has been recognised by many as one of the major roadblocks to the deployment of Deep Learning. Not only for security reasons, but also because these failures highlight that we have no clue how neural networks really operate and which image features they use for classification. The number of papers trying to solve this problem has increased strongly over the last two years, but so far to no avail. In fact, the neural network that we used to classify the handwritten digits above is currently recognised as the most robust model (Madry et al.). This fact highlights just how far away we are from robust recognition models, even for simple handwritten digits.

In our recent paper, we introduce a new concept to classify images robustly. The idea is very simple: if an image is classified as a seven, then it should contain roughly two lines (one shorter, one longer) that touch each other at one end. That’s a generative way to think about digits, which is pretty natural for humans and which allows us to easily spot the signal (the lines) even amidst large amounts of noise and perturbations. Having such a model should make it easy to classify the adversarial examples featured above into the correct class.

Learning a generative model of digits (say zeros) is pretty straightforward (using a Variational Autoencoder) and, in a nutshell, works as follows: we start from a latent space of nuisance variables (which might capture things like thickness or tilt of the digit and are learnt from the data) and generate an image using a neural network. We then show examples of handwritten zeros and train the network to produce similar ones. At the end of training, the network has learnt about the natural variations of handwritten zeros:

We learn such a generative model for each digit. Then, when a new input comes along, we check which digit model can best approximate the new input. This procedure is typically called *analysis-by-synthesis*, because we *analyse* the content of the image according to the model that can best *synthesise* it. Standard feedforward networks, on the other hand, have no feedback mechanisms to check whether the input image really resembles the inferred class:

That’s really the key difference: feedforward networks have no way to check their predictions, so you simply have to trust them. Our analysis-by-synthesis model, on the other hand, checks whether certain image features are really present in the input before jumping to a conclusion.
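The analysis-by-synthesis decision rule can be sketched in a few lines. This is a toy under loud assumptions, not the model from the paper: each class gets a linear generator (`image = mean + basis @ z`) instead of a Variational Autoencoder, the best latent is found by least squares instead of the inference procedure used in the manuscript, and `LinearGenerator` and `classify` are hypothetical names.

```python
import numpy as np

class LinearGenerator:
    """Toy per-class generative model: image = mean + basis @ z (z is the latent)."""

    def __init__(self, mean, basis):
        self.mean = mean    # shape (d,)
        self.basis = basis  # shape (d, k)

    def sample(self, z):
        return self.mean + self.basis @ z

    def best_reconstruction(self, x):
        # Find the latent z whose generated image is closest to x (least squares).
        z, *_ = np.linalg.lstsq(self.basis, x - self.mean, rcond=None)
        return self.sample(z)

def classify(x, generators):
    """Return the class whose model synthesises x best, plus all reconstruction errors."""
    errors = np.array(
        [np.sum((x - g.best_reconstruction(x)) ** 2) for g in generators]
    )
    return int(np.argmin(errors)), errors

rng = np.random.default_rng(0)
d, k = 64, 3  # hypothetical image size and latent size
generators = [
    LinearGenerator(rng.normal(size=d), rng.normal(size=(d, k))) for _ in range(2)
]

# An input that class 1 can synthesise almost perfectly (plus a little pixel noise).
x_clean = generators[1].sample(rng.normal(size=k)) + 0.01 * rng.normal(size=d)
label, clean_errors = classify(x_clean, generators)

# A pure-noise input that neither class model can synthesise well.
x_noise = rng.uniform(size=d)
_, noise_errors = classify(x_noise, generators)

print(label, clean_errors.min(), noise_errors.min())
```

The point of the design is the second return value: the reconstruction errors tell you not only which class fits best, but also whether any class fits at all. In this toy, the noise input leaves a large error for every class, which is the kind of signal a feedforward network does not provide.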

We do not need a perfect generative model for this procedure to work. Our model of handwritten digits is certainly not perfect: look at the blurry edges. Nonetheless, our model can classify handwritten digits with high accuracy (99.0%) and its decisions make a lot of sense to humans. For example, the model will always signal low confidence on noise images, because they don’t look like any of the digits it has seen before. The images closest to noise that the analysis-by-synthesis model still classifies as digits with high confidence make a lot of sense to humans:

In the current state-of-the-art model by Madry et al., we found that minimal perturbations of clean digits are often sufficient to derail the model’s classification. Doing the same for our analysis-by-synthesis model yields strikingly different results:

Note that the perturbations make a lot of sense to humans and it is sometimes difficult to decide into which class the image should be classified. That’s exactly what we expect to happen for a robust classification model.

Our model has several other notable features. For example, the decisions of the analysis-by-synthesis model are much easier to interpret because one can directly see which features sway the model towards a particular decision. In addition, we can even derive some lower bounds on its robustness.

The analysis-by-synthesis model does not quite match human perception yet and there is still a long way to go (see the full analysis in our manuscript). Nonetheless, we believe these results are extremely encouraging and we hope that our work will pave the way towards a new class of classification models that are accurate, robust and interpretable. We still have to learn a lot about these new models, not least how to make inference more efficient and how to scale them to more complex data sets (like CIFAR or ImageNet). We are working hard to answer these questions and are looking forward to sharing more results with you in the future.

#### Towards the first adversarially robust neural network model on MNIST

Lukas Schott, Jonas Rauber, Matthias Bethge, Wieland Brendel

arXiv:1805.09190

Source: Deep Learning on Medium