Source: Deep Learning on Medium
To identify what the model has memorized, we first need to remove, or at least reduce, the model's adversarial inputs. There is a large body of research on defending against adversarial attacks, but we found adversarial training to be the most effective defense. The idea is to attack the model at each training iteration and use the resulting adversarial examples as the training data. After training, the model classifies adversarial examples correctly and no longer misclassifies inputs within a certain perturbation radius.
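The training loop described above can be sketched in a few lines. This is a minimal illustration, not the post's actual setup: it substitutes a logistic-regression "model" with analytic gradients and a single-step (FGSM-style) attack for the CNNs and multi-step attacks used in the experiments, so the structure of attack-then-train is the point, not the model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(X, y, w, b, eps):
    """Single-step attack: nudge each input in the direction that increases the loss."""
    p = sigmoid(X @ w + b)
    grad_x = (p - y)[:, None] * w          # dLoss/dx for sigmoid + cross-entropy
    return np.clip(X + eps * np.sign(grad_x), 0.0, 1.0)

def adversarial_train(X, y, eps=0.1, lr=0.5, epochs=200):
    """Adversarial training: each iteration attacks the current model and
    takes the gradient step on the adversarial examples, not the clean ones."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])
    b = 0.0
    for _ in range(epochs):
        X_adv = fgsm(X, y, w, b, eps)              # attack the current model
        p = sigmoid(X_adv @ w + b)
        w -= lr * (X_adv.T @ (p - y)) / len(y)     # train on the adversarial batch
        b -= lr * np.mean(p - y)
    return w, b
```

After training this way, inputs perturbed within the `eps` ball should still be classified correctly, which is the robustness property the rest of the post relies on.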
We trained three different models on the CIFAR-10 dataset, two traditionally trained (VGG16, ResNet) and one adversarially trained (ResNet), then attacked them with a projected gradient descent (PGD) attack. The results are shown below; the reconstructions are night and day. By reducing the number of adversarial examples through adversarial training, the reconstructions become clearer and more detailed, and the images the model memorized can be identified more easily.
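For reference, a PGD attack is just iterated gradient ascent on the loss with a projection back into an epsilon-ball around the original input. The sketch below, an assumption-laden stand-in rather than our experimental code, again uses a toy logistic-regression model so the input gradient can be written analytically; for a real CNN the gradient would come from backpropagation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_attack(x, y, w, b, eps=0.1, alpha=0.02, steps=10):
    """L-infinity PGD: each step ascends the loss gradient w.r.t. the input,
    then projects back into the eps-ball around the original image."""
    x_adv = x.copy()
    for _ in range(steps):
        p = sigmoid(x_adv @ w + b)
        grad = (p - y) * w                        # dLoss/dx for sigmoid + cross-entropy
        x_adv = x_adv + alpha * np.sign(grad)     # gradient-ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep a valid pixel range
    return x_adv
```

The two `clip` calls are the "projection" in projected gradient descent: the perturbation can never exceed `eps` per pixel, and the result stays a valid image.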
One interesting difference between memorization in deep models and in shallow models is the quantity of samples that are memorized. In the previous post we discussed how a shallow model memorizes roughly the average image of a class. A deep model has enough capacity to both cluster and memorize. This means we can generate higher-fidelity images that are not blurred by variations in pose, and we can generate a greater number of samples, as shown below.
The next question is which actual data points are being memorized. To answer this, we computed the cosine similarity between the target model's final convolutional-layer features for our reconstructed images and for every image in the CIFAR-10 dataset. This is how we learned that the strange long-necked blue bird we generated is real: it is a cassowary, and there are numerous samples of this bird in the dataset. Reconstructions from traditionally trained models retrieve unstructured sets of similar samples, suggesting that adversarially trained models not only generate higher-fidelity images but also focus more on particular samples or clusters.
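The matching step above amounts to a cosine-similarity nearest-neighbor search in feature space. A minimal sketch, assuming feature extraction (the final conv-layer activations) has already been done elsewhere and both the reconstruction and the dataset are given as feature vectors:

```python
import numpy as np

def nearest_training_images(recon_feats, dataset_feats, k=5):
    """Rank dataset images by cosine similarity to a reconstruction.

    recon_feats:   (d,) feature vector of one reconstructed image
    dataset_feats: (n, d) feature vectors of the training images
    Returns the indices and similarities of the top-k matches.
    """
    a = recon_feats / np.linalg.norm(recon_feats)
    B = dataset_feats / np.linalg.norm(dataset_feats, axis=1, keepdims=True)
    sims = B @ a                      # cosine similarity to every dataset image
    top = np.argsort(-sims)[:k]      # indices of the k most similar images
    return top, sims[top]
```

Because cosine similarity is scale-invariant, a reconstruction that matches a training image's features up to brightness or contrast still ranks that image first.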
Inverting facial expressions
Extracting images from a dataset of animals and objects may not be too concerning; most people are more than happy to share pictures of their dogs or cats. But what about individual faces? The same methods extend to models trained on facial images. We tested this on the Kaggle facial expression recognition challenge, which consists of facial images grouped into seven classes of emotion (angry, disgust, fear, happy, sad, surprise, and neutral). The model is not trained to recognize any specific individual, only to classify emotions, yet we observed that in some cases it memorizes some individuals more than others. We trained the same models as before, one adversarially trained and one traditionally trained. The results are shown below.
Although the facial reconstructions are not perfect, they do reveal information about the original individuals. As training techniques advance and models become more capable, new vulnerabilities will emerge. Adversarial training is one example: by making models more robust, it also makes them easier to interpret, and therefore easier to extract their secrets from. If you are interested in diving deeper, check out our paper.
 M. Fredrikson, S. Jha, and T. Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1322–1333. ACM, 2015.
 F. A. Mejia, P. Gamble, Z. Hampel-Arias, M. Lomnitz, N. Lopatina, L. Tindall, and M. A. Barrios. Robust or private? Adversarial training makes models more vulnerable to privacy attacks. arXiv preprint arXiv:1906.06449, 2019.
 R. Shokri, M. Stronati, C. Song, and V. Shmatikov. Membership inference attacks against machine learning models. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 3–18. IEEE, 2017.