Original article was published on Artificial Intelligence on Medium

Models trained adversarially achieved the best performance in this robustness test. JEM does improve model robustness to some extent. It is clear that increasing the number of refinement steps can increase robustness, but I am curious how much speed or computation cost this improvement requires. Based on the above figure, the marginal improvement in robustness to adversarial attacks seems limited.

This paper finds a smart way to combine the discriminative and generative models instead of separating them and, thus, is able to enjoy the benefits of both. The parametrization is neat and clean. Some applications, such as error calibration, are especially interesting.

I have a few more questions about the optimization. As the authors mentioned, the sampling is quite hard; would this narrow the applicability of JEM? In other words, is there a significant performance drop when the dimensionality increases? I am also wondering how other generative models, like Bayesian neural networks, perform compared with JEM. Of course, it is unlikely that a comprehensive range of generative models could be covered, as there are simply too many.

The code of this paper has been released on GitHub (https://github.com/wgrathwohl/JEM), and both the paper and code are interesting to go through.

# Paper 2: Gradient Descent Maximizes the Margin of Homogeneous Neural Networks

Paper URL: https://openreview.net/pdf?id=SJeLIgBKPS

The second paper also investigates the robustness of neural networks from the perspective of the optimization algorithm, building on existing studies of logistic regression and neural networks.

Instead of proposing a new algorithm, this paper focuses on the current workflow: most neural networks are trained with gradient descent algorithms or their variants. One question is whether these algorithms are biased towards solutions with good generalization performance on the test set.

To answer this question, the paper first studies the margin in classical machine learning algorithms. Taking the support vector machine (SVM) algorithm as an example, consider a linearly separable dataset that consists of only two classes, shown as blue and red points in the figure below.

The black lines in the figure indicate decision boundaries. For the example given here there exist infinitely many decision boundaries that completely separate the two classes, and the question is how to decide which one is better. For the decision boundaries shown in the first row, a large distance to the class of red points is retained, but a small perturbation of the blue points can result in a misclassification, as the decision boundary sits so close to that class.

The decision boundary shown in the figure in row 2, column 1 is so close to both classes that perturbations of the points near the decision boundary will cause misclassification. This situation can be characterized by the L2-distance from the training samples to the decision boundary, which the paper refers to as L2-robustness: a larger distance between the decision boundary and the training points can tolerate stronger perturbations.

Typically, the decision boundary is found by seeking the largest *margin*, the distance between the decision boundary and the closest training points of the two classes. In SVM, the margin is maximized when the decision boundary lies midway between, and parallel to, the hyperplanes passing through the support vectors.
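To make the geometric margin concrete, here is a minimal sketch with hypothetical 2-D data and an assumed linear boundary w·x + b = 0 (not taken from the paper): the margin is the smallest signed L2-distance from a training point to the boundary.

```python
import numpy as np

# Hypothetical linearly separable 2-D data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Assumed linear decision boundary: w.x + b = 0.
w = np.array([1.0, 1.0])
b = 0.0

# Signed L2-distance of each point to the boundary; positive means
# correctly classified. The classifier's geometric margin is the minimum.
distances = y * (X @ w + b) / np.linalg.norm(w)
margin = distances.min()
print(margin)  # 2*sqrt(2) for this toy data
```

The max-margin boundary is the choice of (w, b) that makes this minimum distance as large as possible.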

It is natural to adapt the concept of margin to the context of neural networks. However, unlike linear models, where the margin is determined solely by the decision boundary, for non-linear neural networks the margin q_n(θ) for a single data point (x_n, y_n) is defined as:

q_n(θ) := y_n Φ(θ; x_n),

where Φ(θ; x_n) stands for the output of the neural network.
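As a sanity check of this definition, the sketch below evaluates q_n = y_n Φ(θ; x_n) for a tiny two-layer ReLU network with hypothetical weights (the weights and sample are illustrative, not from the paper); q_n is positive exactly when the sample is classified correctly.

```python
import numpy as np

# Hypothetical weights of a small two-layer ReLU network Phi(theta; x).
W1 = np.array([[1.0, -0.5], [0.5, 1.0]])
W2 = np.array([0.8, -1.2])

def phi(x):
    # Network output Phi(theta; x): a single real-valued score.
    return W2 @ np.maximum(W1 @ x, 0.0)

# Margin of one sample (x_n, y_n), y_n in {-1, +1}:
# q_n(theta) = y_n * Phi(theta; x_n).
x_n = np.array([1.0, 2.0])
y_n = -1
q_n = y_n * phi(x_n)
print(q_n > 0)  # True: this sample is correctly classified
```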

In this case, the margin depends on the function Φ that the neural network learned from the training set. The paper argues, though, that if a neural network is locally Lipschitz, then the margin normalized by a Lipschitz constant is a lower bound on the L2-robustness. A Lipschitz function is bounded by its Lipschitz constant in how fast it can change: the constant is at least the absolute value of the slope of the line connecting any pair of points on the graph of the function. According to the paper, previous studies have shown that “the output of almost every neural network admits a chain rule (as long as the neural network is composed by definable pieces in an o-minimal structure, e.g., ReLU, sigmoid, LeakyReLU).”
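The Lipschitz property can be illustrated numerically. For a bias-free ReLU network (a simplifying assumption on my part, not the paper's setting), ReLU is 1-Lipschitz, so the product of the layers' spectral norms gives a (loose) global Lipschitz constant, and no pair of inputs should change the output faster than that constant allows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bias-free two-layer ReLU network.
W1 = rng.standard_normal((8, 4))
W2 = rng.standard_normal((1, 8))

def f(x):
    return (W2 @ np.maximum(W1 @ x, 0.0)).item()

# ReLU is 1-Lipschitz, so the product of spectral norms (ord=2)
# upper-bounds the network's Lipschitz constant.
L = np.linalg.norm(W1, 2) * np.linalg.norm(W2, 2)

# Empirically, no sampled pair of points violates the bound.
for _ in range(100):
    x1, x2 = rng.standard_normal(4), rng.standard_normal(4)
    assert abs(f(x1) - f(x2)) <= L * np.linalg.norm(x1 - x2) + 1e-9
```

This is the sense in which the normalized margin lower-bounds L2-robustness: moving an input by distance d can change the output by at most L·d, so an output margin of q requires a perturbation of at least q/L to flip the prediction.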

For the analysis of gradient flow, only local Lipschitz continuity is required, while the analysis of gradient descent requires C² smoothness, a stronger assumption because it demands continuity of the first and second derivatives.

There are a few other important assumptions that the authors make to derive their results: homogeneity, an exponential-type loss function, and separability.

The first assumption is that the neural networks under investigation are (positively) **homogeneous**: there exists an order L, a positive number, such that the network output Φ(θ; x) satisfies the following condition: