ICLR 2020 | A Look at Three Interesting Papers on the Robustness of Neural Networks


Models trained adversarially achieved the best performance in this robustness test. JEM does improve the model robustness to some extent. It is clear here that increasing the number of refinement steps can increase robustness, but I am curious how much extra computation it takes to achieve this improvement. Based on the above figure, the marginal improvement in robustness against adversarial attacks seems to be limited.

This paper finds a smart way to combine the discriminative and generative models instead of separating them and, thus, is able to enjoy the benefits of both. The parametrization is neat and clean. Some applications, such as error calibration, are especially interesting.
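To make the parametrization concrete, here is a minimal sketch (function and variable names are mine, not taken from the paper's code) of how a standard classifier's logits can be reused the JEM way: the softmax over the logits gives p(y|x) as usual, while the negative log-sum-exp of the same logits serves as the energy, i.e., an unnormalized log p(x).

```python
import torch
import torch.nn.functional as F

def jem_quantities(logits):
    """Reinterpret classifier logits f(x)[y] as a joint energy-based model.

    p(y|x) is the usual softmax (discriminative part), while the energy
    E(x) = -logsumexp_y f(x)[y] defines log p(x) up to the partition function
    (generative part).
    """
    p_y_given_x = F.softmax(logits, dim=-1)
    energy = -torch.logsumexp(logits, dim=-1)
    return p_y_given_x, energy

# toy usage: a batch of 4 samples and 10 classes
logits = torch.randn(4, 10)
probs, energy = jem_quantities(logits)
```

Training the log p(x) part still requires sampling from this energy model (the paper uses SGLD), which is exactly the sampling difficulty raised below.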

I have a few more questions about the optimization. As the authors mentioned, the sampling is quite hard; would this narrow the application of JEM? In other words, is there a significant performance drop when the dimensionality increases? I am also wondering how other generative models, like Bayesian neural networks, perform compared with JEM. Of course, it is unlikely that a comprehensive range of generative models could be covered, as there are simply too many.

The code of this paper has been released on GitHub (https://github.com/wgrathwohl/JEM), and both the paper and the code are worth going through.

Paper 2: Gradient Descent Maximizes the Margin of Homogeneous Neural Networks

Paper URL: https://openreview.net/pdf?id=SJeLIgBKPS

The second paper also investigates the robustness of neural networks, this time from the perspective of the optimization algorithm, building on existing studies of logistic regression and neural networks.

Instead of proposing a new algorithm, this paper focuses on the current workflow: most neural networks are trained with gradient descent or one of its variants. One question is whether these algorithms are biased towards solutions with good generalization performance on the test set.

To answer this question, the paper first studies the margin in classical machine learning algorithms. Taking the support vector machine (SVM) as an example, consider a linearly separable dataset that consists of only two classes, shown as blue and red points in the figure below.

Figure: a linearly separable dataset consisting of only 2 classes (blue and red points) and possible decision boundaries
Source: course slides of C19 Machine Learning, link: http://www.robots.ox.ac.uk/~az/lectures/ml/lect2.pdf

The black lines in the figure indicate decision boundaries. For the example given here, there exist infinitely many decision boundaries that completely separate the two classes, and the question is how to decide which one is better. For the decision boundaries shown in the first row, a large distance to the class of red points is retained, but a small perturbation of the blue points can result in misclassification because the decision boundary is so close to that class.

The decision boundary shown in the figure in row 2, column 1 is so close to both classes that a perturbation of the points near the boundary will cause misclassification. This situation can be characterized by the L2-distance from the training samples to the decision boundary, which the paper refers to as L2-robustness. A larger distance between the decision boundary and the training points can tolerate stronger perturbations.

Typically, the decision boundary is found by seeking the largest margin, i.e., the largest distance between the separating hyperplane and the closest training points of either class. In an SVM, the margin is maximized by choosing the weight vector w so that this distance to the support vectors is as large as possible.
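To make this concrete, here is a small NumPy sketch (toy data and a hand-picked boundary, purely for illustration) that computes the signed L2-distance of each training point to a linear decision boundary w·x + b = 0; the margin is then the smallest label-weighted distance.

```python
import numpy as np

def l2_distance_to_boundary(X, w, b):
    """Signed L2-distance of each point in X to the hyperplane w.x + b = 0."""
    return (X @ w + b) / np.linalg.norm(w)

# toy linearly separable data with labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0          # one candidate decision boundary

# the margin (SVM style) is the smallest distance over the training set;
# maximizing it over (w, b) gives the maximum-margin boundary
margin = np.min(y * l2_distance_to_boundary(X, w, b))
```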

Figure: SVM searches for w that maximizes the margin
Source: course slides of C19 Machine Learning, link: http://www.robots.ox.ac.uk/~az/lectures/ml/lect2.pdf

It is natural to adapt the concept of the margin to the context of neural networks. However, unlike linear models, where the margin is solely determined by the decision boundary, for non-linear neural networks the margin q_n(θ) for a single data point (x_n, y_n) is defined as:

q_n(θ) := y_n Φ(θ; x_n),

where Φ(θ; x_n) stands for the output of the neural network (for binary classification with labels y_n ∈ {−1, +1}).
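For a concrete picture, the sketch below (a hypothetical bias-free network with a scalar output and binary ±1 labels, not the paper's experimental setup) computes q_n(θ) = y_n Φ(θ; x_n) for each sample and the dataset margin as their minimum.

```python
import torch
import torch.nn as nn

# a small bias-free ReLU network with scalar output Phi(theta; x)
phi = nn.Sequential(
    nn.Linear(2, 16, bias=False), nn.ReLU(),
    nn.Linear(16, 1, bias=False),
)

def margins(model, X, y):
    """q_n(theta) = y_n * Phi(theta; x_n) for binary labels y_n in {-1, +1}."""
    return y * model(X).squeeze(-1)

X = torch.randn(8, 2)
y = (torch.randint(0, 2, (8,)) * 2 - 1).float()   # random labels in {-1, +1}
q = margins(phi, X, y)
dataset_margin = q.min()   # the network separates the data iff this is > 0
```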

In this case, the margin depends on the function Φ that the neural network has learned from the training set. The paper argues, though, that if a neural network is locally Lipschitz, then the normalized margin divided by a Lipschitz constant is a lower bound on the L2-robustness. A Lipschitz function is bounded in how fast it can change: the Lipschitz constant is at least the absolute value of the slope of the line connecting any pair of points on the graph of the function. According to the paper, previous studies have shown that “the output of almost every neural network admits a chain rule (as long as the neural network is composed by definable pieces in an o-minimal structure, e.g., ReLU, sigmoid, LeakyReLU).”

For the analysis of gradient flow, only local Lipschitz continuity is required, while for the analysis of gradient descent, C² smoothness is required, which is a stronger assumption because it demands continuous first and second derivatives.

There are a few other important assumptions that the authors make, namely homogeneity, an exponential-type loss function, and separability, in order to derive their results.

The first assumption is that the neural networks under investigation are (positively) homogeneous, which holds if there exists an order L, a positive number, such that the neural network output Φ(θ; x) satisfies the following condition:

Φ(c θ; x) = c^L Φ(θ; x) for all c > 0,

and for all θ and x, where Φ stands for the neural network output, θ for the network parameters, and x for the input. When the bias terms are removed, fully connected neural networks and CNNs that use ReLU or LeakyReLU activations are homogeneous, and the order L equals the number of layers.
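The homogeneity property is easy to check numerically. The sketch below (a toy two-layer bias-free ReLU network of my own construction) scales all parameters by a constant c and verifies that the output scales by c^L with L = 2.

```python
import torch
import torch.nn as nn

# bias-free ReLU network: positively homogeneous of order L = 2 (two linear layers)
phi = nn.Sequential(
    nn.Linear(2, 16, bias=False), nn.ReLU(),
    nn.Linear(16, 1, bias=False),
)

x = torch.randn(5, 2)
c, L = 3.0, 2
out = phi(x)

with torch.no_grad():
    for p in phi.parameters():
        p.mul_(c)                                  # theta -> c * theta

print(torch.allclose(phi(x), (c ** L) * out, atol=1e-5))   # True: Phi(c*theta; x) = c^L Phi(theta; x)
```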

The exponential-type loss function assumption requires the loss to have a certain exponential tail rather than to be strictly exponential; it covers the popular exponential loss, logistic loss, and cross-entropy loss.

The separability assumption requires that the neural network is able to perfectly separate the training data at some point during training, i.e., it reaches 100% accuracy on the training data. The authors explain that the training loss does not need to be zero; it only needs to drop below a threshold, after some time t_0, such that the accuracy is 100%. This assumption is needed to analyze the margin given its definition. As modern neural networks are often over-parameterized, the paper argues that it should not be too hard to satisfy this requirement. The analysis focuses on the training dynamics of the network after t_0.

Without going into too much detail on the reasoning, the main findings are listed below:

  1. It is possible to define a smoothed version of a normalized margin, and it can be proved that this margin is non-decreasing during training for both gradient flow and gradient descent.
  2. A constrained margin-maximization problem is constructed, and it is shown that the optimization, for both gradient flow and gradient descent, passes through an approximate KKT point after a certain amount of time. However, the KKT conditions for highly non-linear neural networks are only first-order necessary and not sufficient for global optimality, so convergence to a KKT point does not imply that the normalized margin is globally optimal. The paper also clarifies that it is not realistic to prove stronger convergence results without adding further assumptions, since gradient descent is a first-order optimization method.
  3. Practically, the paper indicates that training for longer can enlarge the normalized margin and thus improve the (L2-)robustness of the neural network.

The paper presents an empirical test, with the result plotted in the figure below. It shows that the normalized margin increases with the training epochs. However, when training with a fixed learning rate, the normalized margin increases very slowly (≈ 1.8 × 10^{−4} after 10000 epochs), as shown in figure (a). By instead applying SGD with a loss-based learning rate scheduler, the normalized margin increases much faster (≈ 1.2 × 10^{−3} after 10000 epochs). The loss-based learning rate scheduler is proposed by the authors to find the maximum possible learning rate at each epoch given the current training loss.
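The paper's actual scheduler searches for the largest usable learning rate at each epoch; the sketch below only illustrates the general idea with a simple hypothetical rule (learning rate inversely proportional to the current training loss, capped at a maximum) on toy data of my own.

```python
import torch
import torch.nn as nn

def loss_based_lr(base_lr, current_loss, max_lr=100.0):
    """Hypothetical rule: let the step size grow as the training loss shrinks."""
    return min(base_lr / max(current_loss, 1e-30), max_lr)

# toy separable data and a bias-free model, just to exercise the scheduler
X = torch.randn(64, 2)
y = (X[:, 0] > 0).float()
model = nn.Sequential(nn.Linear(2, 8, bias=False), nn.ReLU(), nn.Linear(8, 1, bias=False))
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(1000):
    loss = criterion(model(X).squeeze(-1), y)
    for group in optimizer.param_groups:
        group["lr"] = loss_based_lr(0.01, loss.item())   # adapt the lr to the current loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```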

Figure: Training loss and normalized margin during the training of CNNs, with and without bias, on MNIST. The optimizer is SGD: (a) with a fixed learning rate of 0.01, (b) with the loss-based learning rate scheduler.
Source: Lyu, K. and Li, J. (2020). Gradient Descent Maximizes the Margin of Homogeneous Neural Networks. ICLR.

Let’s further examine the training process with the loss-based learning rate scheduler. The table below lists several snapshots, noted as model-1, model-2, etc., and their performance. The training accuracy of all five models is 100%. This is expected, since the paper only focuses on the training stage in which the neural network has already perfectly fitted the training data. The normalized margin increases as training proceeds, but the test accuracy drops slightly.

Table: Statistics of the CNN without bias after training for different numbers of epochs. Model-1 to model-4 are the snapshots taken when the loss decreases below 10^{−10}, 10^{−15}, 10^{−20}, and 10^{−120}, respectively. Model-5 is the snapshot when the loss is 10^{−882}.
Source: Lyu, K. and Li, J. (2020). Gradient Descent Maximizes the Margin of Homogeneous Neural Networks. ICLR.

What is a bit confusing to me is the apparent tension between two observations: training longer can improve robustness, yet training longer might also result in overfitting. Training longer does not necessarily lead to overfitting, but for a neural network that already achieves 100% training accuracy on a complicated real-world dataset, further training does increase the risk of overfitting, which makes the conclusion harder to interpret.

Also, the relationship between the margin and the L2-robustness remains unclear to me. Allow me to quote a comment from a reviewer to explain: “As the paper argues if we have an upper bound on the Lipschitz constant w.r.t. norm, then we get a lower bound on required adversarial perturbations for any training point. However, this does not mean that training longer is necessarily better because by doing so, we might end up with a larger Lipschitz constant (even after normalizing). So even if the ‘margin’ is larger, the actual adversarial perturbations (in norm) allowed might get smaller.”

Another issue with this paper is that some of its assumptions are quite strong, such as separability and C² smoothness. The paper, though, argues that many neural network models with commonly used losses naturally satisfy them. There are other assumed preliminaries, such as the exclusion of bias terms, which are not always met in practice.

The contribution of this paper is nevertheless not negligible. It formulates a margin-maximization problem without assuming convergence of the parameter direction or of the loss. The paper also provides some theoretical guidelines on the choice of optimization algorithms. The code for the experiments has been published at https://github.com/vfleaking/max-margin.

Paper 3: On Robustness of Neural Ordinary Differential Equations

Paper URL: https://arxiv.org/abs/1910.05513

The idea of this paper is similar to that of the first paper: it also improves robustness with minimal adaptation of the current neural network architecture. This paper focuses on Neural Ordinary Differential Equation (ODE) networks. ODENets were developed from the observation that some neural networks, such as residual networks and recurrent neural networks, approximate nonlinear functions by a sequence of transformations of a hidden state h:

h_{t+1} = h_t + f(h_t, θ_t),

where t stands for the time step and f(h_t, θ_t) represents trainable layers that learn the hidden-unit dynamics.

Usually this model is based on discrete, fixed time steps. One intuitive idea is to parameterize the continuous dynamics of the hidden units by adding more layers and taking smaller steps. In the limit, this forms an ODE that describes the relationship between input and output as:

dz(t)/dt = f_θ(z(t), t), z(0) = z_in, z_out = z(T),

where f_θ refers to trainable layers and z refers to the d-dimensional state of the neural ODE, which is equivalent to the h_t described in the equation above. The input/initial state of the neural ODE is specified by z_in, and the output z_out is given by the state at time point T.
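As a minimal illustration of this formulation (not the paper's implementation, which relies on proper ODE solvers), the sketch below integrates dz/dt = f_θ(z, t) from z(0) = z_in to z(T) with fixed-step Euler updates.

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """f_theta(z, t): trainable dynamics of the hidden state."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, z, t):
        t_col = torch.full_like(z[:, :1], t)          # feed the time t as an extra input
        return self.net(torch.cat([z, t_col], dim=1))

def odenet_forward(f, z_in, T=1.0, steps=20):
    """z_out = z(T): integrate dz/dt = f_theta(z, t) from z(0) = z_in with Euler steps."""
    z, dt = z_in, T / steps
    for i in range(steps):
        z = z + dt * f(z, i * dt)
    return z

z_in = torch.randn(4, 8)        # features produced by an upstream feature extractor
z_out = odenet_forward(ODEFunc(8), z_in)
```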

This makes ODENets similar to generative models, which suggests they could also improve robustness. The paper illustrates a network architecture that contains three stages. The first stage is a feature extractor made of a series of convolutional layers. The second stage is the neural ODE, which performs the non-linear representation mapping. The last stage is a fully connected classifier that outputs the final prediction. The feature extractor is necessary to perform dimensionality reduction, because ODENets are dimension-preserving mappings.
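Putting the three stages together, here is a rough sketch of such an architecture (layer sizes and solver settings are illustrative choices of mine, not the paper's exact configuration), reusing a fixed-step Euler loop for the ODE block as in the sketch above.

```python
import torch
import torch.nn as nn

class ODENetClassifier(nn.Module):
    """Stage 1: conv feature extractor; stage 2: dimension-preserving neural-ODE
    block; stage 3: fully connected classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.ode_func = nn.Sequential(nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 32))
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x, T=1.0, steps=10):
        z = self.features(x)                  # stage 1: dimensionality reduction
        dt = T / steps
        for _ in range(steps):                # stage 2: Euler integration of the ODE block
            z = z + dt * self.ode_func(z)
        return self.classifier(z)             # stage 3: final prediction

logits = ODENetClassifier()(torch.randn(2, 1, 28, 28))   # e.g. a batch of MNIST-sized images
```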

Figure: the architecture of the ODENet used in the paper.
Source: Yan, H.; Du, J.; Tan, Y. F. V.; Feng, J. (2020). On Robustness of Neural Ordinary Differential Equations. ICLR.

The paper evaluates the robustness of ODENets under various types of input perturbations, such as random Gaussian perturbations, the fast gradient sign method (FGSM) adversarial attack, and the projected gradient descent (PGD) adversarial attack. The PGD attack is also called I-FGSM, where the I stands for iterative.
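For reference, here is a compact sketch of the two adversarial attacks (standard formulations; step sizes and clamping conventions vary across papers, and these are not necessarily the exact settings used by the authors).

```python
import torch

def fgsm(model, loss_fn, x, y, eps):
    """Fast gradient sign method: one step of size eps along the sign of the input gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(loss_fn(model(x_adv), y), x_adv)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()

def pgd(model, loss_fn, x, y, eps, alpha, steps):
    """Projected gradient descent (I-FGSM): repeated small steps, projected back
    into the l_inf ball of radius eps around the original input."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(model(x_adv), y), x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # project onto the eps-ball
            x_adv = x_adv.clamp(0, 1)                  # keep a valid image range
    return x_adv.detach()
```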

A CNN model based on the ResNet architecture was trained and tested as a baseline model for comparison. Both the ODENets and the CNN model were trained with two strategies: training with only original non-perturbed images, and training with original images together with their perturbed versions. Training datasets include MNIST, SVHN, and a subset of the ImageNet dataset, which the paper refers to as ImgNet10.

The evaluation results of models trained with only the original images are listed below. Each model was trained three times with different random seeds. The paper uses the mean classification accuracies (%) and standard deviations (mean ± std) to indicate robustness. The Gaussian noise is generated with zero mean and three different standard deviations, also listed in the table below. The adversarial attack used is FGSM, with three different l_∞-norm bounds ε as shown in the table.

Table: Robustness test on ODENet and benchmark CNN, trained with only original images.
Source: Yan, H.; Du, J.; Tan, Y. F. V.; Feng, J. (2020). On Robustness of Neural Ordinary Differential Equations. ICLR.

ODENet outperforms the CNN in every test, and the difference in classification accuracy under adversarial attack is especially large. However, ODENet's performance under attack is still far from its performance on clean images. The performance of both models trained on ImgNet10 dropped the most under attack. One drawback is that the standard deviation is calculated on only three repetitions, which is not statistically very reliable.

ODENet continues to outperform the CNN when the models are trained with added Gaussian noise, as shown in the table below. The standard deviation of the added Gaussian noise is randomly chosen from {50, 75, 100} on MNIST, {15, 25, 35} on SVHN, and {10, 15, 25} on ImgNet10. During the robustness test, Gaussian noise is drawn from a distribution with zero mean and standard deviation 100, and both FGSM and PGD adversarial examples are used, with different l_∞-norm bounds ε as listed in the table below.

Table: Robustness test on ODENet and benchmark CNN, trained with original images together with their perturbed versions.
Source: Yan, H.; Du, J.; Tan, Y. F. V.; Feng, J. (2020). On Robustness of Neural Ordinary Differential Equations. ICLR.

Adding Gaussian noise during training improves the robustness of both models. The PGD attack is more powerful than the previous attacks, and the largest performance drop is observed under it. This is especially visible in the CNN's performance on MNIST under the PGD-0.3 attack: the CNN completely fails on the adversarial examples, with 0% accuracy in all three runs. ODENet performs better, but is not entirely satisfying either.

So it is observed empirically that ODENet is more robust, but how can this observation be explained? The paper argues that this is because, in a neural ODE, a small change in the feature map will not lead to a large deviation from the output associated with the original feature map.

To verify this idea, the paper shows that ODE integral curves do not intersect. For example, consider a simple model whose state is a scalar. Assume there exist initial states z_1(0), z_2(0), and z_3(0) such that the system starts from A_1 = (0, z_1(0)), A_2 = (0, z_2(0)), and A_3 = (0, z_3(0)). Suppose A_1 is bounded by A_2 and A_3, as shown in the figure below. Then the non-intersection property ensures that the integral curve z_1(t) is always sandwiched between the integral curves z_2(t) and z_3(t).

Therefore, it is possible to find another starting point \tilde{A} in the neighbourhood of A_1, whose distance to A_1 is smaller than both the distance between A_1 and A_2 and the distance between A_1 and A_3. This means that the integral curve \tilde{z}_1(t) is also bounded; in the figure below, it is sandwiched between the two integral curves starting from A_1 and A_3.
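This sandwiching behaviour is easy to observe numerically. The sketch below (an arbitrary scalar dynamics of my own choosing, continuous in t and Lipschitz in z) integrates three initial states and checks that their ordering is preserved over time.

```python
import numpy as np

def f(z, t):
    """A scalar dynamics function, continuous in t and Lipschitz in z."""
    return np.sin(t) - 0.5 * z

def integrate(z0, T=5.0, steps=500):
    """Fixed-step Euler integration of dz/dt = f(z, t) from z(0) = z0."""
    z, dt, traj = z0, T / steps, [z0]
    for i in range(steps):
        z = z + dt * f(z, i * dt)
        traj.append(z)
    return np.array(traj)

z2, z1, z3 = integrate(-1.0), integrate(0.0), integrate(1.0)   # z1(0) lies between z2(0) and z3(0)
print(np.all((z2 <= z1) & (z1 <= z3)))   # True: z1(t) stays sandwiched for all t
```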

Figure: No integral curves intersect.
Source: Yan, H.; Du, J.; Tan, Y. F. V.; Feng, J. (2020). On Robustness of Neural Ordinary Differential Equations. ICLR.

It is important to note here that the function f must be continuous in time t and globally Lipschitz continuous in the state z; otherwise one can easily construct situations in which this theorem does not hold. The paper also implicitly assumes that the feature maps do not change drastically under a small perturbation of the inputs, which in turn requires that regularization techniques, such as weight decay, are used.

Another question relates to the bound on the distance between the integral curves. If this distance is unbounded or very large, then robustness is not guaranteed even if the theorem holds.

In further analysis, the paper acknowledges that the key to improving the robustness of ODENets is to control the difference between neighbouring integral curves. The authors propose to achieve this by (1) removing the time dependence of the dynamics and (2) imposing a certain steady-state constraint. More specifically, the dependency on the time step t in f_θ(z(t), t) is removed, which results in a re-parametrized neural ODE:

dz(t)/dt = f_θ(z(t)), z(0) = z_in.
Define a set M_1 = {(z_1(t), t) | t ∈ [0, T], || z_1(t) − z_1(0) || ≤ ε}, where z_1(t) is a solution of the re-parametrized neural ODE and ε is a small positive value. This set M_1 contains the points that lie inside the ε-neighbourhood of z_1(0) and on the curve of z_1(t) during the time interval [0, T]. Then, for an element (z_1(T′), T′) that belongs to M_1, let \tilde{z}_1(t) be a solution of the re-parametrized neural ODE whose initial state is \tilde{z}_1(0) = z_1(T′). This element enjoys the time-invariant property, namely:

\tilde{z}_1(t) = z_1(t + T′) for all t ≥ 0.
This property makes it possible to bound the difference between neighbouring integral curves, provided that \tilde{z}_1(0) is regarded as a slightly perturbed version of z_1(0):

|| \tilde{z}_1(T) − z_1(T) || = || ∫_T^{T+T′} f_θ(z_1(t)) dt || ≤ || ∫_T^{2T} |f_θ(z_1(t))| dt ||,

where all norms are l_2 norms and |f_θ| refers to the element-wise absolute value of the vector-valued function f_θ. Hence, by explicitly including the final term of the bound above, called the steady-state loss, in the training loss, robustness becomes one of the training objectives:

L_ss = (1/N) Σ_{i=1}^N || ∫_T^{2T} |f_θ(z_i(t))| dt ||,

where N refers to the training data size and the initial state of z_i(t) equals the feature of the i-th sample. Training aims at keeping L_ss small, so that robustness against small perturbations is achieved. This model is called the Time-invariant steady neural ODE (TisODE).
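A rough sketch of how such a steady-state regularizer could be computed is given below (fixed-step Euler integration and layer sizes of my own choosing; the paper integrates the ODE with a proper solver, and the exact weighting of the term in the total loss is not reproduced here).

```python
import torch
import torch.nn as nn

def steady_state_loss(f, z_in, T=1.0, steps=20):
    """Integrate the time-invariant dynamics up to 2T, accumulate |f_theta(z(t))|
    element-wise over [T, 2T], and penalize the l2 norm of the accumulated vector."""
    z, dt = z_in, T / steps
    acc = torch.zeros_like(z_in)
    for i in range(2 * steps):          # integrate over [0, 2T]
        dz = f(z)
        if i >= steps:                  # accumulate only over [T, 2T]
            acc = acc + dt * dz.abs()
        z = z + dt * dz
    return acc.norm(dim=1).mean()       # average over the batch of training samples

f_theta = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 8))
z_in = torch.randn(4, 8)                # features of a batch of training samples
l_ss = steady_state_loss(f_theta, z_in)
# total training objective (sketch): classification loss + a weighted steady-state term
# loss = cross_entropy + lam * l_ss
```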

Similar tests were done on TisODE, with results shown in the table below. At the end of the training, the steady-state loss L_ss dropped to 0.1. Models were trained with both original images and images with Gaussian noise.

Table: Robustness test on ODENet, TisODE, and benchmark CNN, trained with original images and Gaussian noise.
Source: Yan, H.; Du, J.; Tan, Y. F. V.; Feng, J. (2020). On Robustness of Neural Ordinary Differential Equations. ICLR.

Indeed, the robustness of the TisODE is improved, especially under the FGSM attack. Yet this improvement still seems to be limited under a PGD attack. For example, the accuracy only increased by 0.2% when combating PGD-0.3 on the MNIST dataset.

It is also natural to test models trained with adversarial examples. The paper does this on the MNIST dataset: models were trained with FGSM adversarial examples (ε = 0.3). The results are shown in the table below.

Table: Robustness test on ODENet, TisODE, and benchmark CNN, trained with original images and adversarial examples.
Source: The authors’ comments on openreview.net

Since the models are trained with adversarial examples, they are much more robust under the FGSM-0.3 attack than under Gaussian noise perturbation, which was not observed in the previous experiments. Performance still drops significantly under other attack methods, especially PGD-0.3. TisODE, though, shows a relatively satisfying result. Adversarial training is still the most effective way to improve robustness, but the improved robustness does not necessarily generalize to other attacks.

Overall, the idea of neural ODEs is very interesting, and it seems especially well suited to generative problems such as time-series modelling. Although the reasoning about ODENet's robustness is not entirely rigorous, the creation of TisODE is innovative. A big advantage of this work is that the proposed model can be regarded as a drop-in module that can easily be combined with other approaches to improve the robustness of a trained neural network.

Conclusion

Given the empirical results discussed in papers 1 and 3, neither proposes a more effective way to improve the robustness of neural networks than the current state of the art, which is still adversarial training. Some of the ‘side products’, though, such as the error calibration from JEM, are very interesting ways to obtain additional information regarding robustness.

Overall, there is not yet a unified or effective theoretical explanation of robustness and of how it should guide model training. We have experienced this kind of ‘saturation’ in many different research fields over the past two years; for example, network architectures have not witnessed a revolutionary change since the proposal of ResNet. On the one hand, this suggests that deep learning techniques are becoming mature and provide more possibilities for industrial development. On the other hand, it seems that innovative and impactful work is appearing less often. Luckily, we are in a field where industrial engineers can also contribute to state-of-the-art research.

In my opinion, we practitioners should also ask ourselves: how could these theoretical findings guide our practical work, and how could our empirical observations provide new insights for research?