The #paperoftheweek 44 was "Sorting out Lipschitz Function Approximation"

Source: Deep Learning on Medium


This work identifies what is required to train neural networks subject to a Lipschitz constraint without losing expressive power: each layer must preserve the gradient norm during backpropagation. With this goal in mind, the authors combine a gradient norm preserving activation function, called GroupSort, with norm-constrained weight matrices. Among their results, they show improved provable adversarial robustness by training with a hinge loss whose margin is chosen relative to the Lipschitz constant.
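To make the idea concrete, here is a minimal NumPy sketch of the GroupSort activation: pre-activations are split into consecutive groups and sorted within each group (group size 2 recovers the MaxMin activation). The function name and example values are illustrative, not taken from the authors' code.

```python
import numpy as np

def group_sort(x, group_size=2):
    """Sort pre-activations within consecutive groups of `group_size` units.

    Sorting only permutes values, so the activation is 1-Lipschitz and
    preserves the gradient norm during backpropagation.
    """
    n_units = x.shape[-1]
    assert n_units % group_size == 0, "layer width must be divisible by group size"
    grouped = x.reshape(*x.shape[:-1], n_units // group_size, group_size)
    return np.sort(grouped, axis=-1).reshape(x.shape)

# Example: a layer of 4 units split into two groups of 2.
activations = np.array([0.3, -1.2, 2.0, 0.5])
print(group_sort(activations, group_size=2))  # [-1.2  0.3  0.5  2. ]
```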

Abstract:
Training neural networks subject to a Lipschitz constraint is useful for generalization bounds, provable adversarial robustness, interpretable gradients, and Wasserstein distance estimation. By the composition property of Lipschitz functions, it suffices to ensure that each individual affine transformation or nonlinear activation function is 1-Lipschitz. The challenge is to do this while maintaining the expressive power. We identify a necessary property for such an architecture: each of the layers must preserve the gradient norm during backpropagation. Based on this, we propose to combine a gradient norm preserving activation function, GroupSort, with norm-constrained weight matrices. We show that norm-constrained GroupSort architectures are universal Lipschitz function approximators. Empirically, we show that norm-constrained GroupSort networks achieve tighter estimates of Wasserstein distance than their ReLU counterparts and can achieve provable adversarial robustness guarantees with little cost to accuracy.
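For the other ingredient, norm-constrained weight matrices, one way to keep a matrix approximately orthonormal, and hence gradient norm preserving, is Björck orthonormalization. The sketch below is a hedged illustration under assumed settings (iteration count, initial scaling), not the authors' implementation.

```python
import numpy as np

def bjorck_orthonormalize(w, iterations=25):
    """Iterate w <- w (I + 0.5 (I - w^T w)) to push w toward an orthonormal matrix."""
    # Scale down first so the iteration converges (spectral norm <= 1 is sufficient).
    w = w / np.linalg.norm(w, ord=2)
    identity = np.eye(w.shape[1])
    for _ in range(iterations):
        w = w @ (identity + 0.5 * (identity - w.T @ w))
    return w

rng = np.random.default_rng(0)
w = bjorck_orthonormalize(rng.normal(size=(4, 4)))
print(np.round(w.T @ w, 3))  # approximately the identity matrix
```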

You can find the entire article here: https://arxiv.org/abs/1811.05381