Highlights of ICLR 2019

Computer Vision, Natural Language Processing, Graphs, Scalability, Adversarial Learning, Learning Representations, and Meta-Learning

By Marina Vinyes

The 2019 edition of ICLR took place in New Orleans and opened with a speech supporting diversity. Among concrete steps, the organizers ensured 100% gender parity among the program team, invited speakers, and session chairs. The main challenges identified for the community are gender, racial, and geographical diversity, as well as LGBT+ support. To support geographical diversity, ICLR 2020 will be held in Addis Ababa, Ethiopia, making it the first major machine learning conference held in Africa.

ICLR 2019 organizers

A broad range of topics was featured in the posters and talks; here is a (non-exhaustive) selection.

Posters from ICLR 2019

Computer Vision

Computer vision is still one of the main topics in deep learning and was well represented at ICLR. Image generation was also very present. It has many uses: art, generating a dataset to learn another task, or crafting adversarial examples.

The main difficulties to overcome are variety (not just reproducing the input dataset) and coherence (the right number of legs, preserved symmetries, …).

There are several ways to generate images; a very popular one is the GAN. Large Scale GAN Training for High Fidelity Natural Image Synthesis presents ways to scale GANs to generate high-resolution images. GANs are hard and expensive to train, so the paper details the tuning that makes them work: using a lot of resources (around 500 TPU cores), adapting the ResNet architecture (more layers and fewer parameters), changing the latent sampling distribution (e.g. Bernoulli or censored-normal latents, i.e. a ReLU applied to a Gaussian, instead of N(0,1)), and regularizing the generator. The results are really good looking:

Generated samples from Large Scale GAN Training for High Fidelity Natural Image Synthesis
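As a rough illustration of the latent-distribution change, here is a minimal numpy sketch of the alternatives explored in the paper (the dimensionality and Bernoulli probability below are illustrative, not the paper's exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128  # illustrative latent dimensionality

# Standard choice: z ~ N(0, I)
z_normal = rng.standard_normal(dim)

# Bernoulli latents: each coordinate is 0 or 1
z_bernoulli = rng.binomial(n=1, p=0.5, size=dim).astype(float)

# "Censored" normal: a ReLU applied to Gaussian samples
z_censored = np.maximum(rng.standard_normal(dim), 0.0)
```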

Another way to generate images is with autoregressive networks. Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling is an autoregressive network that introduces new data orderings: size subscaling, depth subscaling, and slices. The intuition behind it is that it is easier to generate small images than big ones. It is evaluated on CelebA-HQ and produces state-of-the-art results:

Generated celebrity images from Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling
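The size-subscaling step can be pictured in a few lines of numpy: the image is split into small interleaved slices, and an autoregressive model generates the first (coarse) slice before conditioning on it to generate the rest. This is a rough sketch of the slicing only, not the paper's full ordering scheme:

```python
import numpy as np

def size_subscale(image, s):
    """Split an HxW image into s*s interleaved slices of size (H//s, W//s).

    Slice (i, j) contains every s-th pixel starting at offset (i, j), so
    the first slice is a coarse, low-resolution sketch of the full image.
    """
    return [image[i::s, j::s] for i in range(s) for j in range(s)]

image = np.arange(64).reshape(8, 8)
slices = size_subscale(image, s=2)  # four 4x4 slices
# An autoregressive model would generate slices[0] first, then each
# subsequent slice conditioned on the slices generated so far.
```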

The main benchmark in computer vision is still object classification on ImageNet (the standard benchmark uses about 1.3M images labelled across 1,000 classes). ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness shows that common CNN architectures are biased towards texture. The way CNNs are usually explained is that the first layers identify small shapes, the next layers start identifying larger shapes in the image such as eyes, mouths, and legs, and so on.

The authors use style transfer to place the texture of one class onto images of another class. The model then identifies the images as the class of their texture rather than their original class. By quantifying this bias, the authors found that humans are biased towards shape while CNNs are biased towards texture.

To remove the texture bias, they train a CNN on a stylized ImageNet (ImageNet images re-rendered with random textures). The texture bias is mostly removed and the overall results are improved.
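Schematically, this amounts to standard supervised training with a style-randomization step. The sketch below assumes a hypothetical apply_random_style helper (e.g. a style-transfer model that swaps textures while keeping shapes), since the actual stylized dataset was generated offline:

```python
import torch

def train_step(model, optimizer, images, labels, apply_random_style):
    # apply_random_style is hypothetical: it should replace each image's
    # texture with that of a random painting, keeping object shapes intact
    stylized = apply_random_style(images)
    logits = model(stylized)
    loss = torch.nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```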

Natural Language Processing

One important task in NLP is semantic parsing: going from text to a semantic language (SQL, logical forms, …). Invited speaker Mirella Lapata talked about this in the first talk of the fourth day, Learning Natural Language Interfaces with Neural Models. She introduced a model that can transform natural language into any semantic language. For this purpose, several ideas were combined: a seq2tree architecture to handle the tree structure of semantic languages, a two-stage architecture that first produces a semantic sketch and then translates it into the target language, and training on paraphrases of questions to improve coverage. The results are promising and reach the state of the art on many tasks. The end-to-end trainable approach is particularly interesting because it avoids errors accumulating across separate modules, as is usually the case in NLP pipelines.
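The two-stage (coarse-to-fine) idea can be sketched as follows; sketch_model and fill_model are hypothetical stand-ins for the neural modules described in the talk:

```python
def parse(question, sketch_model, fill_model):
    # Stage 1: decode a coarse sketch of the meaning, e.g.
    # "SELECT ?? FROM ?? WHERE ?? > ??"
    sketch = sketch_model.decode(question)
    # Stage 2: fill in the sketch's placeholders, conditioning on both
    # the question and the sketch, to produce the final query
    return fill_model.decode(question, sketch)
```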

The second talk of the day was about applying the successful CNN architecture to NLP, where it usually fails to reach the state of the art because it cannot handle the structure of language the way RNN and attention-based approaches can. Pay Less Attention with Lightweight and Dynamic Convolutions introduces a variant of CNN that can reach the state of the art: dynamic convolutions. By testing it on four different datasets, the authors show that the idea applies to a variety of problems. They also show that decreasing the size of the context for self-attention models barely decreases performance while reducing computation time, which is the intuition behind dynamic convolutions: less attention can be enough.

From Pay Less Attention with Lightweight and Dynamic Convolutions
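A minimal single-head PyTorch sketch of a dynamic convolution: each timestep predicts its own softmax-normalized kernel over a small causal window. The paper's version adds multiple heads, weight sharing across channels, and GLU gating:

```python
import torch
import torch.nn.functional as F

def dynamic_conv(x, weight_proj, kernel_size):
    """x: (batch, time, channels); weight_proj: Linear(channels -> kernel_size).

    Unlike self-attention, each position only looks at a fixed small window
    of past positions, with weights predicted from the current timestep.
    """
    # predict a softmax-normalized kernel for every timestep
    kernels = torch.softmax(weight_proj(x), dim=-1)  # (B, T, K)
    x_pad = F.pad(x, (0, 0, kernel_size - 1, 0))     # causal left-padding
    windows = x_pad.unfold(1, kernel_size, 1)        # (B, T, C, K)
    return torch.einsum('btck,btk->btc', windows, kernels)

x = torch.randn(2, 10, 16)
proj = torch.nn.Linear(16, 4)
out = dynamic_conv(x, proj, kernel_size=4)  # shape (2, 10, 16)
```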

Besides improvements on classical NLP tasks and new architectures, some newer tasks are gaining popularity, among them tasks that require interaction with computer vision. The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision proposes a model that automatically learns concepts from images and text. It introduces an architecture (NS-CL) combining semantic parsing, visual parsing, and symbolic reasoning. It finds that using the resulting concepts makes it easier to parse new sentences: the state of the art can be reached with less training data. Visual question answering (i.e. answering questions about images) is one of the tasks this model is evaluated on.

From The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
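The symbolic-reasoning part can be pictured as executing a small program over the parsed scene. This is a toy illustration only; the actual model operates on learned, differentiable concept embeddings rather than hard symbols:

```python
# Toy scene: the visual parser would produce object-level representations
scene = [
    {"color": "red", "shape": "cube"},
    {"color": "red", "shape": "sphere"},
    {"color": "blue", "shape": "cube"},
]

def filter_objects(objects, attribute, value):
    return [o for o in objects if o[attribute] == value]

# "How many red cubes are there?" parsed into a two-step program:
red = filter_objects(scene, "color", "red")
print(len(filter_objects(red, "shape", "cube")))  # 1
```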

Graphs

Graph Neural Networks (GNNs) are mainly used for tasks such as node classification or graph classification. In the theoretical paper How Powerful are Graph Neural Networks?, the authors identify graph structures that cannot be distinguished by popular GNN variants such as GCN and GraphSAGE. They show that the most powerful GNNs are at most as powerful as the Weisfeiler-Lehman graph isomorphism test (Weisfeiler and Lehman, 1968) and propose an architecture that reaches this upper bound. The key idea is that the upper bound is achieved when the aggregation function used is injective over multisets of neighbor features (SUM is injective in this sense, as opposed to MEAN or MAX).

From How Powerful are Graph Neural Networks?
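The injectivity point is easy to see on a toy example: two different neighbor multisets can collide under MEAN or MAX, while SUM still tells them apart:

```python
a = [1.0, 1.0, 2.0, 2.0]  # neighbor features of node u
b = [1.0, 2.0]            # neighbor features of node v: a different multiset

mean = lambda xs: sum(xs) / len(xs)
print(mean(a), mean(b))  # 1.5 1.5 -> MEAN cannot tell them apart
print(max(a), max(b))    # 2.0 2.0 -> MAX cannot either
print(sum(a), sum(b))    # 6.0 3.0 -> SUM distinguishes the two multisets
```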

In another line of work, there was an interesting paper on Learning the Structure of Large Graphs. Using approximate nearest neighbors, the authors obtain a cost of O(n log n), where n is the number of samples, and in their experiments they scale up to one million nodes (with a Matlab implementation).
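The rough idea of why nearest neighbors make this tractable: instead of considering all O(n²) candidate edges, the structure is learned only over each node's k nearest neighbors. Here is a sketch using scikit-learn's (exact) kNN; the paper relies on approximate neighbors to reach the stated cost:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.randn(1000, 16)             # 1000 samples, 16 features
nn = NearestNeighbors(n_neighbors=10).fit(X)
dist, idx = nn.kneighbors(X)              # 10 candidate edges per node
# A structure-learning method can then fit edge weights over these
# O(n * k) candidate pairs instead of all O(n^2) pairs.
```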

Adversarial Learning

Ian Goodfellow gave a talk on adversarial machine learning. The quality of images generated by GANs has improved very fast from 2014, when the generated images were low-resolution and needed further upscaling, to 2019, when generated images are very high resolution. In recent years, new techniques have been introduced, such as style transfer, which make it possible to generate images that would be impossible to obtain in a supervised context.

Transferring zebra style onto a horse video

Adversarial examples help improve machine learning models by removing a type of bias. One use of GANs is to train a reinforcement learning algorithm in a simulated environment on data that looks like the real world.

Self-play is also part of adversarial learning: it made it possible for an algorithm such as AlphaGo Zero to learn Go from scratch by playing against itself for 40 days.

Learning Representations

This was a hot topic this year. Invited speaker Léon Bottou talked about learning representations using causal invariance and new ideas he and his team have been working on. Causality is an important challenge in machine learning, where algorithms are good at finding correlations but struggle to find causation. The problem is that an algorithm may learn spurious correlations that we do not expect to hold in future use cases. One idea for getting rid of spurious correlations is to use multiple context-specific datasets instead of one big consolidated dataset.

Léon Bottou during his talk at ICLR 2019
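A toy numpy illustration of the idea: two "environments" share the same causal mechanism, but a spurious feature correlates with the target in opposite directions, so only a predictor required to work across both environments is pushed towards the causal feature (the data-generating process here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_env(n, spurious_strength):
    cause = rng.standard_normal(n)
    y = cause + 0.1 * rng.standard_normal(n)           # invariant mechanism
    spurious = spurious_strength * y + rng.standard_normal(n)
    return np.stack([cause, spurious], axis=1), y

X1, y1 = make_env(10_000, spurious_strength=2.0)   # env 1: strong correlation
X2, y2 = make_env(10_000, spurious_strength=-2.0)  # env 2: flipped correlation

# The spurious feature's correlation with y changes sign across environments,
# while the causal feature's correlation is stable:
print(np.corrcoef(X1[:, 1], y1)[0, 1], np.corrcoef(X2[:, 1], y2)[0, 1])
print(np.corrcoef(X1[:, 0], y1)[0, 1], np.corrcoef(X2[:, 0], y2)[0, 1])
```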

A key contribution on improving representations is Deep InfoMax, whose principle was reused in a number of other papers at the conference. The idea is to learn good representations by maximizing the mutual information between the input and the output of a deep neural network encoder. To do so, they train a discriminator network to separate positive samples drawn from the joint distribution (i.e. (feature, representation) pairs) from negative samples drawn from the product-of-marginals distribution (i.e. (feature, representation not corresponding to that feature) pairs).

From the Deep InfoMax paper
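A stripped-down PyTorch sketch of that discriminator objective, using in-batch shuffling to approximate the product of marginals (global features only; the paper also exploits local feature maps and other mutual-information estimators):

```python
import torch
import torch.nn.functional as F

def mi_discriminator_loss(features, representations, discriminator):
    """features, representations: (batch, dim) tensors from the encoder."""
    # negatives: pair each feature with the representation of another input
    shuffled = representations[torch.randperm(representations.size(0))]
    pos = discriminator(torch.cat([features, representations], dim=1))
    neg = discriminator(torch.cat([features, shuffled], dim=1))
    # binary classification loss: the encoder is trained so that positives
    # are easy to recognize, which maximizes a mutual-information bound
    return (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos)) +
            F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg)))
```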

Another interesting paper was Smoothing the Geometry of Probabilistic Box Embeddings, which reaches the state of the art for learning geometrically-inspired embeddings that can capture hierarchies and partial orderings.

From Smoothing the Geometry of Probabilistic Box Embeddings

The intuition behind the paper is that the "hard edges" of standard box embeddings lead to unwanted gradient sparsity: once two boxes are disjoint, the gradient of their overlap is zero everywhere. The idea is to use a smoothed indicator function of the box as a relaxation of the "hard edges".
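A one-dimensional numpy sketch of the relaxation: the hard hinge on interval overlap is replaced by a softplus, so disjoint boxes still receive a gradient pulling them together (the paper derives its smoothing from Gaussian-convolved indicators; the temperature below is illustrative):

```python
import numpy as np

def softplus(x, beta=10.0):
    # numerically stable log(1 + exp(beta * x)) / beta
    return np.logaddexp(0.0, beta * x) / beta

def overlap_hard(a, b):
    # length of the intersection of intervals a and b;
    # identically zero (flat gradient) once the boxes are disjoint
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def overlap_smooth(a, b, beta=10.0):
    # softplus relaxation: small but nonzero and differentiable
    # even for disjoint boxes
    return softplus(min(a[1], b[1]) - max(a[0], b[0]), beta)

print(overlap_hard((0, 1), (2, 3)))    # 0.0 -> no training signal
print(overlap_smooth((0, 1), (2, 3)))  # small positive value
```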

Meta-learning

Also called learning to learn, meta-learning aims to learn the learning process itself. A nice paper on this topic was Meta-Learning With Latent Embedding Optimization, whose authors achieve strong experimental results on both the tieredImageNet and miniImageNet datasets compared to MAML (the reference method in the field). The main idea is to map the model parameters into a learned low-dimensional space and then perform the meta-learning in that space.

From Meta-Learning With Latent Embedding Optimization
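Schematically, the inner loop adapts a low-dimensional latent code instead of the full parameter vector; encoder, decoder, and loss_fn below are hypothetical stand-ins (the real model also uses a relation network and stochastic latents):

```python
import torch

def leo_inner_loop(encoder, decoder, support_x, support_y, loss_fn,
                   steps=5, lr=0.1):
    z = encoder(support_x, support_y)   # low-dimensional latent code
    for _ in range(steps):
        theta = decoder(z)              # decode the latent into model weights
        loss = loss_fn(theta, support_x, support_y)
        grad, = torch.autograd.grad(loss, z, create_graph=True)
        z = z - lr * grad               # adapt in latent space, not weight space
    return decoder(z)                   # task-adapted parameters
```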

Overall, ICLR 2019 was very exciting. Thank you to the organizers and contributors. See you next year!