Our favorite papers from ICLR 2020

Original article was published on Deep Learning on Medium

Eric Allen and Rafael Tena, Tulco Labs

The Covid-19 pandemic has reshaped social interactions around the world, and scientific gatherings have been no exception. The plan for this year’s ICLR conference (the premier event for scientific research in machine learning representations) was to hold it in Addis Ababa, the capital city of Ethiopia. However, due to the crisis, the event was urgently redesigned to become what was perhaps the first fully virtual top-tier academic conference. While we missed the robust in-person exchange of ideas and the state of mind that comes from being fully immersed in a physical location, the overall experience was excellent and offered some distinct advantages to attendees. The virtual format allowed participants from all over the world to access pre-recorded content on demand, attend livestreamed Q&A sessions with keynote speakers and presenters, and engage in online chat sessions with other participants, all at a fraction of the cost of attending in person. We hope the virtual format will remain a permanent addition to physical conferences. In a post-pandemic future, maintaining virtual conference tracks could boost attendance by students and researchers from less affluent parts of the globe and help to democratize scientific inquiry.

ICLR drew over 5,600 participants from nearly 90 countries, more than doubling last year’s 2,700. Although it is difficult to assess how much of this growth was driven by the virtual format, the reduction in both cost and scarcity is undeniable. The conference program consisted of 8 keynote speakers, 4 expo sessions, 20 poster sessions, 15 workshops, 20 social events, a chat room, and virtual break rooms for discussions. Moreover, ICLR accepted 687 out of 2,594 submissions. That is a lot of content! We’ve combed through much of it to bring you our favorites. So, in no particular order, and with no claim to any sort of objectivity whatsoever, here are the parts that stood out to us.

Reinforcement Learning

Model Based Reinforcement Learning for Atari

Łukasz Kaiser, Mohammad Babaeizadeh, Piotr Miłos, Błażej Osiński, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, George Tucker, Henryk Michalewski

Google Brain, deepsense.ai, Polish Academy of Sciences, University of Warsaw, University of Illinois at Urbana-Champaign and Stanford University


A lot of effort in the reinforcement learning community has gone into using the 57 Atari games as a common benchmark to compare algorithms. Although deep RL has made great strides in achieving high performance at these games, the RL algorithms used have required huge amounts of data, much more than humans require to achieve the same performance. The authors suggest that humans are able to learn so quickly because they learn “how the game works”; essentially, they form a model. They apply a form of model-based reinforcement learning they call Simulated Policy Learning (SimPLe) and show that it outperforms state-of-the-art reinforcement learning when learners are limited to just 100,000 interactions (about two hours of real-time play), a much more realistic volume of experience to train on when comparing against human play.

Learning the Arrow of Time for Problems in Reinforcement Learning

Nasim Rahaman, Steffen Wolf, Anirudh Goyal, Roman Remme, Yoshua Bengio

Mila, MPI, and Ruprecht-Karls-Universität


This is a really nice idea that gives us yet another way of capturing intrinsic reward in reinforcement learning problems. In a Markov Decision Process (MDP), imagine that we had a way to know which actions are irreversible. Such knowledge could be useful in choosing between two actions; all things being equal, we would probably prefer actions that don’t “paint us into a corner” and prevent us from getting back to where we started. The idea in this paper is to use a deep net to learn a potential function on observed states: The inputs are the states, and the labels (i.e., potential function values) are the time steps at which each state occurred. With such a function, we can form an expectation as to whether one state is reachable from another by comparing their potential function values. As the title of the paper suggests, we can view our net as encoding a notion of the arrow of time and use it to guide our learned policies.
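
To make the potential-function idea concrete, here is a minimal sketch in Python. It is not the authors’ method: it swaps the deep net for a lookup table that stores the mean time step at which each state was seen, and the state names and trajectories are made up for illustration.

```python
from collections import defaultdict


def fit_potential(trajectories):
    """Tabular stand-in for the net: mean time step at which each state was seen."""
    times = defaultdict(list)
    for trajectory in trajectories:
        for t, state in enumerate(trajectory):
            times[state].append(t)
    return {state: sum(ts) / len(ts) for state, ts in times.items()}


# Hypothetical toy world: an agent moves left and right and may break a vase.
# Once the vase is broken, no trajectory ever returns to an "intact" state.
trajectories = [
    ["intact_left", "intact_right", "intact_left", "broken_left", "broken_right"],
    ["intact_right", "intact_left", "broken_left", "broken_right", "broken_left"],
]
potential = fit_potential(trajectories)


def potential_change(state, next_state):
    """A large increase hints that the transition is hard to undo."""
    return potential[next_state] - potential[state]


print(potential_change("intact_left", "intact_right"))  # moving around: small change
print(potential_change("intact_left", "broken_left"))   # breaking the vase: large jump
```

In this toy world, walking back and forth barely changes the potential, while breaking the vase produces a jump, which is the kind of signal one could turn into a penalty for irreversible actions.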

Symbolic Manipulation

Deep Learning For Symbolic Mathematics

Guillaume Lample, François Charton

Facebook AI Research


There are increasingly many attempts to connect neural networks with symbolic computation, an important area of focus if deep learning is to impact many of the broader problems of AI. This work tackles a few fundamental symbolic tasks: symbolic integration and the solution of ordinary differential equations (first and second order). Formulas are represented unambiguously in prefix notation as sequences of operations and variables, and fed to a pretty standard transformer. So, for the integration problem, formulas are fed into the transformer and the corresponding integrals are read out. Training and test data are generated, and much of the effort in making this approach work involves constructing the data in the best way. Straightforward approaches (such as just randomly generating formulas as outputs and computing derivatives as the corresponding inputs) result in lopsided datasets: The inputs are much bigger than the outputs, causing the transformer to generalize poorly. The authors manage to construct more balanced datasets using integration by parts. For differential equations, they also use beam search to generate and check multiple potential outputs until a correct one is found. The results outperform state-of-the-art rule-based approaches in commercial tools.
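
As an illustration of the prefix-notation step, the sketch below flattens an expression tree into a token sequence of the kind a transformer could consume. The nested-tuple encoding and the operator names (`add`, `mul`, `pow`, `cos`) are assumptions made for this example, not the paper’s actual vocabulary.

```python
from typing import Union

# An expression is a leaf (variable name or integer) or a tuple
# (operator, operand, ...). This encoding is an assumption of the sketch.
Expr = Union[str, int, tuple]


def to_prefix(expr: Expr) -> list:
    """Flatten an expression tree into a list of prefix-notation tokens."""
    if isinstance(expr, tuple):
        operator, *operands = expr
        tokens = [operator]
        for operand in operands:
            tokens.extend(to_prefix(operand))
        return tokens
    return [str(expr)]


# 3 * x**2 + cos(x)
expression = ("add", ("mul", 3, ("pow", "x", 2)), ("cos", "x"))
print(" ".join(to_prefix(expression)))  # add mul 3 pow x 2 cos x
```

Because every operator takes a fixed number of operands, the sequence needs no parentheses and can be parsed back into exactly one tree, which is what makes the representation unambiguous.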

Neural Symbolic Reader: Scalable Integration of Distributed and Symbolic Representations for Reading Comprehension

Xinyun Chen, Chen Liang, Adams Wei Yu, Denny Zhou, Dawn Song, Quoc V. Le

Google Brain and UC Berkeley


Have you ever wondered about automatically solving the kinds of word-based math problems you might find in a grade school textbook? The authors present NeRD (Neural symbolic ReaDer), a system that takes as input math problems in English text. Not only does it output the answers, but, like any good student, it shows its work. First, BERT is connected to the input, and its output is connected to a programmer, essentially an LSTM that outputs calculation programs written in a small programming language. The programming language includes primitives for referencing entities and quantities in the text; these primitives are evaluated using a variety of NLP techniques such as entity resolution. Moreover, the language supports recursive composition, enabling increasingly complex solution techniques. The program is then executed to produce an answer. Training is done on some standard datasets for reading comprehension (MathQA and DROP). Of course, the datasets include only the questions and answers, not the NeRD programs, so training is semi-supervised; the researchers provide some program annotations and then train using Expectation Maximization. The result outperforms the baselines; on MathQA, it does so by 25.5%.

Convolutional Neural Networks

Contrastive Learning of Structured World Models

Thomas Kipf, Elise van der Pol, Max Welling

University of Amsterdam


If an AI system is to understand and interact with our world the way we do, it will need to i) break an environment into component objects, and ii) anticipate the interactions between those objects. The authors take a step toward building those facilities by introducing systems they call Contrastively-trained Structured World Models (C-SWMs). Going from raw pixels, they use a CNN to produce a set of vector representations of a set of objects in a scene. (An important limitation is that the number of objects represented is fixed.) The output of the CNN feeds into an MLP, which produces latent representations of each object. They then feed these latent representations into a transition model, implemented as a graph neural network, which predicts the next state for each object. Training is done entirely in the latent space, to keep the focus on learned features relevant to transitions. Evaluation includes testing in various physics simulators as well as Pong and Space Invaders. Qualitatively, the transition function mirrors many aspects of the training environment. For example, in their physics environment, objects move in circular motions, and the latent variables also move circularly in transition space. Quantitative performance improves on earlier results with world models (see Ha and Schmidhuber, 2018).

Deep Learning Theory

Identity Crisis: Memorization and Generalization Under Extreme Overparameterization

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Michael C. Mozer, Yoram Singer

Google Brain, UC Berkeley, and Princeton University


One of the most fascinating mysteries of deep neural networks is their ability to transcend the usual tradeoff between bias and variance. Large networks are, on the one hand, able to memorize their training data, and on the other hand, able to generalize what they’ve learned. This paper explores this phenomenon for both fully connected nets and convolutional nets in the extremely simple case where there is only one training example: an MNIST image labeled with itself. With one training example, two natural functions that a net might learn are: i) the constant function (always output the training example) or ii) the identity function. It turns out that one-layer fully connected nets generalize in an idiosyncratic way based on their initialization. Meanwhile, deep fully connected nets with ReLU activations bias toward the constant function. Interestingly (and despite the well-known expressive equivalence of single-layer linear nets and deep linear nets), deep linear nets have an inductive bias more like deep ReLU nets: They too bias toward the constant function. On the other hand, shallow convnets tend to learn the identity function, whereas deep convnets once again bias toward a constant mapping.
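
The dependence on initialization in the one-layer case can be seen in a few lines of Python. This is a sketch under stated assumptions (a linear layer with no bias, squared error, plain gradient descent, and a 3-dimensional stand-in for the image), not a reproduction of the paper’s experiments.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]


def train_on_single_example(W, x, lr=0.1, steps=200):
    """Gradient descent on 0.5 * ||W x - x||^2 for the single example x."""
    W = [row[:] for row in W]
    for _ in range(steps):
        residual = [p - t for p, t in zip(matvec(W, x), x)]
        for i, r in enumerate(residual):
            for j, xj in enumerate(x):
                W[i][j] -= lr * r * xj  # gradient is the outer product residual * x^T
    return W


W0 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.4, 0.1, 0.2]]  # arbitrary initialization
x = [1.0, 2.0, 2.0]        # the single training example
unseen = [2.0, -1.0, 0.0]  # a test input orthogonal to x

W = train_on_single_example(W0, x)
print(matvec(W, x))        # approximately x: the example is fit
print(matvec(W, unseen))   # same as matvec(W0, unseen): set by the initialization
```

Every update is a multiple of the outer product of the residual with x, so the trained layer agrees with the initial one on any input orthogonal to x. In this setting the layer learns neither the identity nor the constant function on such inputs; it simply keeps whatever the random initialization did.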

Differentiation of Blackbox Combinatorial Solvers

Marin Vlastelica Pogančić, Anselm Paulus, Vit Musil, Georg Martius, Michal Rolinek

MPI, and Università degli Studi di Firenze


Ever since Ali Rahimi’s controversial Test of Time Award Talk at NIPS 2017, there has been an increased focus on stamping out the “alchemy” in deep learning. This paper dispels some of the most persistent folklore about neural networks. For example, highly suboptimal local minima really do exist in the loss landscapes of realistic neural networks on realistic problems. Not only do they provide a proof under realistic assumptions; they also provide an empirical investigation. It turns out that careful weight initialization is important for avoiding these local minima. Similarly surprising analyses are provided for the effectiveness of weight decay, neural tangent kernels, and rank minimization.

At the Intersection with Neuroscience

Emergence of functional and structural properties of the head direction system by optimization of recurrent neural networks

Christopher J. Cueva, Peter Y. Wang, Matthew Chin, Xue-Xin Wei

Columbia University


One question a lot of deep learning researchers have in the back of their minds is: How much of what we are learning about artificial neural networks provides insight into biological brains? This paper makes some headway on this question by looking at one of the few biological neural circuits we understand (a navigation circuit in flies) and seeing if we can cause the structure of this circuit to emerge in an artificial neural network. And it turns out that we can: The authors trained an RNN to predict the head direction of flies, and observed many of the same navigational structures found in the flies’ brains emerge in the RNN! This suggests not only that artificial neural networks are indeed capturing important aspects of biological neural structure, but that many of those aspects generalize beyond a specific biological instantiation.

Time Series

N-BEATS: Neural basis expansion analysis for interpretable time series forecasting

Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio

Element AI and Mila


Time series (TS) forecasting is an important business problem. It underlies critical areas of modern business, such as inventory control, customer management, planning, finance and marketing. Although every point of forecasting accuracy gained can be worth millions of dollars, limited interest in the topic has resulted in ML and DL underperforming classical statistical TS forecasting approaches. The recent success of a hybrid combining an LSTM stack with a classical Holt-Winters statistical model has lent credence to the idea that hybrid approaches are the way forward. This paper challenges that assumption by delivering a pure DL architecture that uses no time-series-specific components and outperforms hybrid and well-established statistical approaches, while also being interpretable enough to derive “seasonality-trend-level” insights. The basic building block has a fork architecture with two parts: i) a fully connected network that produces forward and backward predictions of basis coefficients, and ii) forward and backward basis layers that use the basis coefficients to produce backcast and forecast outputs, where the backcast is a reconstruction of the input given the constraints on the functional space that the block can use to approximate signals.
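
The fork structure of the building block can be sketched as follows. This is only an illustration: the names `trend_basis`, `expand`, `block`, and `mean_level_net` are hypothetical, the polynomial basis is one possible choice, and `mean_level_net` is a hand-written stand-in for the fully connected network that would be learned in practice.

```python
def trend_basis(length, degree):
    """Polynomial basis vectors t**p, p = 0..degree, on a grid in [0, 1)."""
    grid = [i / length for i in range(length)]
    return [[t ** p for t in grid] for p in range(degree + 1)]


def expand(coefficients, basis):
    """Basis layer: weighted sum of the basis vectors."""
    length = len(basis[0])
    return [sum(c * vec[i] for c, vec in zip(coefficients, basis)) for i in range(length)]


def block(window, coefficient_net, horizon, degree):
    """One block: coefficients fork into a backcast and a forecast."""
    theta_back, theta_fore = coefficient_net(window)
    backcast = expand(theta_back, trend_basis(len(window), degree))
    forecast = expand(theta_fore, trend_basis(horizon, degree))
    return backcast, forecast


def mean_level_net(window):
    """Stand-in for the learned network: one 'level' coefficient per branch."""
    level = sum(window) / len(window)
    return [level], [level]


window = [1.0, 2.0, 3.0, 4.0]
backcast, forecast = block(window, mean_level_net, horizon=2, degree=0)
print(backcast)  # reconstruction of the input within the basis: [2.5, 2.5, 2.5, 2.5]
print(forecast)  # prediction for the next two steps: [2.5, 2.5]
```

With a degree-0 basis the block can only express a constant level, so the backcast is the best constant reconstruction of the window. Restricting the basis in this way is what makes the outputs readable as level, trend, or seasonality components.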

Applying Deep Learning

Deep Double Descent: Where Bigger Models and More Data Hurt

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever

Harvard University and OpenAI


Figuring out the optimal size of a deep model and the number of epochs to train for is a challenging empirical process. This paper shows that a variety of modern deep learning tasks exhibit what the authors call a “double-descent” phenomenon in which, as model size is increased, performance first gets worse and then gets better. Moreover, double descent also manifests as a function of training epochs. Experiments show that the phenomenon is robust and occurs in a variety of tasks, architectures, and optimization methods. The authors then define what they call the effective model complexity (EMC) of a training procedure as the maximum number of samples on which a model can achieve close to zero training error. Finally, while in most settings increasing the number of samples decreases error, the authors show situations in which more data is actually worse.

Distance-Based Learning from Errors for Confidence Calibration

Chen Xing, Sercan Arik, Zizhao Zhang, and Tomas Pfister

Google Cloud AI


Confidence estimation, or ‘confidence calibration’, is still a challenging problem for DNNs. For a ‘well-calibrated’ model, predictions with higher confidence should be more likely to be accurate. However, studies have shown that traditional training to minimize the softmax cross-entropy loss produces poorly calibrated models that are overconfident even when the classification is inaccurate. This paper proposes Distance-based Learning from Errors (DBLE), a training method to produce better-calibrated DNNs. DBLE learns a distance-based representation space through classification and exploits distances in the space to yield well-calibrated classification. In this space, a test sample’s distance to its ground-truth class center can calibrate the model’s performance. However, during inference the distance to the ground-truth class center is unknown. This is solved by training a separate confidence model, jointly with the classification model, that estimates this distance at inference. The confidence model is trained with samples misclassified during the training of the classification model. DBLE achieves superior confidence calibration on multiple DNN models and datasets.
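
To show what classification in a distance-based space looks like, here is a small Python sketch. It covers only the distance-to-class-center part and assumes squared Euclidean distance and hand-picked 2-D embeddings; the learned embedding network and DBLE’s separate confidence model are not included.

```python
import math


def class_centers(embeddings_by_class):
    """Average the training embeddings of each class to get its center."""
    centers = {}
    for label, points in embeddings_by_class.items():
        dims = len(points[0])
        centers[label] = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    return centers


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def class_probabilities(embedding, centers):
    """Softmax over negative squared distances to the class centers."""
    scores = {label: -squared_distance(embedding, c) for label, c in centers.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


# Hypothetical embeddings of a few training samples.
train = {"cat": [[0.0, 0.0], [0.0, 2.0]], "dog": [[4.0, 0.0], [4.0, 2.0]]}
centers = class_centers(train)

near_cat = class_probabilities([0.5, 1.0], centers)  # close to the cat center
between = class_probabilities([2.0, 1.0], centers)   # equidistant from both centers
print(near_cat, between)
```

A sample sitting halfway between two centers gets an even split, while one close to a center gets a confident prediction, so the distance itself carries information about how much the prediction should be trusted.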

Novel Ideas

Convolutional Conditional Neural Processes

Jonathan Gordon, Wessel P. Bruinsma, Andrew Y. K. Foong, James Requeima, Yann Dubois, Richard E. Turner

University of Cambridge and Microsoft Research


Conditional Neural Processes (CNPs) are an approach to a very general problem: Given a finite set of observed x,y pairs, produce a model for an inferred function. The model includes uncertainty by viewing the function as a Gaussian random variable. CNPs build on the notion of deep sets (presented by Zaheer et al. at NIPS 2017) to embed each observed pair in a vector space, pool the embeddings, and feed the pooled embedding to a multilayer perceptron, along with new x values to predict. The output of the MLP is a mean value and variance of the expected output. Great! But the authors of this paper point out that the result leaves much to be desired: It often leads to both underfitting and a failure to extrapolate to new x,y training pairs. They propose to solve this shortcoming by embedding the x,y pairs in a space of functions, mapping x values to distances rather than absolute values, and feeding the result to a convolutional net rather than an MLP. These changes enable them to enforce an important inductive bias, translation equivariance: Inferred functions are similar for inputs that are similar aside from a simple translation. This bias holds for many popular applications of CNPs, such as, say, time series analysis (inferred functions for a set of points should not depend on the specific times at which the points occurred).
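
Translation equivariance is easy to check for a convolution. The sketch below illustrates only that property, using a circular 1-D convolution on integers so the comparison is exact; it is not an implementation of the paper’s model, and the signal and kernel values are arbitrary.

```python
def circular_conv(signal, kernel):
    """1-D convolution with wrap-around at the boundaries."""
    n = len(signal)
    return [sum(k * signal[(i + j) % n] for j, k in enumerate(kernel)) for i in range(n)]


def shift(signal, steps):
    """Translate a signal to the right by `steps` positions, wrapping around."""
    steps %= len(signal)
    return signal[-steps:] + signal[:-steps] if steps else signal[:]


signal = [0, 1, 3, 2, 0, 0, 0, 0]
kernel = [1, 2, 1]

shift_then_conv = circular_conv(shift(signal, 3), kernel)
conv_then_shift = shift(circular_conv(signal, kernel), 3)
print(shift_then_conv == conv_then_shift)  # True
```

Shifting the input and then convolving gives the same result as convolving and then shifting, which is the property an MLP applied to a pooled embedding does not give for free.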

Smooth markets: A basic mechanism for organizing gradient-based learners

David Balduzzi, Wojciech M. Czarnecki, Tom Anthony, Ian Gemp, Edward Hughes, Joel Leibo, Georgios Piliouras, Thore Graepel

DeepMind and Singapore University of Technology and Design


We are finding more and more ways in which training a neural network benefits from putting multiple networks in adversarial relationships. Generating pictures and videos with GANs, and protecting nets against attacks, are just a few examples. And we need not limit ourselves to two networks; competition against three or more networks at once can be useful. But we also know from game theory that n-player games are difficult to analyze and understand. The authors of this work explore how we can frame the sorts of games we might be interested in when training nets together. They formalize the notion of smooth markets, a class of games where i) all interactions are pairwise, ii) resources are exchanged but never created or destroyed, and iii) the objectives for each player are differentiable. It turns out that this class of games encompasses many of the various ways in which nets have been used to train one another. Moreover, such games have very nice properties: They have stable Nash equilibria, and the neighborhoods around these equilibria can be found by strategies like hill-climbing. Hopefully, delineating this class of games will lead to new ideas on how to combine the training of multiple adversarial nets.

Padé Activation Units: End-to-end Learning of Flexible Activation Functions in Deep Networks

Alejandro Molina, Patrick Schramowski, and Kristian Kersting

TU Darmstadt


An important building block of deep learning is the non-linearity introduced by the activation functions. Activation functions play a major role in the success of training deep neural networks, both in terms of training time and predictive performance. However, in practice, activation functions are chosen empirically following recommendations from prior art and applied to the entire network. In contrast, this paper proposes to make activation functions learnable. To this end, the authors introduce Padé Activation Units (PAUs). The Padé approximant is the best approximation of a function by a rational function with a polynomial of order m in the numerator and one of order n in the denominator. To use PAUs, one makes the coefficients of the two polynomials learnable and optimizes them seamlessly with the rest of the network. To limit the impact on training time, one PAU can be learned per layer, and the PAU coefficients are initialized with values that approximate standard activation functions.
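
A rational activation of this kind can be written in a few lines. The sketch below is an assumption-laden illustration rather than the authors’ code: the absolute value in the denominator is one way to keep it from reaching zero, and the coefficients are hand-picked here, whereas in a PAU they would be trained together with the network weights.

```python
def pade_activation(x, numerator, denominator):
    """Rational function P(x) / Q(x) with Q(x) = 1 + |b_1 x + ... + b_n x^n|.

    numerator holds a_0..a_m, denominator holds b_1..b_n.
    """
    p = sum(a * x ** j for j, a in enumerate(numerator))
    q = 1.0 + abs(sum(b * x ** (k + 1) for k, b in enumerate(denominator)))
    return p / q


# With no denominator terms and P(x) = x, the unit is the identity.
print(pade_activation(2.0, [0.0, 1.0], []))      # 2.0

# P(x) = x and Q(x) = 1 + |x| gives the softsign function x / (1 + |x|).
print(pade_activation(3.0, [0.0, 1.0], [1.0]))   # 0.75
print(pade_activation(-3.0, [0.0, 1.0], [1.0]))  # -0.75
```

Different coefficient choices recover or approximate different familiar activations, which is why initializing from a standard activation and then letting gradient descent adjust the coefficients is a natural starting point.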

From Variational to Deterministic Autoencoders

Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, and Bernhard Schölkopf



Generative models lie at the core of machine learning. They allow one to reason about data probabilistically, access and traverse the low-dimensional manifold the data is assumed to live on, and ultimately generate new data. Variational Autoencoders (VAEs) are a popular generative approach that optimize the reconstruction quality of encoded samples while encouraging the latent space to follow a fixed prior distribution. However, VAEs tend to strike an unsatisfying compromise between sample quality and reconstruction quality and can be difficult to train in practice. In this paper, the authors propose an alternative generative modeling framework that is simpler, easier to train, and deterministic, yet has many of the advantages of VAEs. They observe that sampling a stochastic encoder in a Gaussian VAE can be interpreted as simply injecting noise into the input of a deterministic decoder and then investigate how to substitute this noise injection mechanism with other regularization schemes in the proposed deterministic Regularized Autoencoders (RAEs). The authors proceed to equip RAEs with a generative mechanism via a simple ex-post density estimation step on the learned latent space so that they can be used as a drop-in replacement for many common VAE architectures. RAEs are compared to VAEs on several standard image datasets and on more challenging structured domains such as molecule generation.
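
The ex-post density estimation step can be sketched as follows. This is a deliberately simplified illustration: it fits an independent Gaussian per latent dimension, which is an assumption of this sketch rather than the paper’s exact choice of density model, and the latent codes are made up. The trained encoder and decoder are not shown.

```python
import random
import statistics


def fit_diagonal_gaussian(latents):
    """Fit an independent Gaussian to each dimension of the latent codes."""
    dims = len(latents[0])
    means = [statistics.mean([z[d] for z in latents]) for d in range(dims)]
    stds = [statistics.stdev([z[d] for z in latents]) for d in range(dims)]
    return means, stds


def sample_latent(means, stds, rng):
    """Draw one new latent code from the fitted density."""
    return [rng.gauss(m, s) for m, s in zip(means, stds)]


# Hypothetical latent codes produced by a trained deterministic encoder.
latents = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]]
means, stds = fit_diagonal_gaussian(latents)

new_code = sample_latent(means, stds, random.Random(0))
print(means, stds, new_code)
```

The sampled code would then be passed through the trained decoder to generate a new data point, so the autoencoder itself never needs a stochastic encoder or a prior during training.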

Insights from the keynote speakers

Machine Learning: Changing the Future of Healthcare

Mihaela van der Schaar

University of Cambridge, The Alan Turing Institute, and UCLA


Prof. Mihaela van der Schaar delivered a keynote presentation focused on the specific challenges that medical problems pose to machine learning techniques. In addition to the quality of the technical material, the work in the healthcare space by van der Schaar and her collaborators is an excellent example of how to conduct well-directed research, so that each individual piece of work not only advances the state of the art of a technical field but also builds on the previous one to attain a long-term objective. Among the many challenges that medicine poses to ML, some that stand out are: i) it is difficult to verify algorithmic performance in the real world because of the importance of counterfactuals, e.g. if an algorithm prescribes a treatment, even if the outcome is successful it is impossible to know whether another treatment would have produced a better outcome; ii) datasets are limited and biased; and iii) the outputs of an algorithm need to be understandable to many stakeholders with different goals, e.g. a patient might want to understand what factors contribute to a given prognosis, a researcher might want to empirically confirm patterns discovered by the algorithm, and a physician might want to understand risk factors to tailor a particular treatment. While van der Schaar’s work is focused on healthcare, it is not difficult to imagine how it can be generalized to other applications that involve dealing with individualized risk, such as insurance. Here is a list of the work featured in the keynote to address four important questions in healthcare applications:

How to issue predictions about healthcare outcomes?

How to provide interpretability to the healthcare models doing the predictions?

How to dynamically forecast disease progression?

How to individualize treatment effects to know when to treat, how and when to stop?

The Decision-Making Side of Machine Learning: Computational, Inferential, and Economic Perspectives

Michael I. Jordan

(UC Berkeley)


Prof. Michael Jordan delivered a keynote presentation at the intersection of economics, decision science and machine learning. The work of Jordan and collaborators on ML viewed through the lens of economics brought a perspective that is uncommon in technical ML gatherings like ICLR. Jordan argues that ML has matured as a science and is now a new field of engineering: Just as chemical engineering delivered the benefits of chemistry on an industrial scale, ML engineering is emerging to do the same. However, if we are to build systems of planetary scale with data, we need to think beyond a single system of predictions and more about how decision-making systems interact.

In the real world, making decisions by setting the threshold for a classifier or predictor is not enough. For decisions with real-world consequences, provenance, dialog and counterfactuals matter. Outcomes rarely come from a single decision; they usually come from an array of decisions by a single person or many people, all of which interact with each other.

Another insight driving Jordan’s work is that markets can be viewed as decentralized algorithms. They accomplish complex tasks like bringing goods into a city with incredible robustness. They adapt to changes in social and physical structure, are scalable, and can have a very long life. When working with ML algorithms, what happens when scarcity comes into play? What happens when a traffic recommender recommends the same best route to everyone? In these cases, what counts as a “good decision” depends on what other decision makers are doing, and a good ML decision-maker should model this interaction. Here is a list of the works featured in the keynote that aim to solve the challenges of deploying ML for consequential decisions in the real world:

Reflections from the Turing Award Winners

Yann LeCun and Yoshua Bengio

(NYU, Facebook AI Research, Mila, and Université de Montréal)


Yann LeCun, Yoshua Bengio, and Geoffrey Hinton were awarded the Turing Award in 2018 for their work in developing deep learning. In this keynote, LeCun and Bengio outlined their thoughts on the current state of the art in AI and the challenges that lie ahead. While it was interesting to hear them talk about future research directions, some of the most insightful thoughts came from the Q&A session that followed their keynote remarks. While deep learning has proved to be a formidable tool, contrary to media hype, it is unlikely that human-level AI is around the corner. Scaling supervised learning and reinforcement learning, which are the current pillars of the field, will not deliver rat- or cat-level intelligence, let alone human intelligence. LeCun believes that progress in the nascent field of self-supervised learning, and in particular regularized latent-variable energy-based models, could lead to animal-level intelligence; after all, animals learn through self-supervision. Bengio likes to frame progress through the lens of the work of behavioral economist and Nobel laureate Daniel Kahneman. The brain uses system 1 for fast thinking tasks that can be done unconsciously, and the current state of the art in deep learning is good at this. Conversely, system 2 is used for slow tasks that are conscious and require rational thinking, and this is what the next generation of deep learning algorithms must be able to do. A core ingredient for conscious processing is attention; other relevant topics are out-of-distribution generalization, systematic generalization, and modular low-dimensional knowledge representations that can be recombined on the fly. To tackle some of these challenges, the ML community needs to keep better track of the progress being made in neuroscience and psychology toward understanding intelligence.
Although concerns about the dangers that superhuman AI poses in the near future seem exaggerated, the community needs to be more attentive to the harm that current ML technology can already cause. Perhaps one of the most interesting insights came from the topic of explainability and how to help humans trust ML systems that are not interpretable. While there might be cases where explainability is fundamental, in general there should be no distinction between trusting ML systems and trusting human beings. We trust people because they seem to be good at the task at hand. There are many things we accept in society today for which we don’t have explanations. A great example is that lithium is currently the best drug we have to control bipolar disorder, and yet we don’t know how it works. However, decades of empirical scientific results in millions of people have shown us that it is an effective treatment. Accordingly, robust, extensive testing is a better model for long-term trust than elusive explainability.