Source: Deep Learning on Medium
Thoughts on “A Critique of Pure Learning”, Zador (2019)
“A critique of pure learning” recently proposed a new framework for developing artificial neural networks (ANNs) capable of rapid learning and animal-like behaviour. Its argument is loosely structured as follows:
(1) Animal brains are not blank slates because animals display impressive behaviours shortly after birth.
(2) These behaviours must largely be a result of learning on the evolutionary time scale, since there’s too little time (and hence data) to learn these behaviours quickly after birth.
(3) Since the genome is too small to literally encode all the synaptic weights of animal brains, evolutionary learning must instead entail encoding of simple and generic wiring instructions and plasticity rules.
(4) Therefore, wiring topology and network architecture should be a target for optimization in artificial systems.
On the surface, the conclusion that wiring topology and architecture should be a target for optimization is one that most AI researchers working with neural networks would agree with. Although neural networks are universal function approximators, in practice we do not have infinite time and infinite data. So, we move beyond neural networks with a single, wide hidden layer, and engineer inductive biases to coax efficient learning of the correct input-output mapping. These inductive biases can manifest as the type of loss function, the distribution of training data on which we train, and, indeed, the wiring topology or architecture.
Nonetheless, I think this conclusion is still contentious. This is because of the implicit claim that AI is being held back by not looking harder at the brain’s wiring topologies and hence by not building better architectures. We should instead, apparently, fervently embrace inductive bias engineering. And if we do, perhaps in a way that’s inspired by animal brains, then perhaps our AI systems would be much better than they are currently.
While I agree with the general fact that AI practitioners should pay attention to and research different wiring topologies and architectures (as opposed to ignoring them outright), I believe the paper’s prescription to place more emphasis on them to be misguided. In the sections below I will enumerate the reasons, which are roughly the following: 1) it too quickly dispenses with the relation between evolutionary learning at the level of wiring topology and weight-based learning across generations; 2) it vastly underestimates the quality of data real-world animals receive; and 3) it overestimates the importance of architecture, while underestimating a number of other factors that are just as important (if not more) for developing AI.
Just because there isn’t enough capacity in the genome to encode all the synaptic weights in the brain, it doesn’t mean that the genome encodes zero information about weights.
Zador’s argument is similar to the classic “innate” vs. “learned” debates of the past, with one twist. Pure learning advocates often justify the need to train ANNs on millions of examples for millions of gradient steps by noting that animals profit from learning in millions of ancestors over millions of years of evolution. ANNs are just making up for this lost learning experience, it’s said, so we cannot directly compare learning efficiency in naive ANNs to naive animals. Zador’s information theoretic argument states that evolution literally cannot encode synaptic weights because there’s not enough space in the genome. Therefore, learning in ANNs (which is weight-tuning) is not equivalent to learning in animals over evolutionary time scales (which is not weight-tuning). We must concede, then, that evolutionary learning, as manifest in the genome, most probably results in encodings of general architectural motifs, wiring topologies, and plasticity rules. Thus, ANN training is indeed excessive compared to that observed in animals, and if we were to instead uncover these “innate” architectural motifs, we’d be able to improve ANNs.
There are two main problems I see with this argument: (1) the assumption that full-fidelity weight information, requiring 10¹⁵ bits, needs to be preserved in the genome for weight-based learning to occur across generations, and (2) the assumption that genes do not encode a low-resolution version of the synaptic weights, using wiring topology as the substrate for reinstantiating them across generations.
There is a recent burst of research showing that ANN performance can be relatively well preserved even after significant interventions, such as distillation into networks a fraction of the size. In a similar vein, new theoretical results show that only small sub-networks are ultimately important for final performance, to the point that pruning even 90% of a network’s connections is possible. Moreover, there is preliminary work showing that it may be possible to use just signed weight information (not magnitudes) in feedback connections to accomplish backpropagation-like learning.
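To give a feel for what "pruning 90% of connections" and "keeping only signs" mean mechanically, here is a minimal sketch. It applies one-shot magnitude pruning and sign extraction to a vector of random weights. The helper names `prune_by_magnitude` and `sign_only` are hypothetical, and this is not the procedure used in the research alluded to above, which works with trained networks and typically involves retraining. It only illustrates how much less information the reduced representations carry.

```python
import random


def prune_by_magnitude(weights, keep_fraction):
    """Zero out all but the largest-magnitude fraction of the weights."""
    n_keep = max(1, round(len(weights) * keep_fraction))
    # Rank indices by absolute weight value, largest first, and keep the top n_keep.
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(ranked[:n_keep])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


def sign_only(weights):
    """Discard magnitudes: map each weight to +1, -1, or 0."""
    return [(w > 0) - (w < 0) for w in weights]


rng = random.Random(0)
dense = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for trained weights

pruned = prune_by_magnitude(dense, keep_fraction=0.1)  # drop 90% of connections
signs = sign_only(pruned)                              # keep direction only

n_nonzero = sum(1 for w in pruned if w != 0.0)
print(f"connections kept: {n_nonzero} of {len(dense)}")
```

Whether behaviour survives such a reduction is an empirical question about a particular trained network. The sketch only shows the operations themselves.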
Even further, we should acknowledge that humans are terrible judges of the number of bits needed to encode a particular behaviour. As recent results in robotics and motor control show, seemingly impressive behaviours (or approximations that get pretty close) are possible with simple networks that can be encoded with a small number of bits. Indeed, these networks may even be small enough to be encoded in the genome. And the converse is also true: there may be behaviours that take a long time to emerge in an animal’s lifetime and seem “simple”, but require an enormous number of bits to encode. In short, humans can’t reliably judge the information content of a network implementing a behaviour, and so we should be very careful about the conclusions we draw from these types of arguments.
All these results suggest that rough approximations of small networks may be enough to get most of the way to impressive behaviour. Thus, we do not need to save 10¹⁵ bits of information for weight-based learning across generations to be useful, or even possible. Rather, genomes just need to encode rough approximations of tuned networks.
So how can a genome encode these approximations? By encoding wiring topology! Defining a wiring topology *is* defining weight connectivity, even if it is just doing so roughly, approximately, or at low resolution compared to literally saving every synaptic weight. Any statement about wiring topology is necessarily a statement about the strength of influence between neurons. It may be of lower fidelity than the set of precise weights in a fully connected ANN, but it nonetheless defines one neuron’s influence on the other. Thus, the wiring topology may implement a low resolution version of what literally preserving and reinstantiating synaptic weights in a fully connected network would implement, and this low resolution version may be sufficient for across-generation, weight-based learning to occur, while requiring much less than the claimed 10¹⁵ bits.
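A back-of-the-envelope sketch can make the bit-counting concrete. It compares the 10¹⁵ bits quoted above with a coarse wiring rule that stores a single strength value for each ordered pair of neuron populations. The population count, the bits per entry, and the genome budget (roughly 3 billion base pairs at 2 bits each) are illustrative assumptions of this sketch, not figures from the paper. A population-level rule is also only one hypothetical way a topology could act as a low-resolution weight matrix.

```python
# Back-of-the-envelope bit budget. All constants are illustrative assumptions.
FULL_WEIGHT_BITS = 10**15      # figure quoted in the text for full-fidelity weights
GENOME_BITS = 3 * 10**9 * 2    # assumed: ~3e9 base pairs, 2 bits per base pair


def population_rule_bits(n_populations: int, bits_per_entry: int) -> int:
    """Bits needed for one coarse strength value per ordered pair of populations."""
    return n_populations * n_populations * bits_per_entry


# Hypothetical resolution: 1000 populations, 8 bits per pairwise strength.
coarse_bits = population_rule_bits(n_populations=1000, bits_per_entry=8)

print(f"full-fidelity weights: {FULL_WEIGHT_BITS:.1e} bits")
print(f"assumed genome budget: {GENOME_BITS:.1e} bits")
print(f"coarse wiring rule:    {coarse_bits:.1e} bits")
print(f"fits in genome budget: {coarse_bits <= GENOME_BITS}")
```

Under these assumed numbers the coarse rule is many orders of magnitude smaller than the full-fidelity figure, which is the sense in which a topology can carry weight information at low resolution.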
This introduces a question about the algorithm for weight-based learning through evolution. Obviously this is an open question, but something like weight- or node-perturbation (with the weights or nodes being perturbed as a result of random genetic mutations’ effect on wiring topologies, and the reinforcing signal being fitness) seems not entirely impossible. These algorithms suffer from scaling issues, but when you consider the massive parallelization possible across populations of organisms, it might not be crazy.
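As a rough illustration of what perturbation-plus-fitness learning could look like, here is a minimal sketch of a simple elitist evolution strategy: each generation, a population of offspring receives random perturbations of the parent’s weights, and the fittest individual is carried forward. This is a stand-in technique chosen for illustration, not a claim about how evolution actually works, and the toy `fitness` function, the `target` vector, and all parameter values are hypothetical.

```python
import random


def fitness(weights, target):
    """Toy fitness: negative squared distance to a hypothetical 'well-tuned' weight vector."""
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


def evolve(target, population_size=50, generations=200, sigma=0.1, seed=0):
    """Elitist evolution strategy: perturb the parent, keep the best individual found."""
    rng = random.Random(seed)
    parent = [0.0] * len(target)  # a "naive" starting network
    for _ in range(generations):
        # Each offspring is the parent plus random "mutations" to its weights.
        # The population could be evaluated in parallel, as across organisms.
        offspring = [
            [w + rng.gauss(0.0, sigma) for w in parent]
            for _ in range(population_size)
        ]
        best = max(offspring, key=lambda w: fitness(w, target))
        # Selection: fitness is the only reinforcing signal.
        if fitness(best, target) > fitness(parent, target):
            parent = best
    return parent


target = [0.5, -1.0, 2.0]
start_fitness = fitness([0.0] * len(target), target)
evolved = evolve(target)
end_fitness = fitness(evolved, target)
print(f"fitness before: {start_fitness:.4f}, after: {end_fitness:.4f}")
```

No gradients are computed anywhere. The search relies only on random variation and selection, which is why such methods scale poorly per individual but parallelize naturally across a population.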
So to sum up, while weights might not be literally encoded in the genome, there is no such thing as a genome that encodes wiring topologies without encoding some information about weight connectivity, since wiring topologies do not exist in the abstract. Perhaps what separates my view from Zador’s here is this: I believe that the inability to save all the precise weights in the genome doesn’t imply that the genome encodes no useful information about the weights of a network, nor does it imply that weight-based learning across evolution is impossible. Thus, ANNs may indeed need to spend some time “making up for” learning that occurred in animals at the evolutionary timescale.
(Small note: Zador can still argue that there may only be a small set of understandable topologies and motifs. But this is conjecture, and the converse cannot be ruled out using his genomic bottleneck ideas, for the reasons stated above.)
Underestimation of the quality of data animals receive
Zador dismisses the idea that unsupervised learning has a large effect on animal behaviour. This dismissal is based in part on the idea that the genome literally does not encode weights, and therefore impressive animal behaviours must be a result of innate architectures and non-weight-based learning. I explained in the previous section why I think this logic is incorrect, but let’s assume it’s alright for now.
Comparisons are often made (in this paper and others) between animals and ANNs learning on supervised datasets. As the comparisons go, these ANNs need millions of labelled examples, vastly more than animals receive by the time they exhibit impressive behaviours. Unfortunately this is an apples-to-oranges comparison. Animals receive a glut of extremely high quality data that reveals orthogonal factors of variation, unlike static sets of images, which are filled with spurious correlations that entrap ANNs. No number of training steps can make up for impoverished data.
Animals, unlike ANNs, move about their world, experience changing lighting conditions, changing perspectives, and changing object sizes. They also receive multi-modal information; so not only do they perceive things visually, they form connections between visual and auditory streams, tactile information, taste, and smell. Even more, the learned representations are grounded by task relevance, which may be encoded by innate or social reinforcement learning signals. The literature on embodied cognition explains why such grounding may be immensely important for learning.
It’s not unreasonable to think that unsupervised (and reinforcement) learning can quickly profit from such a rich data stream. But to hedge my claims a bit, I’ll just state that we cannot immediately rule out the idea that unsupervised and reinforcement learning play a large role, and that they can have a rapid influence on animal behaviour within even just a few hours after birth.
We have zero proof that (potentially embodied) ANNs learning on an equivalently rich stream of data cannot exhibit behaviours similar to animals. We only have proof that ANNs learning from an abundance of massively impoverished data do not.
Overestimation of the importance of architecture
The previous section alluded to the main thrust of this counter-argument: paying attention to architecture is important, and necessary, but it is far from being sufficient when developing AGI. I’m not sure any legitimate AI researcher thinks that AGI can be achieved from simple supervised learning on static datasets, or that animal-like vision can be achieved by tweaking ResNets trained on ImageNet.
Instead, we need to embrace ideas from ecology, ethology, robotics, psychology, classical AI, deep learning, physics, statistics, you name it. We need ideas from reinforcement learning, multi-agent interaction, communication, unsupervised learning, knowledge distillation and dissemination, embodiment, multi-modality, and many, many others. Tweaking architectures is just one part of the puzzle, and it may in fact be a small one.
Ultimately the main thrust of Zador’s argument is correct because we should indeed pay attention to architectures and innate wiring topologies. However, I think it’s a bit imprecise because it does not acknowledge the relation between learned wiring topologies and “saving weights” across generations, and hence the fact that weight-based learning is possible across generations. Zador’s argument also underestimates the power of a rich data stream, and assumes that the amount of training an ANN needs on impoverished data can be compared to what a more sophisticated, embodied system (a biological brain and body) can learn using real-world data. Finally, Zador overestimates the importance of architecture over a number of other incredibly important factors needed on the way to AGI.
A small ending note on the genomic bottleneck
The genome does not exist in a vacuum. A genome in a dish will not produce a mountain climbing ibex. The genome needs to interact with a very particular environment to do anything meaningful (i.e., transcribe and translate proteins that actually amount to anything), and to ultimately produce an intelligent animal.
This point makes information-theoretic arguments about the number of bits in the genome a bit strange. Sure, there may be N bits in the genome, but the complexity of the “decoder” may be enormous, making these N bits essentially meaningless outside the context of an extremely complex environment in which they’re “decoded” [Bear with me on this encoding/decoding analogy. It’s not one I would use personally, but I’d like to try to be consistent with the paper’s logic]. So, it’s not merely that genomes are tuned to the environment in which they exist; rather, genomes literally require rich, complex, and particular environments to be of any use whatsoever. Zador most likely agrees with this. And indeed, he is not arguing for us to examine and learn from the genome per se. Rather, he’s using an information-theoretic argument to show that the genome literally cannot encode synaptic weights, so it must encode something else; something powerful enough to allow an ibex to scale a mountain shortly after birth.
But this observation also means that relatively few bits can have an enormous effect because of the role of the real-world “decoder”. ANNs currently do not take advantage of such a powerful interaction, but their story may change if they ever do.