Source: Deep Learning on Medium
Yoshua Bengio provides an interesting perspective on the function of consciousness in his very concise and readable paper, “The Consciousness Prior.” Here I provide a short overview of how the consciousness prior works and how it might be beneficial in solving some of the learning challenges humans face. The ideas here are a mix of my own and those from the paper.
First, the name requires some explanation. Bengio refers to consciousness in the sense of awareness or attention rather than qualia. Our consciousness seems to be characterized by the awareness of a small number of high-level/abstract features that are relevant to the task at hand, and it is this aspect of consciousness that Bengio is concerned with modelling.
The term ‘prior’ in ‘consciousness prior’ can be thought of as essentially an inductive bias. The neural architecture underlying consciousness has a certain inductive bias that favors some hypotheses over others, and this inductive bias can be thought of as a prior distribution over hypotheses. This is a point that I think should be made more often: when viewing any ML algorithm from a Bayesian perspective, the inductive bias of that algorithm implicitly specifies a prior distribution over all possible hypotheses.
The consciousness prior theory models the unconscious functions of the brain as an RNN whose output is a high-dimensional vector h_t — called the “representation state” — that captures the full information available to the agent in an abstracted form in which explanatory factors are disentangled:

h_t = F(s_t, h_(t-1))

where s_t is the observation state at time t, h_(t-1) is the representation state from the previous time step, and F is the representation RNN.
The conscious functions of the brain are modeled by another RNN, C, that aggregates a small number of elements of h_t into a complex conscious thought represented by the output vector c_t:

c_t = C(h_t, c_(t-1), z_t)

where z_t is a random noise source that adds some randomness as to which elements of h_t get admitted to consciousness, encouraging exploration. I could imagine that someone with ADD might have a strong noise parameter.
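To make the two maps concrete, here is a minimal numpy sketch of one possible instantiation. The paper only specifies the functional forms h_t = F(s_t, h_(t-1)) and c_t = C(h_t, c_(t-1), z_t); the single-layer tanh cell, the noisy top-k selection, the carry-over from the previous conscious state, and all dimensions below are my own illustrative assumptions, not the paper’s architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

H, K = 64, 4  # representation size and number of consciously attended elements (illustrative)

# Illustrative fixed weights for the representation RNN; a real model would learn these.
W_s, W_h = rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (H, H))

def F(s_t, h_prev):
    """Representation RNN: maps the observation and previous state to h_t."""
    return np.tanh(W_s @ s_t + W_h @ h_prev)

def C(h_t, c_prev, z_t, k=K):
    """Consciousness RNN: noisily selects k elements of h_t to form c_t.
    The noise z_t perturbs the selection scores, encouraging exploration."""
    scores = np.abs(h_t) + z_t           # salience of each element, plus noise
    top_k = np.argsort(scores)[-k:]      # indices of the k most salient elements
    c_t = np.zeros_like(h_t)
    c_t[top_k] = h_t[top_k]              # conscious thought: a sparse slice of h_t
    return 0.9 * c_t + 0.1 * c_prev     # small carry-over from the previous thought

h, c = np.zeros(H), np.zeros(H)
for t in range(5):
    s = rng.normal(size=H)               # stand-in for an observation s_t
    h = F(s, h)
    z = rng.normal(scale=0.5, size=H)    # noise source z_t
    c = C(h, c, z)

print(np.count_nonzero(c))  # c_t stays sparse: only a handful of the 64 elements are active
```

The key structural point is simply that c_t is a low-dimensional, sparse selection from the high-dimensional h_t, with noise deciding the ties.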
Bengio does not discuss how consciousness prior theory relates to other theories in psychology, but some relations stand out. It seems to play a role similar to working memory, in which only a limited number of highly relevant abstract features are present at a time. Another possibly useful relation can be made with predictive coding, which views the process of devoting attention as increasing the gain/importance/precision of some specific prediction error.
So what is the consciousness prior good for? It generates a low-dimensional space at a high level of abstraction, and below I list several possible advantages that this could provide (many of these points overlap, but it might be useful to state the same idea in different ways):
- Increased precision. Abstractions are formed by throwing away details, that is, by reducing variance and increasing precision; more abstract representations are therefore more precise. As Dijkstra put it, “Being abstract is something profoundly different from being vague … The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise.” This builds a bridge between modern AI techniques such as deep learning and classical symbolic AI, which deals with precise representations of knowledge and rules. This is a great article that explores the idea further.
- Curse of dimensionality. The consciousness prior could be thought of as a very advanced form of feature selection that makes downstream learning processes easier by reducing dimensionality.
- Optimized search. A small step at a high level of abstraction corresponds to a large step at a low level. For example, suppose you need to make a PB & J. The search space for a solution at the lowest level of abstraction — the level of sequences of individual motor commands — is massive. Searching for a solution only becomes tractable by chunking low-level commands into high-level commands such as ‘opening the fridge’. This is commonly known as temporal abstraction.
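To put rough numbers on the PB & J example (all of them invented purely for illustration, not taken from the paper):

```python
# Back-of-the-envelope: how chunking low-level commands shrinks the search space.

primitive_actions = 10   # e.g. distinct individual motor commands available at each step
plan_length = 20         # primitive steps needed to make the sandwich
low_level_space = primitive_actions ** plan_length   # sequences to search at the lowest level

macro_actions = 5        # e.g. 'open fridge', 'get bread', 'spread PB', 'spread J', 'assemble'
macro_length = 4         # each macro expands to roughly 4 primitive steps
high_level_space = macro_actions ** (plan_length // macro_length)

print(f"{low_level_space:.0e} vs {high_level_space}")  # 1e+20 vs 3125
```

Even with generous assumptions, planning over macros turns an astronomically large search space into a trivially small one, which is the whole force of the temporal-abstraction argument.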
- Blessing of abstraction and bias-variance tradeoff. Read this and this. In short, increased abstraction means larger sample sizes, or at least larger ‘effective’ sample sizes, since more observations count as instances of the same abstract event.
- Generalization. Generally, the more abstract a concept is, the more real-world instantiations it has (the concept of dog has fewer instances than the concept of animal), which means that more abstract knowledge is more broadly applicable. No one ever steps in the same river twice, but somehow we use past experience to make predictions about the future. We seem to do this by searching for shared underlying abstractions (that is, analogies) between past experience and the present situation. This is how knowledge can be transferred, and it is why I like to think of abstractions as kinds of unifications. This might explain humans’ seemingly magical ability to solve few-shot/zero-shot learning problems (although the prior knowledge could also have been built in by evolution).
- The right level of abstraction. Abstractions are invariant to the details they abstract away, and there are many details about the world that are simply inconsequential to an agent’s goals and can safely be ignored. In other words, there are many concrete outcomes that would qualify as satisfying a goal, and it doesn’t matter which one actually happens. The right level of abstraction is the level that is invariant to these inconsequential details, and this is presumably the level that the consciousness prior selects. In the paper, Bengio provides a good example: it would be very difficult to predict how exactly a tower of blocks might fall, if it falls at all, but humans are pretty good at predicting merely whether the blocks will fall or not. As a general principle, the more abstract your predictions, the easier it is to be right. Making abstract qualitative predictions (such as that the price of Amazon stock will go up) is easier than making concrete quantitative predictions.
- Transmission of knowledge. The low dimensionality of c_t allows ideas to be easily transmitted between humans, who naturally have limited bandwidth for communication. Bengio speculates that the need to communicate c_t acts as another constraint that sharpens representations. Fuzzy ideas aren’t easily communicated and become degraded with each successive transmission (think of taking a picture, then taking a picture of the picture, and so on). Here are some really cool slides, also from Bengio, about how deep learning relates to cultural evolution and the transmission of knowledge.