Transcript of the AI Debate

Source: Deep Learning on Medium

At Mila in Montreal, on Monday, December 23, 2019, from 6:30 PM to 8:30 PM (EST), Gary Marcus and Yoshua Bengio debated on the best way forward for AI.

5,225 tickets were sold for the international live streaming event. There was quite a Twitter storm after the #AIDebate. ZDNet described the event organized by MONTREAL.AI as a “historic event”.

Slides, readings and more on the MONTREAL.AI debate webpage.

Opening Address | Vincent Boucher — 3 min.

Good Evening from Mila in Montreal Ladies & Gentlemen,

Welcome to the “AI Debate”.

I am Vincent Boucher, Founding Chairman of Montreal.AI.

Our participants tonight are Professor GARY MARCUS and Professor YOSHUA BENGIO.

Professor GARY MARCUS is a Scientist, Best-Selling Author, and Entrepreneur. Professor MARCUS has published extensively in neuroscience, genetics, linguistics, evolutionary psychology and artificial intelligence and is perhaps the youngest Professor Emeritus at NYU. He is Founder and CEO of Robust.AI and the author of five books, including The Algebraic Mind. His newest book, Rebooting AI: Building Machines We Can Trust, aims to shake up the field of artificial intelligence and has been praised by Noam Chomsky, Steven Pinker and Garry Kasparov.

Professor YOSHUA BENGIO is a Deep Learning Pioneer. In 2018, Professor BENGIO was the computer scientist who collected the largest number of new citations worldwide. In 2019, he received, jointly with Geoffrey Hinton and Yann LeCun, the ACM A.M. Turing Award — “the Nobel Prize of Computing”. He is the Founder and Scientific Director of Mila — the largest university-based research group in deep learning in the world. His ultimate goal is to understand the principles that lead to intelligence through learning.

Diagram of a 2-layer Neural Network

The diagram shows the architecture of a 2-layer Neural Network.

“You have relatively simple processing elements that are very loosely models of neurons. They have connections coming in, each connection has a weight on it, and that weight can be changed through learning.” — Geoffrey Hinton

Deep learning uses multiple stacked layers of processing units to learn high-level representations.

Professor MARCUS thinks that expecting a monolithic architecture to handle abstraction and reasoning is unrealistic.

Professor BENGIO believes that sequential reasoning can be performed while staying in a deep learning framework.

Our plan for the evening

An opening statement by Gary Marcus and by Yoshua Bengio; followed by a response, an interview with Yoshua Bengio & Gary Marcus; then our guests will take questions from the audience here at Mila; followed by questions from the international audience.

Agenda : The Best Way Forward For AI

This AI Debate is a Christmas gift from MONTREAL.AI to the international AI community. The hashtag for tonight’s event is: #AIDebate

International audience questions for Gary Marcus and Yoshua Bengio can be submitted via the web form on

MONTREAL.AI is grateful to Mila and to the collaborative Montreal AI Ecosystem. That being said, we will start the first segment.

Professor Marcus, you have 20 minutes for your opening statement.

Opening statement | Gary Marcus — 22 min.

Thank you very much.

And of course the AV doesn’t work. Hang on.

Before we started Yoshua and I were chatting about how AI was probably going to come before AV. He made some excellent points about his work on climate change and how if we could solve the AV problem it would actually be a good thing for the world.

Last week at NeurIPS

So, this was Yoshua and I last week at NeurIPS at a party having a good time. I hope we can have a good time tonight. I don’t think either of us is out for blood but rather for truth.


An overview of what I’m going to talk about today. I’m going to start with a bit of history and a sense of where I’m coming from.

I’m going to give my take on Yoshua’s view. I think there are more agreements than disagreements, but I think the disagreements are important and we’re here to talk about them. And then I’ll give my prescription for going forward.

Part I: how I see AI, deep learning, and current ML, and how I got here

The first part is about how I see AI, deep learning and current machine learning and how I got here. It’s a bit of a personal history of cognitive science and how it feeds into AI. And, you might think: “What’s a nice cognitive scientist like me doing in a place like Mila?”.

A cognitive scientist’s journey, with implications for AI

Here’s an overview. I won’t go into all of it, but there are some other things that I have done that I think are relevant to AI. The important point is that I am not a machine learning person by training. I’m actually a cognitive scientist by training. My real work has been in understanding humans and how they generalize and learn. I’ll tell you a little bit about that work going back to 1992 and a little bit all the way up to the present.

But first, I’ll go back even a little bit before that, to a pair of famous books that people call the PDP bible. Not everybody will even know what PDP is, but it’s a kind of ancestor to modern neural networks. Vince showed one, Yoshua will surely be talking about many, and the one I have on the right is a simplification of a neural network model that tries to learn the English past tense.

1986: Rules versus connectionism (neural networks)

This was part of a huge debate. In these two books, I think the most provocative paper, certainly the one that has stuck with me for 30 years (which is pretty impressive, to have a paper stick with you for that long), was a paper about children’s overregularization errors. So, kids say things like breaked and goed some of the time. I have two kids so I can testify that this is true. It was long thought to be an iconic example of symbolic rules. So, if you read any textbook until 1985, it would say: “children learn rules”. For example, they make these overregularization errors. And what Rumelhart and McClelland showed brilliantly was that you can get a neural net to produce these outputs without having any rules in it at all.

So, this created a whole field that I would call “Eliminative Connectionism”: using neural networks to model cognition without having any rules in them. And this so-called great past tense debate was born from this. And it was a huge war across the cognitive sciences.

the debate

By the time I got to graduate school, it was all that people wanted to talk about. On the one hand, up until that point, until that paper, most of linguistics and cognitive science was couched in terms of rules. So the idea was that you learn rules, like a sentence is made of a noun phrase and a verb phrase. So, if you’ve ever read Chomsky, lots of Chomsky’s earlier works look like that. And most AI was also all about rules. So expert systems were mostly made up of rules. And here Rumelhart and McClelland argued that we don’t need rules at all, forget about it. They didn’t prove it, but they showed that even a child’s error like breaked might in principle be the product of a neural network, where you have the input at the bottom and the output at the top and you tune some connections over time, and that this might in principle give you generalization that looks like what kids were doing.

1992: Why do kids (sometimes) say breaked rather than broke?

On the other hand, they hadn’t actually looked at the empirical data. So I got myself to graduate school to work with Steve Pinker at MIT, and what I looked at were these errors. I did, I think, the first big data analysis of language acquisition. I was one of the first to write shell scripts on Unix SPARCstations, and I looked at 11,500 child utterances.

The argument that Pinker and I made was that neural nets weren’t making the right predictions about generalization over time and particular verbs and so forth. If you care, there’s a whole book that we wrote about it (Marcus et al. (1992, SRCD Monographs); see also Pinker’s Words and Rules).

What we argued for was a compromise. We said it’s not all rules, like Morris Halle (he was on my PhD thesis committee) liked to argue, and we said it wasn’t all neural networks, like Rumelhart and McClelland did. We said it was a hybrid model that best captured the data. A rule for the regulars, so walk is inflected as walked: you add the “ed” for the past tense. Neural networks for the irregulars, so this is why you say sing/sang, but it might generalize to spling/splang, which sounds similar. And then the reason why children made overregularization errors, we said, is that the neural network didn’t always produce a strong response. If you have a verb that didn’t sound like anything you’ve heard before, you’d fall back on the rule.
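The division of labor described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: a small lookup table stands in for the associative memory of irregulars (so it does not reproduce similarity-based generalization such as spling/splang), and the names `IRREGULARS`, `recall_strength` and `past_tense` are hypothetical, not taken from the talk or the book.

```python
# Hypothetical sketch: a lookup table stands in for the associative
# (neural) memory of irregular forms; it is not a learned model.
IRREGULARS = {"sing": "sang", "break": "broke", "go": "went"}


def regular_rule(stem):
    """The rule for regulars: add "ed" to the stem."""
    return stem + "ed"


def past_tense(stem, recall_strength=1.0, threshold=0.5):
    """Use the memorized irregular if it is retrieved strongly enough;
    otherwise fall back on the rule."""
    remembered = IRREGULARS.get(stem)
    if remembered is not None and recall_strength >= threshold:
        return remembered
    return regular_rule(stem)
```

With a weak recall signal, `past_tense("break", recall_strength=0.2)` falls through to the rule and yields the overregularized form "breaked", while `past_tense("break")` yields "broke".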

1998: Extrapolation & Training Space

So, that was the first time I argued for hybrid models, back in the early 1990s. In 1998, or even a little bit before, I started playing a lot with the network models.

There’s been a lot written about them, and I wanted to understand how they work, so I started implementing them and trying them out. And I discovered something about them that I thought was very interesting, which is: people talked about them as if they learn the rules in the environment, but they didn’t always learn the rules. At least not in the sense that a human being might. So, here’s an example: suppose I taught you the function f(x) = x, or you can think of x = y + 0, or different ways to think about it. So, you have inputs like 0110, a binary number, and the output is the same thing, and you do this on a bunch of cases. Then your neural net learns something but also makes some mistakes. So, if you give it an odd number, which is what I have here at the bottom, after giving it only even numbers, it doesn’t come up with the answer that a human being would. And so, I described this in terms of something called the training space. So, let’s say the yellow examples are the things that you’ve been trained on, and the green ones are the things that are nearby in the space of the ones you’ve been trained on. The neural network generally did very well on the yellow ones and not so well on the ones that were outside the space.
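A minimal sketch of the even/odd identity example follows. It is not the model from the 1998 paper: it assumes a single-layer sigmoid network on 4-bit inputs, with weights initialized to zero so that the run is deterministic, and all names (`to_bits`, `train`, `copy_bits`) are hypothetical. It only illustrates why a unit that never sees a 1 on the last bit during training has no reason to copy it.

```python
import math

N_BITS = 4


def to_bits(n):
    """Most significant bit first, e.g. 6 -> [0, 1, 1, 0]."""
    return [(n >> i) & 1 for i in reversed(range(N_BITS))]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, biases, x):
    """One sigmoid output unit per bit, fully connected to the input."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def train(examples, epochs=500, lr=0.5):
    """Gradient descent on cross-entropy loss; the target is the input."""
    weights = [[0.0] * N_BITS for _ in range(N_BITS)]
    biases = [0.0] * N_BITS
    for _ in range(epochs):
        for x in examples:
            y = predict(weights, biases, x)
            for j in range(N_BITS):
                err = y[j] - x[j]
                for i in range(N_BITS):
                    weights[j][i] -= lr * err * x[i]
                biases[j] -= lr * err
    return weights, biases


# Train only on even numbers: the last input bit is always 0.
evens = [to_bits(n) for n in range(0, 16, 2)]
weights, biases = train(evens)


def copy_bits(n):
    """Round each output unit to 0 or 1."""
    return [round(p) for p in predict(weights, biases, to_bits(n))]
```

In this sketch the even inputs are copied correctly, but an odd input such as 7 (0111) comes back as 0110: the weights attached to the last input bit never receive a gradient, because that bit was always 0 in training, and the last output unit has only ever been pushed toward 0.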

So, near perfect at learning specific examples, good at generalizing in the cloud of points around them, and poor at generalizing outside that space. I wrote this up in Cognitive Psychology (Marcus (1998, Cognitive Psychology)), after having some battles with the reviewers (we can talk about it some time later), and the conclusion was that the class of connectionist models that is currently popular couldn’t learn to extend universals outside of the training space.

In my view this is the thing that I’m the most proud of having worked on. Some details for later…

1999: Rule learning in 7 month old infants

This led me to some work on infants. What I was trying to argue is that even infants could make the kind of generalizations that were stymieing the neural networks of that day. So, it was a direct, deliberate test of outside-the-training-space generalization by human infants. So, the infants would hear sentences like “la ti ti” and “ga na na” (I read these to my son yesterday and he thinks they are hilarious, he is almost 7) and then we tested them on new vocabulary. There would be sentences like “wo fe fe” or “wo wo fe”. So one of those has the same grammar that the kids had seen before and the other one has a different grammar. Because all the items were new, you couldn’t use some of the more statistical techniques that people thought about, like transitional probabilities, and it was a problem for early neural networks.
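One way to see what "same grammar, new vocabulary" means is to write the grammar as a pattern over variables. The sketch below is an illustration only, not a model of how infants learn; the function name `matches` and the pattern strings "ABB" and "AAB" are assumptions introduced here.

```python
def matches(pattern, syllables):
    """Check a syllable sequence against a pattern of variables.

    Each letter in the pattern is a variable: the same letter must bind
    to the same syllable, and different letters to different syllables.
    """
    if len(pattern) != len(syllables):
        return False
    bindings = {}
    for var, syllable in zip(pattern, syllables):
        # Bind the variable on first sight; afterwards the binding must hold.
        if bindings.setdefault(var, syllable) != syllable:
            return False
    return len(set(bindings.values())) == len(bindings)
```

Because the check is stated over variables rather than over particular syllables, `matches("ABB", "wo fe fe".split())` holds even though none of those syllables appeared in the familiarization sentences, while `matches("ABB", "wo wo fe".split())` does not.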

The conclusion was that infants could generalize outside the training space, where many neural nets could not. And I argued that this should be best characterized as learning algebraic rules. It has been replicated a bunch of times and it led to my first book, which is called “The Algebraic Mind”.

2001: The Algebraic Mind

The idea was that humans could do this kind of abstraction. I argued that there were three key ingredients missing from multilayer perceptrons:

  1. the ability to freely generalize abstract relations as the infants were doing;
  2. the ability to robustly represent complex relations like the complex structure of a sentence; and
  3. a systematic way to track individuals separately from kinds.

We will talk about the first two today and probably not about the third. And I argued that this undermines a lot of attempts to use multilayer perceptrons as models of the human mind.

I wasn’t really talking about AI, I was talking about cognition. Such models, I argued, simply can’t capture the flexibility and power of everyday reasoning.

2001: symbol-manipulation

And the key components of the thing I was defending, which I called symbol-manipulation (I didn’t invent it, but I tried to explicate it and argue for it), are variables, instances, bindings and operations over variables. You can think of algebra, where you have a variable like x, you have an instance of it like 2, you bind it so you say right now x = 2, or my noun phrase currently equals the boy, and then you have operations over variables, so you can add them together, you can put them together (concatenation, if you know computer programming), you can compare them, and so forth.

Together, these mechanisms provide a natural solution to the free generalization problem. So, computer programs do this all the time. You have something like the factorial function (if you’ve ever taken computer programming) and it automatically generalizes to all instances of some class, let’s say integers, once you have that code.
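As a concrete reminder of what that looks like, here is a plain factorial function in Python. The code is an illustration added here, not something shown in the talk; the comments point at the variable, the binding, and the operation over variables.

```python
def factorial(n):
    """n is a variable; each call binds it to an instance such as 5."""
    if n < 0:
        raise ValueError("factorial is defined here for non-negative integers")
    result = 1
    for k in range(2, n + 1):
        result *= k  # an operation over variables, whatever their values
    return result
```

Nothing in the definition mentions particular numbers beyond the starting value, so `factorial(5)` and `factorial(20)` both work without the function ever having "seen" those inputs.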

Pretty much all of the world’s software takes advantage of this fact, and my argument (e.g., from baby data) was that human cognition appeared to as well, innately.

The Algebraic Mind

The subtitle of that first book (you can’t see it that well here) was integrating connectionism and cognitive science. I wasn’t trying to knock down neural networks and say forget about it. I was saying, let’s take the insights of those things, they’re good at learning, but let’s put them together with the insights of cognitive science, a lot of which has been using these symbols and so forth. And so I said, even if I’m right that symbol manipulation plays an important role in mental life, it doesn’t mean we shouldn’t have other things in there too, like multilayer perceptrons, which are the predecessors of today’s deep learning.

Neural-Symbolic Cognitive Reasoning

I was arguably ignored, I think in candor, until a year or so ago. People, I think, started paying attention to the book again. But it did inspire a seminal book on neuro-symbolic approaches, which I hope some people will take a look at, called Neural-Symbolic Cognitive Reasoning, and I’m going to try to suggest that it also anticipated some of Yoshua’s current arguments.

2012: The Rise of Deep Learning

I stopped working on these issues, I started working on innateness, I learned to play guitar (that’s a story for another day) and didn’t talk about these issues at all until 2012, when Deep Learning became popular again. There was a front page story in the New York Times about Deep Learning, and I thought, I’ve seen this movie before. I was writing for the New Yorker at the time, and I wrote a piece and I said: “Realistically, deep learning is only part of the larger challenge of building intelligent machines. Such techniques lack ways of representing causal relationships. (A lot of discussion about that today). They have no obvious way of performing logical inference, and they are still a long way from integrating abstract knowledge.” And I once again argued for hybrid models. Deep Learning is just one element in a very complicated set of machinery.

2018: Critique of deep learning

Then, in 2018, Deep Learning got more and more popular, but I thought some people were missing some important points about it, so I wrote a piece (I was actually here in Montreal when I wrote it) called “Deep Learning: A Critical Appraisal”. It outlines ten problems for Deep Learning (I think it was on the suggested readings for here), and the failure to extrapolate beyond the space of training was really at the heart of all of those things. I got a ton of flak on Twitter (you can go back and search and see some of the history). I felt like I was often misrepresented as saying “we should throw away Deep Learning”, which is not what I was saying. And I was not careful enough in the paper: despite all of the problems I’ve sketched, I don’t think we need to abandon Deep Learning, which is the best technique we have for training neural networks right now, but rather we need to reconceptualize it, not as a universal solvent but simply as one tool among many.

The central conclusions of my academic work on cognitive science, and its implications for AI

So, the central conclusions of my academic work included the value of hybrid models, the importance of extrapolation, of compositionality, of acquiring and representing relationships, causality and so forth.

Part II: Yoshua

Some thoughts on his views and how I think they have changed a bit over time, a little bit on how I feel misrepresented, and how our views are and are not similar.

First things first: I admire Yoshua

The first thing I want to say is that I really admire Yoshua. For example, I wrote a piece recently skewering the field for hype. And I said, but you know, a really good talk is one by Yoshua Bengio: a model of being honest about limitations. I also love the work that he’s doing, for example on climate change and machine learning. I really think he should be a role model in his intellectual honesty and in his sincerity to make the world a better place.

My differences are mainly with Yoshua’s earlier (e.g., 2014–2015) views

My differences with him are mostly about his earlier views. We first met here in Montreal five years ago, and at that time I don’t think we had much common ground. I felt like he was putting too much faith in black box deep learning systems, and that he relied too heavily on larger datasets to yield answers. He’ll talk about system 1 and system 2 later; I guess I will as well. I felt he was all on the system 1 side and not so much on the system 2 side.

And, I went back and talked to some friends about that. A lot of people remember the talk he gave in 2015 to a bunch of linguists, who didn’t like Yoshua’s answer to questions like “how would we deal with negation or quantification words like every”, and what Yoshua did was to say we just need more data and the network will figure it out.

If Yoshua was still in this position, which I don’t think he is, I think we would have a longer argument.

Recently, however, Yoshua has taken a sharp turn towards many of the positions I have long advocated

Recently, however, Yoshua has taken a sharp turn towards many of the positions I have long advocated for: acknowledging fundamental limits on deep learning, the need for hybrid models, the critical importance of extrapolation and so forth. I have some slides, camera shots that I took at his recent talk at NeurIPS, that I think show a very interesting convergence here.


So, disagreements now.

I’ll talk about my position, the right way to build hybrid models, innateness, the significance of the fact that the brain is a neural network, and what we mean by compositionality.

And, that’s it, we actually agree about most of the rest.

1. Yoshua’s (mis)representation of my position (1 of 2)

The first one is the most delicate. But, I think occasionally Yoshua is misrepresenting me as saying “look, deep learning doesn’t work”, he said that to IEEE Spectrum. I hope I persuaded you that this is not actually my position. I think deep learning is very useful. However, I don’t think it solves all problems.

1. Yoshua’s (mis)representation of my position (2 of 2)

The second thing is: his recent work has really narrowed in on what I think is the most important point, which is the trouble deep nets have in extrapolating beyond the data, and why that means, for example, we might need hybrid models. I would like for him to cite me a little bit. I think not mentioning me devalues my contribution a little bit and further misrepresents my background in the field.

2. What kind of hybrid should we seek?

What kind of hybrid should we seek? I think Yoshua was very inspired by Daniel Kahneman’s book about system 1 and system 2, and I imagine many people in the crowd have read it. You should if you haven’t. It talks about one system that is intuitive, fast and unconscious and another that is slow, logical, sequential and conscious. I actually think that this is a lot like what I’ve been arguing for a long time. We can have some interesting conversation about the differences. There are questions: are these even different? Are they incompatible? How could we tell?

To argue against symbol-manipulation, you have to show that your system doesn’t implement symbols

I want to remind people of what I think is the most important distinction drawn in the cognitive sciences, which is by the late David Marr, who talked about having computational, algorithmic and implementational levels. So, you can take some abstract algorithm or notion, like I’m going to do a sorting algorithm. You can pick a particular one, like the bubble sort. And then you can make it out of neurons, you can make it out of silicon, you can make it out of Tinkertoys.
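To make the levels concrete, here is one ordinary bubble sort in Python. The code is an illustration added here, not from the talk: the goal (produce the items in sorted order) sits at the computational level, this particular procedure is one choice at the algorithmic level, and whatever machine runs it is the implementational level.

```python
def bubble_sort(items):
    """Algorithmic level: one particular procedure for sorting.

    Returns a new sorted list and leaves the input untouched.
    """
    result = list(items)
    for end in range(len(result) - 1, 0, -1):
        # After each pass, the largest remaining item sits at index `end`.
        for i in range(end):
            if result[i] > result[i + 1]:
                result[i], result[i + 1] = result[i + 1], result[i]
    return result
```

A different algorithm (say, Python's built-in `sorted`) meets the same computational-level description, which is why observing correct sorted output alone does not tell you which algorithm, or which hardware, produced it.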

I think we need to remember this as we have this conversation, so we want to understand the relation between how we’re building something and what algorithm is being represented. I don’t think Yoshua has made that argument yet. Maybe he will today.

I think that this is what we would need to do if we want to make a strong claim that a system doesn’t implement symbols.

Attention here looks a lot like a means for manipulating symbols

Yoshua has been talking a lot lately about attention. I think that what he is doing with attention actually reminds me of a microprocessor, in the way that it pulls things out of a register and moves them into a register and so forth. In some ways it seems to behave a lot like a mechanism for storing and retrieving values of variables from registers, which is really what I’ve cared about for a long time.

“We tried symbols and they don’t work”

Then, I’ve seen some arguments from Yoshua against symbols. Here’s something in an email he sent to a student, he wrote: “What you are proposing [a neuro-symbolic hybrid] does not work. This is what generations of AI researchers tried for decades and failed.” I’ve heard this a lot, not just from Yoshua, but I think it is misleading. The reality is that hybrids are all around us. The one you use the most probably is Google search, which is actually a hybrid between a knowledge graph, which is classic symbolic knowledge, and deep learning, like a system called BERT. AlphaZero, which is the world champion (or it was until recently), is also a hybrid.

Vincent Boucher: Professor Marcus, you have 5 more minutes.

OpenAI’s Rubik’s solver is a hybrid.

Mao et al, arXiv 2019

There is great work by Joshua Tenenbaum and Jiayuan Mao that is also a hybrid that just came out this year.

Lots of knowledge is not “conveniently representable” with rules

Another argument that Yoshua has given is that lots of knowledge is not conveniently represented with rules. It is true, some of it is not conveniently represented with rules and some of it is. Again, Google search is a great example, where some is represented with rules and some is not, and it is very effective.

3. Innateness

The third argument, and I don’t fully know Yoshua’s view, is about nativism. So, as a cognitive development person, I see a lot of evidence that a lot of things are built into the human brain. I think that we are born to learn and we should think about it as nature and nurture rather than nature vs nurture.

I think we should think about an innate framework for understanding things like time and space and causality, as Kant argued for in the Critique of Pure Reason and Spelke argued for in her cognitive development work.

The argument that I’ve made in the paper here on the left, is that richer innate priors might help artificial intelligence a lot. Machine learning has historically typically avoided nativism of this sort. As far as I can tell, Yoshua is not a huge fan of nativism and I’m not totally sure why.

Here is some empirical data showing that nativism and neural networks work. It comes from a great paper by Yann LeCun in 1989 where he compared four different models. The ones that had more innateness, in terms of a convolutional prior, were the ones that did better.

This is a picture of a baby ibex climbing down a mountain. I don’t think anybody can reasonably say that there is nothing innate about the baby ibex: it has to be born with an understanding of the 3 dimensional world and how it interacts with it and so forth in order to do the things that it does. So nativism is plausible in biology and I think we should use more of it in AI.