Bengio-Marcus AI Debate Post Mortem, Part I: The Deep Learning Pivot

Source: Deep Learning on Medium

I was shocked enough by Bengio’s pivot that, after the debate, I called on the deep learning community to define its terms.

Answers were all over the map. Definitions ranged from the awfully vague (e.g., “anything that involves a gradient”, which includes lots of techniques, some over 150 years old, that many would not recognize as deep learning), to statements about deep learning being a “research program” rather than a model (which potentially encompasses pretty much anything, excluding nothing), to other definitions that seem to me to be far more specific.

Blake Richards pointed to an important paper that Bengio and LeCun wrote in 2007 (cited over 1000 times) that seems close to how I have used the term. It defined “deep architectures” as “cascades of parameterized non-linear modules that contain trainable parameters at all levels … [with] outputs of the intermediate layers … akin to intermediate results on the way to computing the final output”.
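As a rough illustration of that definition, here is a minimal Python sketch of a “cascade of parameterized non-linear modules”. It is not code from the 2007 paper; the layer sizes, the tanh non-linearity, the random initialization, and all function names are assumptions chosen only to make the idea concrete. Each module has its own (in principle trainable) weights and biases, and the output of each level is kept as an intermediate result on the way to the final output.

```python
import math
import random


def make_module(n_in, n_out, rng):
    """Create one parameterized module: a weight matrix and a bias vector.

    The random initialization here is a hypothetical choice; in a real
    system these parameters would be adjusted by training.
    """
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return {"weights": weights, "biases": biases}


def apply_module(module, inputs):
    """Apply an affine map followed by a tanh non-linearity."""
    outputs = []
    for row, bias in zip(module["weights"], module["biases"]):
        pre_activation = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(math.tanh(pre_activation))
    return outputs


def cascade(modules, inputs):
    """Feed each module's output into the next one.

    Returns the output of every level; the last entry is the final output,
    the earlier ones are the intermediate results.
    """
    activations = []
    current = inputs
    for module in modules:
        current = apply_module(module, current)
        activations.append(current)
    return activations


rng = random.Random(0)
layer_sizes = [4, 8, 8, 2]  # hypothetical sizes: 4 inputs, two hidden levels, 2 outputs
modules = [make_module(a, b, rng) for a, b in zip(layer_sizes, layer_sizes[1:])]
activations = cascade(modules, [0.5, -0.2, 0.1, 0.9])
print([len(level) for level in activations])  # [8, 8, 2]
```

Nothing in this sketch includes gates, attention, pointers, or sets; every level is the same kind of module, which is the sense of “homogeneous” used later in this essay.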

To me, that sounds a lot like the kind of architecture found in AlexNet, the massively cited 2012 model that many people associate with the resurgence of deep learning. It is a neat stack of layers: five convolutional layers, some max-pooling layers, and three maximally homogeneous “fully connected” layers in which every neuron is connected to every neuron in the adjacent layer. Maybe deep learning has become or will become something broader, but AlexNet is pretty much the prototype.
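To show how plain that stack is, the sketch below traces the spatial size of the feature maps through the convolutional and pooling layers. The kernel sizes, strides, paddings, and channel counts are the values commonly reported for AlexNet; the 227-pixel input is an assumption (the paper states 224, but 227 is the value usually used to make the arithmetic work out), and the sketch ignores ReLU, local response normalization, dropout, and the two-GPU split. The names `ALEXNET_LAYERS`, `out_size`, and `trace_shapes` are hypothetical.

```python
def out_size(size, kernel, stride, padding=0):
    """Standard output-size formula for a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kind, output channels, kernel, stride, padding)
# Values as commonly reported for AlexNet; treat them as assumptions.
ALEXNET_LAYERS = [
    ("conv1", "conv", 96, 11, 4, 0),
    ("pool1", "pool", None, 3, 2, 0),
    ("conv2", "conv", 256, 5, 1, 2),
    ("pool2", "pool", None, 3, 2, 0),
    ("conv3", "conv", 384, 3, 1, 1),
    ("conv4", "conv", 384, 3, 1, 1),
    ("conv5", "conv", 256, 3, 1, 1),
    ("pool5", "pool", None, 3, 2, 0),
]
FC_SIZES = [4096, 4096, 1000]  # three fully connected layers


def trace_shapes(input_size=227, input_channels=3):
    """Return (name, channels, height, width) after each conv/pool layer."""
    size, channels = input_size, input_channels
    shapes = []
    for name, kind, out_channels, kernel, stride, padding in ALEXNET_LAYERS:
        size = out_size(size, kernel, stride, padding)
        if kind == "conv":
            channels = out_channels  # pooling leaves the channel count unchanged
        shapes.append((name, channels, size, size))
    return shapes


shapes = trace_shapes()
for name, channels, height, width in shapes:
    print(f"{name}: {channels} x {height} x {width}")

_, channels, height, width = shapes[-1]
flattened = channels * height * width
print("inputs to first fully connected layer:", flattened)  # 9216
print("conv layers:", sum(1 for layer in ALEXNET_LAYERS if layer[1] == "conv"))  # 5
```

The whole model is a fixed pipeline: convolve, pool, repeat, flatten, then three dense layers.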

Absent in canonical deep learning papers like AlexNet were all of the rather interesting gadgets that Bengio and I tangled over at the debate (gates, attentional mechanisms, pointers, sets, etc.). Those are the kinds of more sophisticated things that Bengio develops that I think we should seriously consider; they may have existed on the margins in 2012, but they weren’t part of AlexNet or most of the discussion at that time. And (with the exception of attention) I don’t think many were mainstream until fairly recently. None of those were mentioned in the seminal 2007 discussion of deep architectures. Even in Bengio, Hinton and LeCun’s 2015 Nature paper, most of the focus was on architectures in that early mold: convolutional neural nets and recurrent nets in the classic style. Gates and attention, to say nothing of pointers and sets, were on the sidelines.

At some point something changed.

For the record, AlexNet and Bengio and LeCun’s definition both come close to the target I was talking about when I said that “homogenous, multilayer perceptrons would be inadequate” in bullet point two of my conclusions:

In the days that followed the debate, I got (in addition to a ton of heartfelt praise via Twitter and email that I very much appreciate) the sort of flak I have become accustomed to getting. Strawperson tweets cautioned consumers that one should never say “Deep Learning can’t do X” — when I had done no such thing (if anything Bengio himself was less careful, in his NeurIPS keynote talk in the slide I reprinted above).

Instead, as shown above in my own conclusions slide, I carefully restricted my conclusion to homogeneous multilayer perceptrons. Aside from rhetorical tricks like that, which conflate what I actually said with overly general claims that I tried to avoid, most of the after-debate defense in the ML community came from arguing that a larger, more open-ended notion of deep learning might provide solutions for reasoning and causality.

Maybe so, maybe not. It is important to realize that none of that actually exists yet: no adequate model for reasoning or causality or compositionality exists under any definition of deep learning. Whether you take deep learning to be AlexNet or to be a much more open-ended framework or “research program”, along the lines of Bengio’s debate slide that frames deep learning as “evolving”, nobody has yet delivered the goods; it’s all hypothetical.

In what remains of this essay I want to consider two things: what the broader definition does and does not entail, and what we might learn from a narrow definition.

Deep learning, broadly defined

What the broader definition of deep learning — deep learning as research program and evolving entity — entails is … basically nothing.

There is perhaps some truth to the notion that the definition has started to shift, that some people in the machine learning community now see deep learning as more a methodology than any specific set of tools. But by the time that pivot is complete, there may be no substantive claim left, either about how the mind works or about how we should build AI.

As Rodney Brooks noted in a tweet responding to Tom Dietterich’s claim that “DL is not a method, it is a methodology. DL is not a method, it is a research program”, deep learning in the broad sense could encompass just about anything.

Brooks asked, rhetorically and sarcastically: is deep learning now just anything that we call AI? Or has it become a “land grab”, an effort to take credit for “any future magic that anyone comes up with”?

I concur with Brooks: it does seem (not just from Bengio, but from many points) that the deep learning community is currently positioning itself to take credit for any future technique that anyone might come up with, without really committing to much of anything.

I am reminded of the classic folk tale Stone Soup. A hungry traveler comes to a village, finds a rock and some water, and declares himself to be making stone soup. Others are enticed to come along and kindly add to the soup, some chicken here, some potatoes there, and so forth. At the end, everyone has a wonderful soup; if we are lucky, all the marketing and hype around deep learning will play a similar catalytic function.

But a stone is just a stone; it plays no functional role in the soup, adding neither flavor nor nutrition. Will the original ideas in AlexNet and models of that sort (which are roughly what many initially meant by deep learning) ultimately prove to be an important part of machine intelligence? Or will deep learning ultimately seem like a catalytic footnote, an impressive result that reinvigorated a movement but wound up as a tiny part of a whole that was ultimately built by a much larger community, an alchemy absorbed and reconceived by a later, more sophisticated science of chemistry?

As Beth Carey pointed out, quite rightly, the real question ought to be, “Is DL the right scientific model?” But once the term itself relaxes from a specific referent of particular extant models into an “evolving approach”, it’s not really a scientific model anymore, at all. It’s a brand name.

Deep learning, narrowly defined

It should already be clear that Bengio’s most memorable line of the night, “you are attacking deep learning from the 1980s” (this may be a slight paraphrase), wasn’t remotely fair. At the very least, the class of models that I was attacking was paradigmatic in 2007. AlexNet (2012), probably the most influential deep network of all time, with nearly 50,000 citations, falls into the same class. A certain fraction of the work still being published falls into essentially the same regime, piling on more layers but still following essentially the same formula. Bengio himself does some of the most innovative work in stretching beyond that class, but it’s hardly fair to say that I am attacking a strawman from three decades ago. Indeed, to the contrary, the vast majority of Bengio’s 2016 textbook was squarely in the scope of what I was critiquing, and of what he himself had critiqued two weeks before. It’s historical revisionism of the highest order to pretend otherwise.

Even today, the kind of stuff found in AlexNet and described by Bengio and LeCun is pretty central to how most people think of deep learning, even if those specific tools no longer exhaust the deep learning toolkit. We can’t just pretend that those things never existed, nor forbid discussion of them.

Instead, let’s call the central set of techniques that characterized early deep learning, and in fact the great majority of what has been published so far (multilayer perceptrons, convolutional nets, and so forth), “core deep learning”.

Core deep learning took us a long way; it led to tremendous advances in image labeling, speech recognition, and more.

But it’s also become obvious to many of us that it is insufficient on its own. The problems that Bengio was pointing to in his NeurIPS opening slide above, and that I pointed to in my 2001 book, make clear the reason why: core deep learning generalizes poorly outside the training distribution. That in turn leads to poor performance on what some might call higher-level cognition: things like language and reasoning, including causal reasoning.

Rewriting history by redefining terms doesn’t change these facts; if deep learning can capture these higher-level processes, it will do so only by means of augmenting the core.

If you are happy with any possible augmentation to that core still counting as deep learning, because there is a gradient somewhere in the system, you are going to call anything that works deep learning, even if the core is a small part of the overall system.

If you are intellectually honest, you won’t; you’ll look at the system as a whole, once we get it working (nobody has yet), and say something along the lines of “core deep learning is one important piece, and these other clever gadgets are equally important. The way these other gadgets work is that gadget X allows us to do reasoning for reason Y, and gadget P allows us to get deep language understanding to work for reason Z.” And so on, cutting the deep learning rhetoric and focusing on explicating the winning formula. Ultimately maybe we will have 10 or 100 clever tools; deep learning will (likely) deserve some of the credit, but there are almost certainly many important new techniques left to be invented.

Time spent trying to defend the honor of deep learning by rebranding is time spent away from trying to figure out what those gadgets should be.

The real question, now, is this: what do we have to add to the core to get a system that is genuinely capable of reasoning, language, planning and so forth?

More about that in Part II.