The Quantum Mechanics of Language

Original article was published on Artificial Intelligence on Medium


Get Mila: The Podcast for free on SoundCloud.


Theme music: “Psychedelic Vox” by Mila PhD student Gabriel Huang, under his stage name Poivre Noir

LAURA: Welcome to Mila: The Podcast, conversations with researchers at the Quebec AI Institute. Our first guest is Jacob Miller, a postdoc with a physics PhD, who’s using quantum mechanics to learn grammar.

(to Jacob) So, what do you work on?

JACOB: One very, very floofy, kind of high-level way of explaining it is that if a neural network is an architecture that’s originally inspired by brains and connectivity patterns of neurons — even if it’s not a very biorealistic representation — what we’re working with is inspired by entangled quantum systems. And so the metaphor is not necessarily the algorithm as a synthetic brain, but it’s the algorithm as a synthetic quantum mechanical system.

And in particular, all of the methods, or all of these models, that I’m going to be talking about are based on this idea of a tensor network, where rather than having a bunch of stacked layers where you have weight matrices and nonlinearities and weight matrices and repeating that, you instead take all of your data and associate it with multidimensional tensors which you then multiply together in a generalization of matrix multiplication.

In their NeurIPS 2016 paper [1], Miles Stoudenmire and David Schwab use this illustration to show how a tensor network can take a six-dimensional tensor (left) and express it as a contraction of six tensors with fewer dimensions (right), which reduces the total number of components you need to represent it. Physicists do this to describe a big quantum system more efficiently, as the contraction of the possible states of its quantum particles. Now computer scientists are using the same trick to make tensors out of their datasets (image, language, what have you) and learn simpler, contracted forms of the patterns that govern them.
Three Mila researchers illustrate the same concept with slightly flashier graphics in their AISTATS 2019 paper [4]. Their four-dimensional tensor (left) is decomposed into four smaller tensors (right). The rank of the smaller tensors, R, is a hyperparameter in the learning model: choosing a smaller rank will mean storing fewer components, so it is more efficient, but it might also turn out to be less accurate.
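[For readers who want to see the decomposition in the figures above in action: the NumPy sketch below builds a six-index tensor with hidden low-rank structure and splits it into a chain of small "cores" by repeated reshaping and SVD. The tensor, the rank threshold, and all variable names are our own illustration, not taken from either paper.]

```python
import numpy as np

# Build a 6-index tensor that secretly has low-rank structure: a sum of
# two rank-1 tensors, so every "cut" through it has rank at most 2.
np.random.seed(0)
vs = [np.random.randn(2) for _ in range(6)]
ws = [np.random.randn(2) for _ in range(6)]
T = (np.einsum('a,b,c,d,e,f->abcdef', *vs)
     + np.einsum('a,b,c,d,e,f->abcdef', *ws))

R = 2            # the rank hyperparameter from the AISTATS figure
cores = []
rest, r_prev = T.reshape(2, -1), 1
for _ in range(5):
    U, S, Vt = np.linalg.svd(rest, full_matrices=False)
    r = min(R, int(np.sum(S > 1e-10)))      # drop negligible singular values
    cores.append(U[:, :r].reshape(r_prev, 2, r))
    rest = (S[:r, None] * Vt[:r]).reshape(r * 2, -1)
    r_prev = r
cores.append(rest.reshape(r_prev, 2, 1))

# Contract the chain back together: it reproduces T exactly, with fewer
# stored components than the original 2**6 = 64.
out = cores[0]
for c in cores[1:]:
    out = np.tensordot(out, c, axes=([-1], [0]))
out = out.reshape(T.shape)
print(np.allclose(out, T), T.size, sum(c.size for c in cores))
```

[Because this tensor really does have rank 2 across every cut, the truncation loses nothing; on a generic full-rank tensor, a small R would trade accuracy for compression, exactly as the caption describes.]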

LAURA: But why… so, why would I want to do that? People wanted to build artificial neural networks that roughly resemble the brain because they wanted to do a task that was roughly like what the ventral visual pathway is doing.

JACOB: Yeah, yeah.

LAURA: So why would I want a network that is like an entangled quantum system?

JACOB: Yeah, so I think for that… these architectures sort of look like an interesting generalization of kernel-type models, like a kernel SVM, but it brings in a lot of new features that are not really present in kernel-type models. They’re a bit more expressive, and you can do things that are not just taking the inner product of different data points.

LAURA: So give me an example of something that I could do with this approach that I couldn’t do with a classical neural network.

JACOB: Yeah, you can… So, one thing you can do is you can have generative models, where normally, if you want to sample from some probability distribution, it’s this hard task to actually get what the normalization constant is. So when you sample, you can produce samples, but if you want to say what’s the exact probability of this thing that I sample occurring, or what’s the exact probability of some input data, you would need to do some sort of approximation of that sample, some estimate of what the overall normalization factor should be in your likelihoods. And in our models, that can be computed exactly, owing to the fact that you can sum over exponentially many different outcomes in a principled way that makes it much, much more efficient.
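[A toy illustration of what "summing over exponentially many outcomes in a principled way" can look like for the matrix product models discussed later in this interview. This is our own minimal construction, not Jacob's actual model: each symbol gets a matrix, the Born rule squares the resulting amplitude, and the normalization constant Z comes out exactly via a transfer matrix instead of enumerating all 2^n strings.]

```python
import numpy as np
from itertools import product

# Toy Born machine over n binary sites: each symbol s in {0, 1} gets a
# D x D matrix A[s]; a string's unnormalized amplitude is a matrix product
# closed off by boundary vectors, and its probability is amplitude**2 / Z.
np.random.seed(1)
n, D = 12, 3
A = np.random.randn(2, D, D)
vl, vr = np.random.randn(D), np.random.randn(D)

def amplitude(bits):
    m = vl
    for b in bits:
        m = m @ A[b]
    return m @ vr

# The normalizer sums amplitude**2 over all 2**n strings. The transfer
# matrix E = sum_s A[s] (x) A[s] collapses that sum into n matrix powers.
E = sum(np.kron(A[s], A[s]) for s in range(2))
Z = np.kron(vl, vl) @ np.linalg.matrix_power(E, n) @ np.kron(vr, vr)

# Brute-force check (exponential cost -- only feasible because n is small).
Z_brute = sum(amplitude(bits) ** 2 for bits in product([0, 1], repeat=n))
print(np.isclose(Z, Z_brute))  # True
```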

LAURA: It’s a pretty new idea, right? When did this come on the stage?

JACOB: The first paper that really put forward the ideas behind this was 2016.

LAURA: So, who made the jump in 2016? Who wrote that paper, and what were they trying to solve?

JACOB: Yeah, so it was a physicist named Miles Stoudenmire, who works at a place called the Flatiron Institute in New York City, and a guy named David Schwab who is a sort of physicist slash… I think theoretical biologist… I don’t know him as well… but they wrote this paper together that said, “Hey! We can take these ideas that people have been using for well over a decade in many-body physics and basically take these techniques for simulating complex quantum systems and say, ‘Hey, let’s use these instead to reproduce the patterns and correlations that we see in real-world data.’”

[Jacob has asked me to insert a shout-out here to the near-simultaneous work of Alexander Novikov, Mikhail Trofimov, and Ivan Oseledets [2], discussed later in this interview.]

And if I were going to explain this to a physicist, the idea is that you take whatever your input is — like if you have an image, or something like that — and rather than saying, “Oh, this image is 10×10, so it’s 100 pixels,” you say, “Each one of those pixels represents a collection of different possibilities. There’s all of the values that those pixels could be.” And you treat each of these pixels as if it were a quantum system, and then try to learn a many-body quantum wave function that encodes the actual distribution of data that you have.
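[As a concrete picture of "each pixel represents a collection of different possibilities": one local encoding, which we believe matches the one used in [1], maps each grayscale value to a two-component unit vector, so every pixel behaves like a little two-level quantum system.]

```python
import numpy as np

# Map a pixel value x in [0, 1] to a two-component "state". This [cos, sin]
# choice is the feature map we recall from Stoudenmire and Schwab [1]; any
# map onto unit vectors would illustrate the same idea.
def pixel_to_state(x):
    return np.array([np.cos(np.pi * x / 2), np.sin(np.pi * x / 2)])

image = np.random.rand(10, 10)                # "10x10, so it's 100 pixels"
states = np.array([pixel_to_state(x) for x in image.ravel()])
print(states.shape)                           # (100, 2)
print(np.allclose((states ** 2).sum(axis=1), 1))  # each row is unit length
```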

LAURA: It’s a little strange for me to hear you say that, because I remember a mathematician — strictly a mathematician, not a physicist — several years ago, explaining the difference between the quantum level and the mesocosm, as it were, saying that quantum systems were different because they have superposition, and using the example of a pixel to say that a pixel cannot have multiple values at once.

JACOB: Oh, yeah. It’s true. Yeah.

LAURA: But does a tensor network treat it as if it could, as if superposition is possible?

JACOB: So the types of tensor networks we’re working with, and the particular way of taking those tensor networks and getting probabilities to represent a distribution, it’s in this paradigm — this Born Machine paradigm, which is a reference to the Born rule of quantum mechanics — and so when you do that, you actually are allowing for superposition, but the idea is that this full space of quantum probability distributions actually contains all of the regular classical probability distributions as a special case.

LAURA: Yeah.

JACOB: And so you’re basically working in a larger state space. So if, somehow, this idea of superposition was helpful, we can learn that. Our model can learn that. But if it wasn’t helpful — you know, it doesn’t matter either way, because we can learn some other representation of the probability distribution governing that data.

And so anyways, so these ideas — I mean, people in quantum mechanics have talked about these sorts of philosophical concerns for a while — but the reality is that there is randomness in the real world, and if you use a random quantum mechanical system, you can still reproduce that type of randomness by using certain choices of your quantum states.

LAURA: Does the uncertainty principle apply to tensor networks?

JACOB: Um… Yes, it would, but the uncertainty principle as it’s usually concerned, or as it’s usually formulated, is about position and momentum, which… you have to have dynamics, and a particular kind of dynamics, for that to make sense. And in our case, we’re talking about… like, the project I’m working on right now is using these models to tackle language data. And of course, you can’t talk about position and momentum with words, because that’s even more bananas.

LAURA: (Laughs)

JACOB: I mean, it would be cool if we could. It would be cool to say, “Oh, I have this model, and I’ve, you know, figured out some uncertainty principle of language.” But right now, I mean… that requires extra mathematical structures that don’t really make sense directly in the context of language or image or audio data.

LAURA: So what dataset are you working on now?

JACOB: We’re actually working with a synthetic dataset now, which is just bit strings that are constrained in a particular way to have a definite overall parity, so that the number of ones that appear in it is always even, which is considered a hard problem for a lot of restricted Boltzmann machines or neural network models to capture. And one of the people that we’re collaborating and working with has basically modeled this problem using a slightly different kind of tensor network and found that it has pretty good performance compared to neural networks and RBMs [3].

When RBMs of various sizes try to generate bit strings with even parity of length 3 (left) or 4 (right), they often don’t even do better than random guessing: there are plenty of blue dots lower than 1 on the y-axis of these graphs (from “Expressive Power and Approximation Errors of Restricted Boltzmann Machines” by Guido Montufar, Johannes Rauh, and Nihat Ay at NeurIPS 2011).
On the other hand, when tensor networks with various bond dimensions try to generate even-parity bit strings of length 20 (*20*! That’s much bigger than 3 or 4!), they almost always do better than random guessing: the blue and orange lines both manage to stay below 20 on the y-axis of this graph (from “Probabilistic Modeling with Matrix Product States” by James Stokes and John Terilla on arXiv 2019 [3]).
Jacob (right) is now commuting to New York for one week every month in order to work closely with John Terilla at the CUNY Graduate Center (left), who published the experiments in the second figure above.
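[The parity dataset Jacob describes is easy to write down: sample bits uniformly, then fix the last bit so the total number of ones is even. Any strict subset of the bits looks uniformly random; only the whole string reveals the constraint. The function below is our own sketch of the dataset, not code from [3].]

```python
import numpy as np

def even_parity_strings(n, num_samples, rng):
    # Sample the first n - 1 bits uniformly, then choose the last bit so
    # the total number of ones is even. Every prefix is uniformly random,
    # yet every full string satisfies the global, non-local constraint.
    prefix = rng.integers(0, 2, size=(num_samples, n - 1))
    last = prefix.sum(axis=1) % 2
    return np.column_stack([prefix, last])

rng = np.random.default_rng(0)
data = even_parity_strings(20, 1000, rng)
print((data.sum(axis=1) % 2 == 0).all())  # True: every sample has even parity
```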

So right now, I’m just working with toy data, but the idea is to start looking at the Penn Treebank corpus of sentences as a first point for just working with regular language modeling.

But, yeah. The dream would be going to larger and larger natural language processing datasets, but you know, right now — since it’s new models, and I’m building everything from scratch — it’s, you know, starting small.

LAURA: From what we know right now, if you were to design a task that would show off the advantages of tensor networks as well as possible — if you wanted to get the biggest possible gap in outcome between them and normal methods — what would that task be?

JACOB: Um, the task —

LAURA: It can be totally contrived, totally synthetic —

JACOB: Yeah, yeah. Well, I think… So actually, the parity problem that I just mentioned, the idea that you’re outputting this bit string that looks completely random — like, if you look at any section of it, it’s actually uniformly random data — but it has this constraint to it that if you sum up all the bits in this, you always get an even number. And that’s something — that sort of nonlocal pattern or nonlocal structure in the data is something that I believe people have had difficulties learning easily with neural networks, because it’s not… you know, if you remove any one of these bits, or any piece from the dataset, basically, there’s no signal to learn from. It’s completely random. And the surprising thing is that these tensor networks are better at capturing algebraic-type structure. So exact relations like that — where you say, “Oh, if I sum up all these individual pieces, I always get this thing” — that’s actually easier for these models to capture than regular neural networks. So, something like that. And the hope is that with grammar, within natural language — since grammatical structure has a little bit of a similar character, that there are non-local relations that hold between different pieces of a sentence — that the way in which those individual pieces compose has a little bit of this algebraic character. Like if you build a syntax tree, that looks a little bit like a computation tree or an expression within some algebraic language.

LAURA: Give us an example of a sentence where two words that aren’t adjacent have an important non-local relationship that affects the meaning.

JACOB: Um… It’s sort of present throughout language, but if you had “The dog which goes to the park is happy,” or “A dog which goes to the park is happy,” it’s the dog that’s happy. But if you just looked at local pieces of the sentence, you could see “the park is happy” —

LAURA: I don’t do NLP, but even I know this non-locality issue is a real problem because I’m used to hearing people like Dima and Yikang and Shawn complain about it. What is it about tensor networks that makes them better for this? If you were to explain why they work for this, how would you do that?

[Mila PhD student Dima Bahdanau was featured previously on the Mila.Quebec blog. Mila PhD students Yikang Shen and Shawn Tan are the first and second authors on “Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks,” with Microsoft researcher Alessandro Sordoni and Mila professor Aaron Courville, which won the Best Paper Award at ICLR 2019.]

Yikang Shen and Shawn Tan, two of Mila’s most formidable NLP bros, discuss Yikang’s latest work on an ordinary Monday.
In the experiments which won Yikang and Shawn the Best Paper Award at ICLR 2019, they designed a neural network to read sentences from the Wall Street Journal and parse them in the same way that professional linguists do. But even though they beat the state of the art for parsing many types of “long-term dependency” (the situation which Jacob illustrated with his dog-in-the-park example, where a phrase comes in between two related words), their model still made some mistakes. In the second row above, Yikang and Shawn’s model (left) does the same thing as the human experts (right) when it links “75.3 million” to “interest expense,” even though they’ve been separated by a four-word prepositional phrase. But in the first row, it fails to detect that “4.5 billion” is modifying “bonds,” even though there’s just one word in between them.

JACOB: Yeah, yeah. So I think it would be because… what we’re doing with our particular… So, our particular flavor of tensor networks that we’re working with is called matrix product states, and it’s the idea that each word in a sentence would be associated with a matrix, and when you evaluate a sentence — you want to say, “OK: Is this something that would actually occur in my dataset?” — you associate each word with a matrix, and then you multiply all the different matrices together. And within physics and within mathematics, it’s well-known that matrix representations are very, very expressive, within many different algebraic structures. So within physics, people often look at symmetries of some real-world system, and the idea of a symmetry is abstract. It’s the idea of doing a rotation or some sort of transformation of your viewpoint. But when people actually want to do anything useful with this, they can take these abstract transformations and represent them concretely with matrices. There are a lot of established ways of taking one large matrix and breaking that into many, many smaller pieces which each have their own roles and each handle the representation of different parts of the structure. And so for us, the hope is that all of these different elements of grammar — and elements of semantics, and understanding facts about the real world — can be separated into different parts of a matrix. And then when we do this automatic matrix multiplication, what’s actually going on under the hood is the evaluation of many different pieces of structure, coding many different facets of natural language.
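[A bare-bones version of "associate each word with a matrix, then multiply": the toy below assigns every word in a tiny vocabulary a random D x D matrix and scores a sentence by the squared value of the resulting matrix product. The vocabulary, dimensions, and scoring here are our own illustration; in the real model, the matrices would be learned from data.]

```python
import numpy as np

np.random.seed(2)
D = 4
words = ["the", "a", "dog", "which", "goes", "to", "park", "is", "happy"]
vocab = {w: np.random.randn(D, D) / np.sqrt(D) for w in words}
vl, vr = np.random.randn(D), np.random.randn(D)

def score(sentence):
    # Multiply the matrices for each word in order, close off with the
    # boundary vectors, and square (Born-rule style) to get a score >= 0.
    m = vl
    for word in sentence.split():
        m = m @ vocab[word]
    return float(m @ vr) ** 2

print(score("the dog which goes to the park is happy"))
# Matrix multiplication is order-sensitive, so word order matters:
print(score("the dog is happy"), score("happy is dog the"))
```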

LAURA: But is it… Am I seeing the entanglement principle here? I mean, is it… By finding the connection between the two separate words, are we treating them like entangled particles in a quantum system? Does the metaphor work at that level, or not?

JACOB: I think it could. I mean, it’s still… We’re not entirely certain how strong the idea of entanglement is really going to be.

LAURA: “We” as physicists, or “we” as machine learning scientists?

JACOB: “We” as the few people at Mila who are working on this, the people I’ve talked to about it. But I think if you were to have the idea of entanglement happening in language, it really would be this idea that you can have… One way of talking about entanglement is it’s the idea that you can say something definite about a collection of several pieces, but if you look at any one of them individually, you lose that structure. You really need to observe the joint behavior to actually capture the entanglement. And in language, I think that’s definitely the case, that if you just break a sentence up into individual pieces, it’s very easy to create sentences where smaller pieces look like they have different meanings, but when you actually zoom out to take the entire thing into context, the overall meaning that comes from that is very, very different than any of the individual pieces.

But at the same time, you have similar types of effects and similar types of phenomena with just regular probability distributions. If we could have two coins that were completely correlated, just by virtue of… like, if somebody flipped a coin, and they didn’t show either of us what it was, and they just took that coin and put another one on the table that was the same way and gave it to each one of us, neither one of us would be able to guess beforehand what our coins were, but we could still know that it was the same value on both coins. So this whole idea of what’s genuinely quantum entanglement versus what’s things that are just properties of randomness is a little bit of a tricky subject within quantum physics.

And so the idea of making that separation within natural language data would be weird. But I do think you can have the idea of superposition, and the idea of entanglement. I’m hopeful that these are going to have interesting implications or interesting interpretations within this totally different context of natural language processing.

One of the things we’re sort of very speculatively looking at right now is the idea of superposition and relative phase between different states as being a little bit similar to, like, a similarity metric in terms of how replaceable or exchangeable two words are in a sentence.

LAURA: That’s so cool!

JACOB: Yeah! So I have no idea if this is actually going to work. This is something I just started thinking about a couple of weeks ago. I have a lot of stuff on my plate to get through before I can even give that a lot of thought. But the idea would be, like, synonyms versus antonyms would be something like relative phase in the sense of quantum mechanics, where there’s the idea of constructive versus destructive interference.

So normally, in regular, real-world situations — Schrödinger’s cat, for example, or… Schrödinger’s cat is not a real-world situation…

LAURA: (Laughs)

JACOB: In real-world situations, you only have one state of reality, but the thing about quantum mechanics is you can look at taking two states of reality simultaneously and mixing them together and seeing if the mixture of the two has a constructive effect between the two that the collective signals amplify, or if these two possibilities happen to cancel each other out.

And so… that was a bit of a tangent, but the reason I was saying this is because it would be like in regular sentences, you only have individual sentences with individual words, and if you want to look at how exchangeable two different words are, you compare all of the sentences in which this one word occurs, and all of the sentences in which this other word occurs, and see if there’s a lot of overlap between them, in the sense of exchanging the two. And in our case, that’s actually something you can probe more directly by looking at the phase between these two words, in the sense that if you exchange them, will you get constructive interference, meaning that they’re more synonymous, or destructive interference, meaning that they’re more… “antonymous”? I don’t know….

LAURA: Sure, why not.

JACOB: Yeah.

LAURA: So if my mother is listening to this, and she’s a lawyer in Wisconsin and didn’t take math beyond algebra — not a scientist in any way —

JACOB: Hi, Laura’s mom!

LAURA: What’s an example of a context where she might interact with these ideas, or the results of these ideas? How would this touch on her life?

JACOB: Hmm… Oof.

LAURA: Like, can you imagine a technology that might be built from this that would solve a problem?

JACOB: Yeah, taking existing APIs… You can almost think of it like, “We want a machine learning model that learns some probability distribution.” I’m looking at a different way of doing that, under the hood. It’s a radical departure from traditional methods, but at the same time, it’s supposed to achieve the same results. And if you want to assess performance on datasets or things like that, the benchmarks that we use are going to be exactly the same as they are other places. So, I think… I mean, this is sort of a cop-out answer, but it’s — anybody doing really theoretical research, I feel, often has similar things — it’s that any of the things that real people’s lives would be changed by with machine learning methods, any of those applications could potentially be applications of this family of models. It’s just the hope is that whatever those bright, shiny, promising applications are in the future, if we find techniques that really work with this family of models, we can bring that future a little bit closer to actually being the present. So, I don’t know. I mean —

LAURA: No, that’s fair.

JACOB: Yeah, yeah.

LAURA: How many people at Mila are going rogue like you?

JACOB: Um…. Pretty much just… So I work for Guillaume Rabusseau, and the people in his group are… some of us are interested in different aspects of this.

LAURA: Total, we’re talking about between five and ten people, out of a couple hundred at Mila, who do this?

JACOB: Oh, I would say on the range of two to five at Mila.

LAURA: Because there’s you, there’s Guillaume. I think Doina coauthored a paper with Guillaume at one point? [4]

[The paper in question [4] is called “Connecting Weighted Automata and Recurrent Neural Networks Through Spectral Learning,” and it was published at AISTATS in 2019. Mila professor Guillaume Rabusseau, who is Jacob’s advisor, was the lead author. Mila PhD student Tianyu Li and Mila professor Doina Precup were the two coauthors.]

JACOB: Yeah, yeah.

LAURA: And then there’s Tianyu… So, very small minority here….

JACOB: Yeah, so four, five… and growing.

Mila PhD student Tianyu Li (left) and Mila professor Guillaume Rabusseau (right) discuss a matrix product state (notice the “MPS” in the second line), which is the same kind of tensor network that Jacob is working with. Fun Fact: This is the only blackboard on the new Mila premises. It was specially installed for Guillaume in his office.
In Guillaume and Tianyu’s AISTATS 2019 paper with Doina [4], they showed two non-language tasks where a standard neural network (orange) didn’t learn as quickly as tensor network methods (all other colors). In these graphs, the standard network’s MSE, which represents the gap between the network’s chosen answer and the right answer on each training trial, tends to go down more slowly as the number of training trials goes up, and in the end, it doesn’t drop as far. Meanwhile, a new tensor network method introduced by Guillaume’s team (light blue) did better than the other tensor network methods did (red, dark blue, green, purple), often achieving the lowest MSEs of all, with the least amount of training, in between 100 and 1,000 trials.

LAURA: We have a ton of new students at Mila right now. Is your cult recruiting? If you were recruiting, what would you tell them in order to convince them to join you, instead of sticking with more normal neural networks?

JACOB: Uh, our Kool-Aid is more fun. (Laughs) For me, I’ve always been a big math person, and even though I studied physics, math has always been my guiding star throughout the whole thing. One of the big selling points, I think, for this area of research is that the math is really pretty. When I first learned about neural networks, before I got into machine learning, I actually was sort of repelled, like, “Oh, I guess they work — but it seems ugly.” This whole idea of having weights and nonlinearities, weights and nonlinearities, you could build powerful models that way, but if you try to actually analyze it, it takes a lot, a lot of work, and, I think, a lot of techniques that are different from what people normally learn in the physical sciences or in engineering or things like that. So I think one of the things that’s nice about this is that if you have a mathematical inclination, these models are very pretty and very nice to work with, and a lot of the machinery that you would learn in the context of engineering, or physics, or any of the other mathematical sciences, you can apply to these things directly, without having to make a bunch of assumptions first.

So far, also, another point to make on this is that almost all of the work that’s been done has been done by physicists, because these models come from physics. The type of math you use to reason about them is very different from what people normally use to reason about neural networks.

And just as a… sort of mathematical correspondence, in regular neural networks, people often use concatenation: they’ll take two signals from different places, concatenate them together, and act on them jointly with some weight matrix or something. And in our case, we actually use what’s called an outer product, or the tensor product, which is different from concatenation but still represents the idea of taking two pieces of information and representing them jointly within one space. If concatenation is like summing the dimensions of the state space, the outer product multiplies the dimensions, and so if you get many pieces of data that are being processed, you have to make sure that you’re not… you have to keep using tricks to avoid having each piece multiply the size of your state space, because if you multiply something a bunch of times, you just get exponential blowup.
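[The dimension-counting Jacob describes is a one-liner to check: concatenation adds state-space dimensions, the outer product multiplies them, and repeated outer products blow up exponentially.]

```python
import numpy as np

a, b = np.random.randn(3), np.random.randn(4)
joint_concat = np.concatenate([a, b])  # dimensions add:      3 + 4 = 7
joint_outer = np.outer(a, b).ravel()   # dimensions multiply: 3 * 4 = 12
print(joint_concat.shape, joint_outer.shape)  # (7,) (12,)

# Outer products across n two-dimensional pieces give a 2**n-dimensional
# state -- the blowup that tensor networks avoid materializing directly.
n = 10
state = np.ones(1)
for _ in range(n):
    state = np.outer(state, np.random.randn(2)).ravel()
print(state.shape)  # (1024,)
```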

LAURA: What about zeros?

JACOB: Um, zeroes… Yeah, you need to make sure there’s never… that anything that you process never has any zeros in it, and of course, real data does. So you have to preprocess your data a little bit to represent it in a way that isn’t going to mess up your signal.

But, yeah. The downsides of this… I mean, you don’t get anything for free. If you’re looking for arbitrary expectations, you actually have to express that as a power-series-type approximation. So you can’t compute arbitrary things efficiently and exactly. And also because this is still a weird architecture from the standpoint of what people have been looking at in machine learning, there’s a lot of finicky parts of the model that people just haven’t really looked at before. And so we’re having to go through and take these things that, mathematically, are complex but still straightforward enough, and say, “Oh, OK. When I actually put this on a computer, this is the first thing that blows up,” in terms of, you know, you run the model, and you just get a bunch of NaNs as output. And so you have to say, “OK: Why is this exploding? What can we do to avoid this?” These models are much more sensitive to exponential blowup or decay of your internal representation of your data.

LAURA: Why Mila? You’re not Canadian. You had to leave your country to come here. Why Mila?

JACOB: (Laughs) That was actually… That was a very easy question, because when I first came here, I thought I was just going to do a temporary stint here, hang out a little bit, learn some stuff, and then turn around and get some fancy job at Google, or something like that. And what actually happened was I came here, and I started doing the research, and I really fell in love with both the research itself and also the Mila community. I feel like there’s a lot of openness in the culture to wide-ranging discussion, like a really… a much less hierarchical sort of arrangement of peoples than what I’d been used to in academia.

LAURA: But for that temporary stint, you could have gone into orbit around any number of US institutions. You must have picked Mila for a reason.

JACOB: Oh yeah, and it was not a deep reason that I well-researched. It was that I have a friend of a friend who was a grad student at Mila, who I contacted and, you know, asked him about it, and he was like, “Oh, yeah. Just email Yoshua. He’s a cool guy,” and I didn’t know Yoshua was such a famous guy at the time, so I was like, “Oh! ‘Hey Yoshua! I’m the…’” — I mean it wasn’t actually like that, but you know, it was just basically…

LAURA: (Laughs)

JACOB: Basically, I cold-emailed him. And to his credit — me being this guy who was like, “Hey, I have no ties to this community right now. I’m a recovering physicist who’s interested in this stuff…”

LAURA: (Laughs) “Recovering…”

JACOB: “… Help a brother out.” Yoshua emailed me back about that. This guy who within a year won the Turing Award and all this stuff, and that was… I feel like it was almost sort of meant to be, in that sense. If it had been a different institute, with a different culture, I feel like the head — like the head — of the whole lab would never have replied to someone like a nobody, like me at that time. I dunno. I think if Mila hadn’t been the place that it actually is, I wouldn’t have ended up coming here, even though I did, like, zero research before about, “Oh, do I want to go here, or to the Vector Institute, or somewhere else?”

LAURA: So if our listeners want to learn more about what we’ve been talking about — what papers should they read?

JACOB: (Laughs)

LAURA: … and what should they type into Google?

JACOB: Um. So, I think the…. Fff. Good question. So, they could look up… so the original paper that — at least in my eyes, the original paper that sparked this — actually, the two original papers… One was by this physicist named Miles Stoudenmire, who I mentioned before [1], and then also there was a paper called “Exponential Machines” [2] which was written from a very different perspective, and sort of a non-physics-y take on it, but under the hood, the actual mechanics that were going on were basically the same in both of these papers. Different applications, slightly different choices in design, but essentially the same model.

And then there’s a lot of good introductions to tensor networks from the standpoint of physics. I would encourage people to look at those. There’s a set of notes called “Handwaving and Interpretive Dance: An Introduction to Tensor Networks” [5].

LAURA: (Laughs)

JACOB: I can share the actual links with you. I’m probably butchering the names. There’s a lot of nice, beginner-friendly introductions to tensor networks. But again, the problem with that is it’s all written with the mindset that these are… a particular way of applying these to quantum mechanical systems, whereas in reality the mathematical formalism is much more general than that. So, yeah. Check it out.

LAURA: Cool. Thanks!

In next week’s episode, we’ll learn about causality in reinforcement learning with Mila PhD student Amy Zhang.