Time to put an end to BERTology: (or, ML/DL is not even relevant to NLU)

Original article was published by Walid Saba, PhD on Deep Learning on Medium

There are 3 technical (read: theoretical, scientific) reasons why the data-driven/quantitative/statistical/machine learning approaches to natural language understanding (NLU) are futile efforts. I will collectively refer to these (essentially similar) approaches by what has been brilliantly referred to as BERTology. This is a big claim, I understand, especially given the current trend and the massive amount of money that the tech giants are spending on this flawed paradigm. As I have repeatedly made this claim in my writings (either publications or posts on this blog), I have often been told “but could all of these people be wrong?”. Well, for now I will simply reply with “yes, they could indeed all be wrong”. I say that armed with the wisdom of the great mathematician/logician Bertrand Russell, who once said:

The fact that an opinion has been widely held is no evidence whatsoever that it is not utterly absurd

Before we begin, it is important to emphasize that our discussion is directed to the use of BERTology in NLU, and the ‘U’ here is crucial — that is, and as will become obvious below, BERTology might be useful in some NLP tasks (such as text summarization, search, extraction of key phrases, text similarity and/or clustering, etc.) because these tasks are all some form of ‘compression’ that machine learning can be successfully applied to. However, we believe that NLP (which is essentially just text processing) is a completely different problem from NLU (you may want to read this short article that discusses this specific point).

So, to summarize our introduction: the claim that we will defend here is that BERTology is a futile approach to NLU (in fact, it is irrelevant), and that this claim is not about some NLP tasks but is specific to the true understanding of ordinary spoken language, the kind of understanding we perform on a daily basis when we engage in dialogue with people we don’t even know, or with non-experts and even young children who do not have any domain-specific knowledge!

Finally, I will make the final argument (hopefully a proof) against the utility of BERTology in NLU after I discuss a few phenomena. Once that is done, the final step will be easily accomplished.

Now we can get down to business.

MTP — the Missing Text Phenomenon

Let us start first with describing a phenomenon that is at the heart of all challenges in natural language understanding, which we refer to as the “missing text phenomenon” (MTP).

Linguistic communication happens as shown in the image above: a thought is encoded by a speaker into some linguistic utterance (in some language), and the listener then decodes that linguistic utterance into (hopefully) the thought that the speaker intended to convey! It is that “decoding” process that is the ‘U’ in NLU — that is, understanding the thought behind the linguistic utterance is exactly what happens in the decoding process. And this is precisely why NLU is difficult. Let’s elaborate.

In this complex communication there are two possible alternatives for optimization, or for effective communication: (i) the speaker can compress (and minimize) the amount of information sent in the encoding of the thought and hope that the listener will do some extra work in the decoding (uncompressing) process; or (ii) the speaker will do the hard work and send all the information needed to convey the thought, which would leave the listener with little to do (see this article for a full description of this process). The natural evolution of this process, it seems, has resulted in the right balance, where the total work of both speaker and listener is optimized. That optimization resulted in the speaker encoding the minimum possible information that is needed, while leaving out everything else that can be assumed to be available to the listener. The information we tend to leave out is usually information that we can safely assume to be available to both speaker and listener, and this is precisely the information that we usually call common background knowledge.

To appreciate the intricacies of this process, consider the following (unoptimized) communication:

It should be very obvious that we certainly do not communicate this way. In fact, the above thought can be expressed (by a human to a human) this way:

This much shorter message conveys the same thought as the longer one (which is still not fully compressed, by the way!) because we all know

That is, for effective communication, we do not say what we can assume we all know! This is also precisely why we all tend to leave out the same information: because we all know what everyone knows, and that is the “common” background knowledge. This genius optimization process that humans have developed in about 200,000 years of evolution works quite well, and precisely because we all know what we all know. But this is where the problem is in AI/NLU. Machines don’t know what we leave out, because they don’t know what we all know. The net result? NLU is very, very difficult, because a software program can only fully understand the thoughts behind our linguistic utterances if it can somehow “uncover” all that stuff that humans assume and leave out in their linguistic communication. That, really, is the NLU challenge (and not parsing, stemming, POS tagging, etc.). In fact, here are some well-known challenges in NLU, with (just some of) the missing text highlighted, showing the reason why we have these challenges in NLU:

All the above well-known challenges in NLU are due to the fact that the challenge is to discover (or uncover) that information that is missing and implicitly assumed as shared and common background knowledge.

Now that we are (hopefully) convinced that NLU is difficult because of MTP — that is, because our ordinary spoken language in everyday discourse is highly (if not optimally) compressed, and thus the challenge in “understanding” is in uncompressing (or uncovering) the missing text, I can state the first technical reason why BERTology is not relevant to NLU.

BERTology is not relevant to NLU (1)

The equivalence between (machine) learnability (ML) and compressibility (COMP) has been mathematically established. That is, it has been established that learnability from a data set can only happen if the data is highly compressible (i.e., it has lots of redundancies) and vice versa. But MTP tells us that NLU is about uncompressing. What we now have is the following:

End of proof 1.

Intension (with an ‘s’)

Intension is another phenomenon I want to discuss, before I get to the second proof that BERTology is not even relevant to NLU. I will start with what is known as the meaning triangle, shown below with an example:

Thus every “thing” (or every object of cognition) has three parts: a symbol that refers to the concept, the concept itself, and (sometimes) actual instances of the concept. I say sometimes, because the concept “unicorn” has no “actual” instances, at least in the world we live in! The concept itself is an idealized template for all its potential instances (and thus it is close to the idealized Forms of Plato!). You can imagine how philosophers, logicians and cognitive scientists might have debated for centuries the nature of concepts and how they are defined. Regardless of that debate, we can agree on one thing: a concept (which is usually referred to by some symbol) is defined by a set of properties and attributes, and perhaps by additional axioms and established facts, etc. Nevertheless, a concept is not the same as its actual (imperfect) instances. This is also true in the perfect world of mathematics. So, for example, while the arithmetic expressions below all have the same extension, they have different intensions:

Thus, while all the expressions evaluate to 16, and thus are equal in one sense (their VALUE), this is only one of their attributes. In fact, the expressions above have several other attributes, such as their syntactic structure (that’s why (a) and (d) are different), number of operators, number of operands, etc. The VALUE (which is just one attribute) is called the extension, while the set of all the attributes is the intension. While in applied sciences (engineering, economics, etc.) we can safely consider these objects to be equal if they are equal in the VALUE attribute only, in cognition (and especially in language understanding) this equality fails! Here’s one simple example:

Suppose that (1) is true, that is, suppose (1) actually happened, and we witnessed it. Still, that does not mean we can assume (2) is true, although all we did was replace ‘16’ in (1) by a value that is (supposedly) equal to it. So what happened? We replaced one object in a true statement by an object that is supposedly equal to it, and we have inferred from something that is true something that is not! Well, what happened is this: while in the physical sciences we can easily replace an object by one that is equal to it in one attribute, this does not work in cognition! Here’s another example:

We obtained (2), which is ridiculous, by simply replacing ‘the tutor of Alexander the Great’ by a value that is equal to it, namely Aristotle. Again, while ‘the tutor of Alexander the Great’ and ‘Aristotle’ are equal in one sense, these two objects of thought are different in many other respects.
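The extension/intension distinction described above for arithmetic expressions can be made concrete with a small sketch. The expressions below are hypothetical stand-ins (they are not the ones from the original figure), and the choice of attributes (value, syntactic structure, number of operators, number of operands) simply follows the attributes named in the text. Treat it as an illustration under those assumptions, not a formal definition of intension.

```python
import ast

# Hypothetical stand-ins for the article's expressions; each evaluates to 16.
EXPRESSIONS = ["16", "2 ** 4", "8 + 8", "4 * 4", "(4 * 2) + 8"]


def extension(expr):
    """Return the VALUE of an arithmetic expression (its extension)."""
    tree = ast.parse(expr, mode="eval")
    # Evaluate with no builtins available; only literal arithmetic is expected.
    return eval(compile(tree, "<expr>", "eval"), {"__builtins__": {}})


def intension(expr):
    """Return a bundle of attributes, of which the value is only one."""
    tree = ast.parse(expr, mode="eval")
    nodes = list(ast.walk(tree))
    operators = sorted(type(n.op).__name__ for n in nodes if isinstance(n, ast.BinOp))
    operands = sorted(n.value for n in nodes if isinstance(n, ast.Constant))
    return {
        "value": extension(expr),
        "structure": ast.dump(tree.body),  # syntactic structure
        "operators": operators,
        "num_operands": len(operands),
    }


if __name__ == "__main__":
    for e in EXPRESSIONS:
        print(e, "->", intension(e))
    # Equal in one attribute (value), yet not interchangeable as objects.
    print(extension("8 + 8") == extension("4 * 4"))
    print(intension("8 + 8") == intension("4 * 4"))
```

Comparing only `extension` treats all five expressions as the same object, which is the substitution that fails in the examples above; comparing `intension` keeps them apart.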

I’ll stop here with the discussion of what ‘intension’ is and why it’s important in high-level reasoning, and specifically in NLU. The interested reader can look at this short article, which includes references to additional material.

So, what is the point of this discussion on ‘intension’? Natural language is rampant with intensional phenomena, since objects of thought, which language conveys, have an intensional aspect that cannot be ignored. But BERTology, in all its variants, is a purely extensional system that can deal with numeric values only, and thus it cannot model or account for intension, and thus it cannot model various phenomena in natural language.

End of proof 2.

Statistical Significance (or, computational plausibility)

BERTology is essentially a paradigm that is based on finding some patterns (correlations) in the data. Thus the hope in that paradigm is that there are statistically significant differences between various phenomena in natural language; otherwise they will be considered essentially the same. But consider the following:

Note that antonyms/opposites such as ‘small’ and ‘big’ (or ‘open’ and ‘close’, etc.) occur in the same contexts with equal probabilities. As such, (1a) and (1b) are statistically equivalent, yet even for a 4-year-old (1a) and (1b) are considerably different: “it” in (1a) refers to “the suitcase” while in (1b) it refers to “the trophy”.

Let us see how many examples we need if one insists on using BERTology to learn how to correctly resolve “it” in such structures. First of all, in BERTology there is no notion of type (and no symbolic knowledge whatsoever, for that matter). Thus the following are all different:

That is, in BERTology there is no type hierarchy that allows us to generalize and consider ‘bag’, ‘suitcase’, ‘briefcase’, etc. as all subtypes of some ‘container’. Thus, each one of the above, in a purely data-driven paradigm, is different and must be ‘seen’ separately. If we add to the semantic differences all the minor syntactic differences to the above pattern (say, changing ‘because’ to ‘although’, which also changes the correct referent of “it”) then a rough calculation tells us that a BERTology system would need to see something like 40,000,000 variations of the above, and all of this just to resolve a reference like “it” in structures like the one in (1). If anything, this is computationally implausible. As Fodor and Pylyshyn once famously quoted the renowned cognitive scientist George Miller, to capture all the syntactic and semantic variations that an NLU system would require, the number of features a neural network might need is more than the number of atoms in the universe! (I would recommend this classic and brilliant paper to anyone interested in cognitive science; it is available here).
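The contrast between seeing every surface variation separately and generalizing through a type hierarchy can be sketched as follows. Everything here is a toy assumption: the hierarchy, the slot fillers, and the sentence frame (modeled on the trophy/suitcase structure) are illustrative, and the tiny counts are not meant to reproduce the rough 40,000,000 estimate, only to show that surface variants multiply across slots while typed patterns do not.

```python
from itertools import product

# Hypothetical toy type hierarchy (child -> parent); names are illustrative.
SUPERTYPE = {
    "bag": "container",
    "suitcase": "container",
    "briefcase": "container",
    "container": "artifact",
    "trophy": "artifact",
}


def is_a(word, type_name):
    """Walk up the hierarchy to check whether `word` is a subtype of `type_name`."""
    current = word
    while current is not None:
        if current == type_name:
            return True
        current = SUPERTYPE.get(current)
    return False


# Assumed slot fillers for the frame:
# "The <object> did not fit in the <container> <connective> it was too <adjective>."
OBJECTS = ["trophy", "ball", "laptop"]
CONTAINERS = ["bag", "suitcase", "briefcase"]
CONNECTIVES = ["because", "although"]
ADJECTIVES = ["small", "big"]


def surface_variants():
    """Every distinct string a purely data-driven learner would have to 'see'."""
    return [
        f"The {o} did not fit in the {c} {k} it was too {a}."
        for o, c, k, a in product(OBJECTS, CONTAINERS, CONNECTIVES, ADJECTIVES)
    ]


def container_types():
    """With a hierarchy, all the container nouns collapse into a single type."""
    return {"container" if is_a(c, "container") else c for c in CONTAINERS}


if __name__ == "__main__":
    print(len(surface_variants()))  # 3 * 3 * 2 * 2 = 36 surface strings
    print(container_types())        # one type stands in for three nouns
```

Adding one more noun to any slot multiplies the number of surface strings, whereas a rule stated over the type `container` is unaffected by how many subtypes exist.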

To conclude this section, often there is no statistical significance in natural language and the different interpretations are not due to anything that is in the data but to information that is available elsewhere (if x does not fit in y then larger(y, x) is more likely than larger(x, y), etc.) In short, the only source of difference in interpretation in BERTology must be found in the data, but very often that difference is not even in the data and you cannot find what is not there.
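As a rough illustration of where the deciding information lives, the sketch below resolves “it” using one explicit piece of background knowledge rather than anything found in the sentence. It assumes the familiar trophy/suitcase sentence frame; the function names and the `EXPLANATIONS` table are hypothetical, and the rule is a hand-written stand-in for the larger(y, x) preference mentioned above, not a general coreference algorithm.

```python
# Background knowledge, stated outside the data: if x does not fit in y,
# the plausible explanations are "x is too big" or "y is too small".
EXPLANATIONS = {"big": "contained", "small": "container"}


def resolve_it(contained, container, adjective):
    """Resolve 'it' in: The <contained> did not fit in the <container>
    because it was too <adjective>."""
    role = EXPLANATIONS[adjective]
    return contained if role == "contained" else container


def context_without_adjective(sentence, adjective):
    """The surface context left once the antonym itself is removed."""
    return [w for w in sentence.split() if w.strip(".") != adjective]


if __name__ == "__main__":
    s_small = "The trophy did not fit in the suitcase because it was too small."
    s_big = "The trophy did not fit in the suitcase because it was too big."
    # The two sentences share exactly the same context around the antonym ...
    print(context_without_adjective(s_small, "small")
          == context_without_adjective(s_big, "big"))
    # ... yet the referent differs, and the rule above accounts for it.
    print(resolve_it("trophy", "suitcase", "small"))
    print(resolve_it("trophy", "suitcase", "big"))
```

The two sentences are identical apart from the antonym, so whatever distinguishes their interpretations has to come from the background rule, which is the point of this section.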

End of proof 3.


I have discussed in some detail in this article three reasons why BERTology is irrelevant to NLU (although it might be used in text processing tasks that are essentially compression tasks). Each of the above three reasons is enough on its own to put an end to this runaway train called BERTology.

Language is not just data.