Memorizing vs. Understanding (read: Data vs. Knowledge)

Original article was published on Artificial Intelligence on Medium

So how could I get the result of the arithmetic operation e?

e = 3 * (5 + 2)

Well, there are two ways: (i) if I’m lucky, and lazy (think: efficiency), I could have the value of e stored (as data) in some hashtable (a data dictionary), where I can use a key to pick up the value of e anytime I need it (figure 1);

Figure 1. A data dictionary with key and value of arithmetic expressions.

(ii) if I do not have that option, then the only alternative is to actually compute the arithmetic expression and get the corresponding value. The first method, let’s call it the data/memorization method, does not require me to know how to compute e, while the second does. That is, in using the second method I (or the computer!) must know the procedures of addition and multiplication, shown in figure 2 below (where Succ is the ‘successor’ function that returns the next natural number).

Figure 2. Theoretical definition of the procedures/functions of addition and multiplication
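As a minimal sketch of the kind of definitions Figure 2 describes, here is one way to write addition and multiplication using only a successor function. It assumes the usual recursive (Peano-style) formulation, which may differ in detail from the figure; the names succ, add and mul are illustrative, and n - 1 is used as a Python convenience for "the number whose successor is n".

```python
def succ(n: int) -> int:
    """Successor: return the next natural number."""
    return n + 1


def add(m: int, n: int) -> int:
    """Add n to m by applying succ to m exactly n times."""
    if n == 0:
        return m
    return succ(add(m, n - 1))


def mul(m: int, n: int) -> int:
    """Multiply m by n by adding m to a running total n times."""
    if n == 0:
        return 0
    return add(mul(m, n - 1), m)


# e = 3 * (5 + 2), computed rather than looked up
e = mul(3, add(5, 2))
print(e)  # 21
```

Note that mul is written entirely in terms of add, which is why multiplication can only be defined once addition is available.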

That is, if the value of e is not memorized (and stored in some data storage), then the only way to get the value of e is to know that adding n to m is essentially adding n 1’s to m, and that multiplying m by n is adding m to itself n times (and thus ‘multiplication’ can be defined only after the more primitive function ‘addition’ is defined). Crucially, then, the first method is limited to the data I have seen and memorized (i.e., stored in memory), while the second method does not have this limitation. In fact, once I know the procedures of addition and multiplication (and other operations), I’m ready for an infinite number of expressions. So we could, at this early juncture, describe the first method as “knowing what (is the value)” and the second method as “knowing how (to compute the value)”: the first is fast (not to mention easy) but limited to the data I have seen and memorized (stored), and the second is not limited but requires knowledge (knowing how). The first, if you like, is data-driven, and the second is knowledge-based (do these terms sound familiar?). The inquisitive reader must have realized by now the real reason behind the superiority of the second method, besides the fact that it is not limited to what I have seen and memorized/stored. That crucial difference is that the second method subsumes the first, but the converse is not true. That is, if I know how then I know what: if I know how to compute the value then I can always store/save it, but knowing a value that is stored somewhere does not mean I know how to compute it!

Having the knowledge (procedure, algorithm) to perform some computations implies (or subsumes) having access to the (data) result of the computation, but the reverse is clearly not true.
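The asymmetry can be sketched in a few lines. This is an illustration under assumptions that are not in the original: expressions are encoded as nested tuples (op, left, right), and the names memorized and compute are hypothetical.

```python
def compute(expr):
    """'Knowing how': evaluate a nested (op, left, right) expression."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    a, b = compute(left), compute(right)
    if op == "+":
        return a + b
    if op == "*":
        return a * b
    raise ValueError(f"unknown operator: {op!r}")


seen = ("*", 3, ("+", 5, 2))       # 3 * (5 + 2)
unseen = ("*", 30, ("+", 21, 5))   # 30 * (21 + 5)

# 'Knowing what': a table holding only the expressions already seen.
memorized = {seen: 21}

print(memorized.get(seen))     # 21
print(memorized.get(unseen))   # None: never stored, so there is no answer
print(compute(unseen))         # 780

# Knowing how subsumes knowing what: any computed value can be stored.
memorized[unseen] = compute(unseen)
```

The last line goes in one direction only: the procedure can always extend the table, but nothing in the table can reconstruct the procedure.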

If this were the end of our story, there would be no debate in AI as to which approach (the data-driven or the knowledge-based) is the right one, since the data-driven approach, thus far, seems to be limited to what we have seen and would require us to memorize and store a potentially infinite number of values, which is of course implausible. But the story is not, of course, that simple.

But What if I Saw Lots and Lots of Examples?

If the data-driven (‘memorize and store’) paradigm were that simple it would not have gained such attention, even becoming the dominant computing paradigm! On the contrary. The data-driven approach has a slick, coherent and seemingly intuitive story to tell, and it goes like this: if I saw lots and lots of arithmetic expressions, coupled with their values, I would, over time, and using a ‘smart’ algorithm, figure out the pattern of computation, and I would thus be able to get the values of unseen expressions from then on. In short, I would essentially learn (from the data) how to do addition and multiplication without knowing the details of the corresponding procedure/function, or rule (by the way, the definitions of the functions given above are essentially rules. For example, the definition of ‘addition’ could’ve been written as shown below, where (a => b) is read as if a then b):

But can this be achieved? Can we ‘learn’ (or discover) the procedures of addition and multiplication by seeing enough expressions and their values? Well, yes and no, depending on the type of data and on the precision required.

Data-Driven Approaches Learn how to APPROXIMATE functions

Yes, we could learn a procedure/function from seeing lots of examples, but this ‘yes’ has many qualifications. First, if the object under consideration is an infinite object, then the best we can hope for is an approximation, and not the exact function. Second, the only functions we can approximate are continuous functions, because what we will essentially learn is to find some hyperplane in an n-dimensional space that would cover all (most!) of the data points we have seen, and (that’s the catch) it would, with luck, be a hyperplane that also covers new and unseen data points. In simple terms, what we essentially ‘learn’ is an approximation of an infinite hashtable (data dictionary) like the one we showed in the introduction. That virtual infinite data dictionary would essentially be stored in the weights of our optimized network. But, again, what we learned is an approximation of our functions, and not the exact functions themselves. And the how to (the knowledge) is not stored as weights in the network (the parameters), weights that have been optimized by our favorite learning algorithm (say, Backprop/SGD). Still, that is an impressive accomplishment.
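To make the idea of approximation concrete, here is a deliberately tiny sketch. It is not a neural network: it fits a straight line (the one-dimensional case of a hyperplane) by ordinary least squares to examples of x * x, then compares its predictions with the exact values. The function names and the choice of training range are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


train_x = list(range(11))             # the examples 'seen': 0..10
train_y = [x * x for x in train_x]    # exact values of x * x

slope, intercept = fit_line(train_x, train_y)


def predict(x):
    """Value the fitted line gives for x (an approximation, not a computation)."""
    return slope * x + intercept


for x in (5, 20):                     # 5 is inside the seen range, 20 is outside
    print(x, "exact:", x * x, "approximated:", predict(x))
```

A richer model would track the seen range far more closely than a line does, but the object it learns is still a fitted approximation rather than the multiplication procedure itself, and the gap typically grows once inputs leave the range of the examples.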

So, Where is the Problem?

There are two problems. The data-driven approach essentially memorized (let’s say learned) how to compute the value of new expressions based on the many examples it saw. But here are the limitations of the data/memorization paradigm:

(1) When dealing with infinite domains, we could never memorize (in the weights) an infinite function, and thus all of our computations are approximations. (Of course, if we could have an infinite amount of memory and infinite time we could compute anything, right? 🙂)

(2) As a corollary of the above, new and unseen expressions given to the network will not ‘really’ be computed; they will be approximated. If we are doing image or speech recognition, we can live with some noise, but in many computations we expect exact results, so no approximation can be tolerated and this model of computation cannot be used.

Yes, infinity is the villain of data-driven approaches. That is why I labelled the data-driven approach to natural language understanding the “chasing infinity” paradigm: it is essentially trying the futile approach of memorizing an infinite object, namely language! (How many gigabytes and GPUs did the latest GPT-3 use? By the time GPT-2000 is out we will have ML-induced carbon emission.)

Now how is this Relevant to AI and Understanding?

Now here’s the implication of all of the above for AI (beyond the limitations of the data-driven approach in infinite domains). I claim that

no student can really understand how 30 * (21 + 5) comes out to 780 without knowing what the answer to, for example, 3 * (4 + 60) is.

Because you understand how addition and multiplication work, you can always compute the result of any arithmetic expression, and ‘any’ here refers to the infinite number of arithmetic expressions we might encounter. Now why is that important? Prominent cognitive scientists, logicians and philosophers of science have long argued that human thought (and, by extension, language) is systematic; that is, systematicity is a property of language/thought, and that property is related to compositionality. Specifically, these scientists suggest that no one can entertain the thought (or equivalently, understand the meaning of) John loves Mary, for example, without being able to entertain the thought (or equivalently, understand the meaning of) Mary loves John, John loves John, etc. Much like the case of arithmetic expressions, if one understands how the composite thought (or the composite meaning) of John loves Mary was determined from the components and the specific structure of the composition, then one can apply the same procedure to another construction that has the same (types of) components. On the other hand, if the meaning of John loves Mary was simply stored somewhere (i.e., memorized, seen in the data), then that does not guarantee that we have access to the meaning of Mary loves John. In other words, if the paradigm is not to understand how X, but to see/observe and memorize X, then we can never hope to ‘understand’ language this way. Understanding language means knowing how, and knowing how in infinite domains requires a procedure/a rule.
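The same contrast can be sketched for the John loves Mary example. This is a toy illustration, not a claim about how language understanding actually works: the Loves structure, the interpret function and the three-word sentence pattern are all assumptions introduced here.

```python
from typing import NamedTuple


class Loves(NamedTuple):
    """A toy 'meaning' for sentences of the form '<subject> loves <object>'."""
    lover: str
    beloved: str


def interpret(sentence: str) -> Loves:
    """Compose a meaning from the parts of the sentence and their order."""
    subject, verb, obj = sentence.split()
    if verb != "loves":
        raise ValueError(f"unsupported verb: {verb!r}")
    return Loves(lover=subject, beloved=obj)


# Memorization: only the sentence that was seen has a stored meaning.
memorized = {"John loves Mary": Loves("John", "Mary")}

print(memorized.get("Mary loves John"))   # None
print(interpret("Mary loves John"))       # Loves(lover='Mary', beloved='John')
print(interpret("John loves John"))       # Loves(lover='John', beloved='John')
```

The rule handles every sentence with the same kinds of components because it works from structure, whereas the table covers only what was stored.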

Well, (if you read this article) I hope you acted like the intelligent beings we are (as opposed to a neural network) and did not memorize any data, but got the point of the knowledge I think I was conveying 🙂