Source: Deep Learning on Medium
This post on how Deep Learning might be hitting its limitations, written by Thomas Nield, just made my day and pointed me in the right direction with regard to the first part of the paper I’m writing.
I want the introduction of the paper to be a sort of continuation of some of the ideas I put out there in this post, and Nield has several links throughout his post that make my case against Explainability and reach conclusions in my line of work. Moreover, he is nice enough to reference very powerful literature. This post is a continuation (perhaps a perversion) of Nield’s thoughts.
I feel validated by Nield’s post and I too share his concern that perhaps another AI Winter is coming. In addition to Nield’s historical materialism, I wish to add a little flavor to his analysis on the matter of AI Winters, and go perhaps a little bit more mythological. I wish to apply the Golden Calf Syndrome as derived from my Golden Calf Paradox.
Particularly, I wish to focus on the YouTube video Nield embeds in his post as a template for my historical recount, which adds minor details to the story told by Nield. The video is from the 1960s and features two very sexy gentlemen: Claude Shannon (father of Information Theory) and Oliver Selfridge (creator of the Pandemonium Architecture).
Shannon pushes intelligent machines and is depicted playing checkers with a computer around minute 0:55, and at 1:40 you can see Selfridge in a display of techno-pundit rhetoric saying “I believe that machines will do, what men do when we say we are thinking”. From minute 2:02 the video features a machine capable of doing Russian-English translation. Let’s review these three moments.
A Bit of History
Computing machines have been around since the 17th century with inventions such as the Pascaline, and in the 1800s with the Arithmometer. However, we must keep in mind that computers and computing machines are different. Unlike fixed-purpose machines like the Pascaline or the Arithmometer, Alan M. Turing proposed a single Universal Machine (a Turing Machine) that can execute algorithms. Modern-day computers are based on Turing’s ideas, along with a lot of other contributions. However, I wish to stress the point that Turing was concerned with formalizing a notion that has been around since the times of the Greeks (and even before): algorithms.
Turing’s construction is based on logic. His original problem was a logical, constructive approach to the Entscheidungsproblem. He published his paper On Computable Numbers in 1936. His work is intertwined with that of his best frenemy Alonzo Church. Their construction is formal, and it’s logical.
Turing is a UK war hero; his work on breaking the Enigma cipher was crucial during WWII. The story of an all-purpose computing machine that defeats Nazis was just begging for some follow-up, perhaps its own comic book. In his 1950 paper Computing Machinery and Intelligence, Turing takes a critical stance on the possibility of machines thinking and deconstructs (logically) any claims (he even includes a section on theology). Sadly, what is mostly remembered from that paper is the Turing Test for Artificial Intelligence.
Checkers playing Shannon (or the other way around)
When Shannon is depicted playing checkers, I refer the reader to the work of Arthur Samuel and particularly to his 1959 paper Some Studies in Machine Learning Using the Game of Checkers. Samuel’s paper details a strategy to break down the process of learning Checkers using game trees, a strategy that can, in principle, be carried out exhaustively.
Samuel’s approach is a rudimentary attempt at recommendation systems, and is related to Deep Blue (see the controversy surrounding Deep Blue vs. Kasparov). Samuel’s approach does not differ from the tradition of rationalist computation, but he does bring forth an issue: brute-force evaluation of all possible choices, which is logically the best-case scenario. See the following quote from a paper by Marvin Minsky in 1960:
But it is essential to observe that a comparator by itself, however shrewd, cannot alone give any improvement over exhaustive search. The comparator gives us information about partial success, to be sure. But we need also some way of using this information to direct the pattern of search in promising directions; to select new trial points which are in some sense “like,” or “similar to,” or “in the same direction as” those which have given the best previous results. To do this we need some additional structure on the search space. This structure need not bear much resemblance to the ordinary spatial notion of direction, or that of distance, but it must somehow tie together points which are heuristically related.
Mr. Minsky will come around again later. He is crucial in destroying the work of Oliver Selfridge.
Keeping Tabs on the Russians…
With regard to the Russian-English automatic translation, note that Wikipedia, in its article on AI Winter, points out that one episode in the AI Winter story is the failure of machine translation. Here we must focus on the ALPAC report of 1966. Let’s concentrate on the following fragment:
Such research must make use of computers. The data we must examine in order to find out about language is overwhelming both in quantity and in complexity. Computers give promise of helping us control the problems relating to the tremendous volume of data, and to a lesser extent the problems of data complexity. But, we do not yet have good, easily used, commonly known methods for having computers deal with language data. Therefore, among the important kinds of research that need to be done and should be supported are (1) basic developmental research in computer methods for handling language, as tools for the linguistic scientist to use as a help to discover and state his generalizations, and as tools to help check proposed generalizations against data; and (2) developmental research in methods to allow linguistic scientists to use computers to state in detail the complex kinds of theories (for example, grammars and theories of meaning) they produce, so that the theories can be checked in detail.
This fragment points out the strength of computers as information crunchers, yet it states that language is much more complex, a view also held by many linguists and cognitive scientists (e.g. Noam Chomsky and Marvin Minsky).
Even one of the Committee members, J. R. Pierce (also a major contributor to Information Theory), said the following on Machine Translation:
I concur with your view of machine translation, that at present it serves no useful purpose without postediting, and that with postediting the over-all process is slow and probably uneconomical. As to the possibility of fully automatic translation, I am convinced that we will some day reach the point where this will be feasible and economical.
However, there is considerable basic knowledge required that we simply don’t have at the moment, and it is anybody’s guess how soon this knowledge can be obtained. However, I am dedicated to trying to obtain some of this knowledge. The question as to whether fully automatic translation will ever be economical must wait until we see whether it is possible at all. I feel that if it is possible, then it will be economical in the future because of the rapid advances in computer technology.
The techniques used in this first attempt were impractical, yet Natural Language Processing is very much alive nowadays and has become an interdisciplinary field. The non-linearity of language and its multiple hierarchies of interacting categories, in addition to phenomena like ambiguity, still constitute major constraints on the progress of Natural Language Processing.
Moreover, for every advance made within the technical fields of AI and NLP, cognitive scientists, linguists and others find countless counterexamples and contradictions to existing theories.
Oliver Selfridge: Pundit Extraordinaire
Let’s now observe Oliver Selfridge’s techno-pundit moment around 1:40. I find it interesting that his paper Pandemonium: A Paradigm for Learning turns out to be a rudimentary presentation of present-day neural networks (a little less linear algebra perhaps), but it’s in the neighborhood of Deep Learning. Check the introduction:
At the bottom the data demons serve merely to store and pass on the data. At the next level the computational demons or sub-demons perform certain more or less complicated computations on the data and pass the results of these up to the next level, the cognitive demons who weigh the evidence as it were. Each cognitive demon computes a shriek, and from all the shrieks the highest level demon of all, the decision demon, merely selects the loudest.
His paper is on the translation between Morse Code and typewriting using a network of demons (present-day artificial neurons) that work in a hierarchy. His schematic depiction of his demon organization (the pandemonium) resembles a lot of diagrams of Neural Networks.
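To make the hierarchy in Selfridge's introduction a bit more concrete, here is a minimal sketch of the four levels of demons. Everything in it is my own assumption for illustration: the six-letter Morse table, the matching features and the scoring rule are hypothetical and are not taken from Selfridge's paper, and nothing here learns anything.

```python
from typing import Dict, List, Tuple

# Hypothetical toy alphabet (an assumption for illustration only).
MORSE: Dict[str, str] = {
    "E": ".", "T": "-", "A": ".-", "N": "-.", "S": "...", "O": "---",
}


def data_demons(signal: str) -> List[str]:
    """Bottom level: merely store and pass on the data."""
    return list(signal)


def computational_demons(symbols: List[str]) -> Dict[str, Tuple[int, int]]:
    """Middle level: compute simple features for each candidate letter.

    For every letter, return (matching positions, length mismatch).
    """
    features = {}
    for letter, pattern in MORSE.items():
        matches = sum(1 for p, s in zip(pattern, symbols) if p == s)
        gap = abs(len(pattern) - len(symbols))
        features[letter] = (matches, gap)
    return features


def cognitive_demons(features: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Each cognitive demon weighs its evidence and computes a shriek."""
    return {letter: matches - gap for letter, (matches, gap) in features.items()}


def decision_demon(shrieks: Dict[str, int]) -> str:
    """Top level: merely select the loudest shriek."""
    return max(shrieks, key=shrieks.get)


def pandemonium(signal: str) -> str:
    """Pass a signal up through the whole hierarchy of demons."""
    symbols = data_demons(signal)
    return decision_demon(cognitive_demons(computational_demons(symbols)))


print(pandemonium(".-"))
```

In this sketch the weights of the cognitive demons are fixed by hand. The part of Selfridge's proposal that the hill-climbing discussion is about is precisely how those weights could be adjusted automatically.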
At the end of the paper there is a very interesting transcript of feedback and opinions on Selfridge’s work, and a final reply from the man himself. There are several questions regarding some (mathematical) aspects of Selfridge’s demons.
The most incisive is perhaps the one regarding a very technical problem informally dubbed hill-climbing. Let’s see a part of Selfridge’s response to the feedback:
“I maintain that there is some merit in studying self-improvement systems and if I am going to do that I am going to study systems where I know what I want, rather than more difficult problems, however unattractive they may be to mathematicians. Mathematicians like to work on unsolved problems for the greater glory, and problems already solved like the prime number theorem are left to graduate students. I have a suspicion that John McCarthy might later bring up some important points about descriptions, and here I see my point about useful problems, because Morse code is a useful problem”
I assume that the John McCarthy referred to here is the same John McCarthy who advocated for Mathematical Logic as a foundation of AI and developed Lisp (check this Wikipedia article on him). We will get back to McCarthy later.
Now, to explain hill-climbing we must turn Selfridge’s pandemonium into Pandemonium Mining Corp. The data demons chip data as miners take ore from the earth. That ore is passed on to the computational demons, whose responsibility is to process the ore to separate the metal. Meanwhile, the cognitive demons do the metalworking: they take the raw metal, melt it and shape it into a finished product. The decision demon at the top works as the market that buys the best quality finished product (for Selfridge: the demon that shrieks the loudest).
However, this best product is defined as a function of the market. Thus, the derivative of this function allows the market to tell the metalworkers that they need to improve their product. They change their casting and molding methods. In turn, metalworkers tell processors that they need better quality metal, and processors diligently refine their frothing, sublimation and melting methods, up until they are forced to tell the miners that they have to mine better ore. This optimization (the training process) proceeds iteratively, until the market is satisfied with the finished product (a loss minimum is reached).
However, coming back to Marvin Minsky, he points out in his 1961 paper Steps Toward Artificial Intelligence that changes in the different departments of Pandemonium Mining Corp. are not exempt from complications:
Obviously, the gradient-following hill-climber would be trapped if it should reach a local peak which is not a true or satisfactory optimum. It must then be forced to try larger steps or changes. It is often supposed that this false-peak problem is the chief obstacle to machine learning by this method. This certainly can be troublesome. But for really difficult problems, it seems to us that usually the more fundamental problem lies in finding any significant peak at all. Unfortunately the known E functions for difficult problems often exhibit what we have called the “Mesa Phenomenon” in which a small change in a parameter usually leads to either no change in performance or to a large change in performance. The space is thus composed primarily of flat regions or “mesas.” Any tendency of the trial generator to make small steps then results in much aimless wandering without compensating information gains.
In terms of Pandemonium Mining Corp., this translates to three major issues: (1) changing any part of the ore-to-market operation may not produce any significant change in the market’s appreciation, (2) changing any part of the operation may come at the expense of messing up our final product beyond the market’s approval and (3) production units may wander aimlessly without knowing how to actually optimize.
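Both complaints can be reproduced with a few lines of code. The following is a minimal sketch of a greedy hill-climber in one dimension. The two score functions (`two_peaks` and `mesa`) are toy landscapes I made up for illustration; they are not Minsky's E functions, and the step size is an arbitrary assumption.

```python
from typing import Callable


def hill_climb(score: Callable[[float], float], x: float,
               step: float = 0.25, max_iters: int = 1000) -> float:
    """Greedy hill-climbing: move to a neighbour only if it strictly improves."""
    for _ in range(max_iters):
        best = max((x - step, x + step), key=score)
        if score(best) <= score(x):
            return x  # no neighbour is better: a peak, or a flat mesa
        x = best
    return x


def two_peaks(x: float) -> float:
    """Toy landscape: local peak of height 1 at x = -2, true peak of height 3 at x = 2."""
    return max(1 - (x + 2) ** 2, 3 - (x - 2) ** 2)


def mesa(x: float) -> float:
    """Toy landscape: flat everywhere, with a sudden jump in performance past x = 5."""
    return 1.0 if x > 5 else 0.0


# Starting on the wrong slope, the climber stops at the false peak.
print(hill_climb(two_peaks, -3.0))
# Starting on the right slope, it finds the true peak.
print(hill_climb(two_peaks, 1.0))
# On the mesa, small steps give no information, so it never moves.
print(hill_climb(mesa, 0.0))
```

The first call illustrates the false-peak problem; the last one illustrates the mesa: the climber has no gradient to follow, so it cannot even tell in which direction the cliff lies.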
As pure gossip, the paper I quote by Minsky features in the references of a paper by John McCarthy from 1959 titled Programs with Common Sense.
Though McCarthy’s comments on Selfridge’s work are undoubtedly flattering (he even ventures to elaborate on a Demon Zeitgeist within the Pandemonium, introducing a pandemonium unconscious), I would take his comments not as flattery but as irony, and I make my case by re-reading Selfridge’s response (I speak through my own wounds; I’ve had my fair share of debates with engineers).
In any case, Selfridge’s approach was later revised and criticized (see the criticism in this Wikipedia article). His work bears resemblance to another paradigm of learning, connectionism, which is also highly contested. As Wikipedia points out, the abandonment of connectionism is one of those episodes in the AI Winter.
Under connectionism, learning is the result of self-organization pressed by the minimization of something (e.g. an error functional). The caveats that Minsky points out with hill-climbing are also valid for the Deep Learning paradigm (just my opinion). In addition, McCarthy’s own logical approach to AI and his language Lisp were key components of AI technologies in the 1980s (Lisp Machines) such as expert systems (notice that IBM’s hot mess Deep Blue is a distant relative of Expert Systems).
Curiously, Marvin Minsky was left out of Symbolics Inc., a company that ventured into commercializing Lisp Machines (featuring specialized Lisp-executing hardware). The crucial mistake Symbolics Inc. made was devoting themselves to hardware instead of software. However, Symbolics managed to pull several other projects back from the fire, concerning for example Computer Graphics and Computer Simulation.
Finally: Some Hefty Calves
In his post, Nield points us to two works by Gary Marcus: (1) his critical appraisal of Deep Learning and (2) his post in defense of skepticism about Deep Learning. From the critical appraisal, I would like to quote the following paragraph to explain exactly where the sinning begins for us in AI:
“The dominant approach in deep learning is hermeneutic, in the sense of being self-contained and isolated from other, potentially usefully knowledge. Work in deep learning typically consists of finding a training database, sets of inputs associated with respective outputs, and learn all that is required for the problem by learning the relations between those inputs and outputs, using whatever clever architectural variants one might devise, along with techniques for cleaning and augmenting the data set. With just a handful of exceptions, such as LeCun’s convolutional constraint on how neural networks are wired (LeCun, 1989), prior knowledge is often deliberately minimized.”
Marcus introduces the term hermeneutics to the Computer Sciences community, a term that might be unfamiliar to the STEM crowd, but that is pretty mainstream in social sciences. We could infer from Marcus’ text that hermeneutics is a dirty word. However, it has already been proposed as an alternative, especially in novel sciences such as Artificial Intelligence (see the works of Gordana Dodig-Crnkovic, Peter Wegner and even Paul Feyerabend’s infamous Against Method). I think Marcus misuses the term; the word he is actually looking for is solipsism. The current approach of Deep Learning is deeply solipsistic, as it presents us with the golden calf as new gods.
Though I regard the hype surrounding Deep Learning as extremely positive to dynamize the field, maximize funding and oxygenate computer sciences in general, I concur with Marcus when he says we need to chill, otherwise this is going to be a gazillion-dollar downward spiral. As history has shown here and in Nield’s post, crackdowns on AI have always come from outside disciplines. In particular, cognitive sciences and mathematics have been a pain in the ass for the connectionist paradigm (now revamped as Deep Learning). Though I claim that a more structured approach to Deep Learning is needed and that the Epistemology of Deep Learning is just messed up, something tells me that, as in the case of Lisp, praying to the old god of logically based computation is not going to take us very far (who the hell uses Lisp?).
Particularly, Marcus points out something very important in his critical appraisal, a confusion of causation and correlation:
“If it is a truism that causation does not equal correlation, the distinction between the two is also a serious concern for deep learning. Roughly speaking, deep learning learns complex correlations between input and output features, but with no inherent representation of causality. A deep learning system can easily learn that height and vocabulary are, across the population as a whole, correlated, but less easily represent the way in which that correlation derives from growth and development (kids get bigger as they learn more words, but that doesn’t mean that growing tall causes them to learn more words, nor that learning new words causes them to grow). Causality has been central strand in some other approaches to AI (Pearl, 2000) but, perhaps because deep learning is not geared towards such challenges, relatively little work within the deep learning tradition has tried to address it.”
Truism, solipsism, hermeneutics: those are words that go beyond the discourse of Science and venture into philosophy. AI as a working field cannot be left to just one tribe (logicians, information theoreticians, software developers, cognitive scientists, etc.). It’s a team effort.
See that the Lisp project was totally grounded in classical computational approaches, as was the work of A. Samuel and perhaps L.G. Valiant in his theory of the learnable (I did a post on that). However, that did not stop the first AI Winter with the ALPAC report or the Lighthill Report of 1972. Mixing in elements of classical computation based on logic did not stop the atonement; such concoctions, as in the case of the Lisp Machines, turn out to be false idols.
In addition, consider what happened with connectionism: the McCulloch-Pitts neuron was cannibalized within the field by virtue of the XOR function and a lack of neurobiological foundation. Math and biological sciences have played the role of the Levites in past AI Winters. Revising the literature of Shannon, Selfridge and Von Neumann, there’s a far stretch from randomized events to intelligence. That Golden Calf of Connectionism and Pandemonia as the new god will bring forth another AI Winter.
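For readers who have not met the XOR objection before, here is a minimal sketch of what it says. I assume a single threshold unit with two inputs, and I search a small, arbitrary grid of weights and thresholds (the grid and the function names are my own choices for illustration). The search is not a proof, but it shows the pattern: a unit that computes AND turns up, while none turns up for XOR, which is not linearly separable.

```python
from itertools import product
from typing import Callable, Optional, Tuple

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def threshold_unit(w1: float, w2: float, theta: float, x1: int, x2: int) -> int:
    """A single threshold neuron: fire (1) if the weighted sum reaches theta."""
    return 1 if w1 * x1 + w2 * x2 >= theta else 0


def find_unit(target: Callable[[int, int], int]) -> Optional[Tuple[float, float, float]]:
    """Search a coarse grid for (w1, w2, theta) reproducing the target function."""
    grid = [i / 2 for i in range(-8, 9)]  # -4.0 to 4.0 in steps of 0.5
    for w1, w2, theta in product(grid, repeat=3):
        if all(threshold_unit(w1, w2, theta, x1, x2) == target(x1, x2)
               for x1, x2 in INPUTS):
            return (w1, w2, theta)
    return None  # nothing on the grid works


print("AND:", find_unit(lambda a, b: a & b))
print("XOR:", find_unit(lambda a, b: a ^ b))
```

The XOR search comes back empty for any grid: the unit would need a positive threshold, each weight at or above it, and yet their sum below it, which is impossible.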
Notice that in Exodus Chapter 32, it was anxiety that drove the Israelites to construct a Golden Calf. Moses had left them; their line to God was lost. We cannot ignore the fact that the video posted by Nield is classical Cold War tactics, and that many computational methods like Monte Carlo, Operations Research methods and Von Neumann’s automata are linked to the Manhattan Project. In a world frightened by terrorism, decaying states and fragile finances, angst is in every corner.
In the story of the Golden Calf, Aaron (Moses’ brother) was the enabler of this delusion (he was the one who cast the Calf). As a scientist I tell my fellow scientists: don’t be an Aaron. On the other hand, in the story there’s one small line from Joshua (Ex. 32:17); he says
“There is the sound of war in the camp.”
Also, fellow scientists: don’t be a Joshua. In Ex. 32:18 Moses replies to Joshua
“It is not the sound of victory, it is not the sound of defeat;
it is the sound of singing that I hear.”
I was discussing this topic with my husband the other day, and he said a sentence that I think serves as a perfect ending for this very long and convoluted post:
the epistemological debate is served.