Source: Deep Learning on Medium
A brief history of Neural Networks
7 years. Think about that. It’s been just 7 years, and we’ve absolutely revolutionized the way we look at the capabilities of machines, the way we build software, and the way we think about creating products and companies (just ask any VC or startup founder). Tasks that seemed impossible a decade ago have become tractable, provided you have the appropriate labeled dataset and compute power. That alone tells us about the power of data and the possibilities it has opened up.
Can Machines Think?
The question of whether it is possible for machines to think has a long history, which is firmly entrenched in the distinction between dualist and materialist views of the mind. René Descartes prefigures aspects of the Turing test in his 1637 Discourse on the Method when he writes:
How many different automata or moving machines can be made by the industry of man […] For we can easily understand a machine’s being constituted so that it can utter words, and even emit some responses to action on it of a corporeal kind, which brings about a change in its organs; for instance, if touched in a particular part it may ask what we wish to say to it; if in another part it may exclaim that it is being hurt, and so on. But it never happens that it arranges its speech in various ways, in order to reply appropriately to everything that may be said in its presence, as even the lowest kind of man can do.
Here Descartes notes that automata are capable of responding to human interactions but argues that such automata cannot respond appropriately to things said in their presence in the way that any human can. Descartes therefore prefigures the Turing test by defining the insufficiency of appropriate linguistic response as that which separates the human from the automaton. Descartes fails to consider the possibility that future automata might be able to overcome such insufficiency, and so does not propose the Turing test as such, even if he prefigures its conceptual framework and criterion.

In 1936, philosopher Alfred Ayer considered the standard philosophical question of other minds: how do we know that other people have the same conscious experiences that we do? In his book Language, Truth and Logic, Ayer suggested a protocol to distinguish between a conscious man and an unconscious machine: “The only ground I can have for asserting that an object which appears to be conscious is not really a conscious being, but only a dummy or a machine, is that it fails to satisfy one of the empirical tests by which the presence or absence of consciousness is determined.” (This suggestion is very similar to the Turing test, but is concerned with consciousness rather than intelligence. Moreover, it is not certain that Ayer’s popular philosophical classic was familiar to Turing.) In other words, a thing is not conscious if it fails the consciousness test.
Alan Turing and the Imitation Game
Alan Mathison Turing was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist. Turing was highly influential in the development of theoretical computer science, providing a formalisation of the concepts of algorithm and computation with the Turing machine, which can be considered a model of a general-purpose computer. Turing is widely considered to be the father of theoretical computer science and artificial intelligence.
The Turing test, developed by Alan Turing in 1950, is a test of a machine’s ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation is a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel such as a computer keyboard and screen so the result would not depend on the machine’s ability to render words as speech. If the evaluator cannot reliably tell the machine from the human, the machine is said to have passed the test. The test results do not depend on the machine’s ability to give correct answers to questions, only how closely its answers resemble those a human would give.
The test was introduced by Turing in his 1950 paper, “Computing Machinery and Intelligence”, while working at the University of Manchester (Turing, 1950; p. 460). It opens with the words: “I propose to consider the question, ‘Can machines think?’” Because “thinking” is difficult to define, Turing chooses to “replace the question by another, which is closely related to it and is expressed in relatively unambiguous words.” Turing’s new question is: “Are there imaginable digital computers which would do well in the imitation game?” This question, Turing believed, is one that can actually be answered. In the remainder of the paper, he argued against all the major objections to the proposition that “machines can think”.
Since Turing first introduced his test, it has proven to be both highly influential and widely criticised, and it has become an important concept in the philosophy of artificial intelligence. Some of these criticisms, such as John Searle’s Chinese room, are controversial in their own right. Turing, in particular, had been tackling the notion of machine intelligence since at least 1941 and one of the earliest-known mentions of “computer intelligence” was made by him in 1947. In Turing’s report, “Intelligent Machinery”, he investigated “the question of whether or not it is possible for machinery to show intelligent behaviour” and, as part of that investigation, proposed what may be considered the forerunner to his later tests:
It is not difficult to devise a paper machine which will play a not very bad game of chess. Now get three men as subjects for the experiment. A, B and C. A and C are to be rather poor chess players, B is the operator who works the paper machine. … Two rooms are used with some arrangement for communicating moves, and a game is played between C and either A or the paper machine. C may find it quite difficult to tell which he is playing.
The Imitation Game
Turing’s original article describes a simple party game involving three players. Player A is a man, player B is a woman and player C (who plays the role of the interrogator) is of either sex. In the imitation game, player C is unable to see either player A or player B, and can communicate with them only through written notes. By asking questions of player A and player B, player C tries to determine which of the two is the man and which is the woman. Player A’s role is to trick the interrogator into making the wrong decision, while player B attempts to assist the interrogator in making the right one. Turing then asks: What will happen when a machine takes the part of A in this game? Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between a man and a woman? These questions replace our original, “Can machines think?”
1950–1970 : The Dawn of Machine Learning
In 1951, Marvin Minsky and Dean Edmonds built the first neural network machine capable of learning, the SNARC. It was in 1957, however, that the perceptron model was developed: Frank Rosenblatt invented it while working at the Cornell Aeronautical Laboratory. The invention of the perceptron generated a great deal of excitement and was widely covered in the media.
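Rosenblatt’s perceptron reduces to a thresholded weighted sum plus an error-driven weight update. A minimal sketch in plain Python (the toy AND dataset, learning rate, and epoch count are illustrative, not from Rosenblatt’s original hardware):

```python
def perceptron_train(samples, epochs=10, lr=0.1):
    """Train a single perceptron with the classic error-driven update rule."""
    n = len(samples[0][0])
    w = [0.0] * n          # weights
    b = 0.0                # bias
    for _ in range(epochs):
        for x, target in samples:
            # Threshold activation: fire 1 if the weighted sum exceeds 0
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = target - y
            # Nudge the decision boundary toward misclassified points
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# Toy linearly separable problem: logical AND
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = perceptron_train(data)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
print([predict(x) for x, _ in data])  # learns AND: [0, 0, 0, 1]
```

Because AND is linearly separable, the perceptron convergence theorem guarantees this loop settles on a correct boundary; Minsky and Papert’s later critique hinged on problems like XOR, where no such boundary exists.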
The term machine learning was coined in 1959 by Arthur Samuel. Tom M. Mitchell provided a widely quoted, more formal definition of the algorithms studied in the machine learning field: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.” This definition of the tasks in which machine learning is concerned offers a fundamentally operational definition rather than defining the field in cognitive terms. This follows Alan Turing’s proposal in his paper “Computing Machinery and Intelligence”, in which the question “Can machines think?” is replaced with the question “Can machines do what we (as thinking entities) can do?”. In Turing’s proposal the various characteristics that could be possessed by a thinking machine and the various implications in constructing one are exposed.
The major breakthrough came when the Parallel Distributed Processing Group at the University of California, San Diego, led by Rumelhart and McClelland (1986), proposed the backpropagation (BP) algorithm, which allowed training of multilayer neural networks and overcame the limitations of linear separability. Multilayer networks could now be trained, and in theory they are remarkably powerful: with enough hidden units they can approximate essentially any regression or discrimination mapping. In 1989, Yann LeCun et al. applied the standard backpropagation algorithm, which had existed as the reverse mode of automatic differentiation since 1970, to a deep neural network with the purpose of recognizing handwritten ZIP codes on mail. The algorithm worked, but training required 3 days.
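At its core, backpropagation is just the chain rule applied layer by layer. A toy sketch with one sigmoid hidden unit and one linear output (all weights and inputs are made-up numbers), checking the analytic gradient against a finite-difference estimate:

```python
import math

def forward(w1, w2, x):
    h = 1.0 / (1.0 + math.exp(-w1 * x))   # hidden sigmoid unit
    return w2 * h, h                       # linear output, plus the hidden value

def loss(w1, w2, x, t):
    y, _ = forward(w1, w2, x)
    return 0.5 * (y - t) ** 2              # squared error

def backprop(w1, w2, x, t):
    """Analytic gradients via the chain rule; this is all backprop is."""
    y, h = forward(w1, w2, x)
    dy = y - t                  # dL/dy
    dw2 = dy * h                # dL/dw2
    dh = dy * w2                # dL/dh, error propagated back one layer
    dw1 = dh * h * (1 - h) * x  # sigmoid' = h * (1 - h)
    return dw1, dw2

# Sanity check against a numerical (central-difference) gradient
w1, w2, x, t, eps = 0.5, -0.3, 1.2, 1.0, 1e-6
g1, g2 = backprop(w1, w2, x, t)
n1 = (loss(w1 + eps, w2, x, t) - loss(w1 - eps, w2, x, t)) / (2 * eps)
n2 = (loss(w1, w2 + eps, x, t) - loss(w1, w2 - eps, x, t)) / (2 * eps)
print(abs(g1 - n1) < 1e-6, abs(g2 - n2) < 1e-6)  # True True
```

The same recipe, repeated over millions of weights, is what made Rumelhart and McClelland’s multilayer networks trainable.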
In 1982, David Marr, a British neuroscientist, published an influential book, “Vision: A computational investigation into the human representation and processing of visual information”. Building on the ideas of Hubel and Wiesel (who discovered that vision processing doesn’t start with holistic objects), Marr gave us the next important insight: he established that vision is hierarchical. The vision system’s main function, he argued, is to create 3D representations of the environment so we can interact with it. Marr’s representational framework for vision includes:
- A Primal Sketch of an image, where edges, bars, boundaries etc., are represented (this is clearly inspired by Hubel and Wiesel’s research);
- A 2½D sketch representation where surfaces, information about depth and discontinuities on an image are pieced together;
- A 3D model that is hierarchically organized in terms of surface and volumetric primitives.
David Marr’s work was groundbreaking at the time, but it was very abstract and high-level. It didn’t contain any information about the kinds of mathematical modeling that could be used in an artificial visual system, nor did it mention any kind of learning process. Around the same time, the Japanese computer scientist Kunihiko Fukushima, also deeply inspired by Hubel and Wiesel, built a self-organizing artificial network of simple and complex cells that could recognize patterns and was unaffected by position shifts. The network, the Neocognitron, included several convolutional layers whose (typically rectangular) receptive fields had weight vectors (known as filters). These filters slide across 2D arrays of input values (such as image pixels) and, after performing certain calculations, produce activation maps (2D arrays) that are used as inputs for subsequent layers of the network. Fukushima’s Neocognitron is arguably the first neural network to deserve the moniker deep; it is a grandfather of today’s convnets.
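The sliding-filter computation described above can be sketched in a few lines of pure Python; the tiny image and the vertical-edge filter below are made up for illustration:

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (valid padding, stride 1)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            # Element-wise multiply the patch under the filter, then sum
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

# A 4x4 image with a vertical dark-to-bright edge, and an edge-detecting filter
img = [[0, 0, 1, 1]] * 4
edge_filter = [[-1, 1],
               [-1, 1]]
print(conv2d(img, edge_filter))  # → [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The response peaks exactly where the edge sits, regardless of its vertical position, which is the position-shift invariance the Neocognitron’s simple cells exploited.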
A few years later, in 1989, a young French scientist, Yann LeCun, applied a backprop-style learning algorithm to Fukushima’s convolutional neural network architecture. After working on the project for a few years, LeCun released LeNet-5, the first modern convnet, which introduced some of the essential ingredients we still use in CNNs today. Like Fukushima before him, LeCun applied his invention to character recognition and even released a commercial product for reading zip codes. Besides that, his work resulted in the creation of the MNIST dataset of handwritten digits, perhaps the most famous benchmark dataset in machine learning.
The LeNet Architecture (1998)
LeNet was one of the very first convolutional neural networks, and it helped propel the field of deep learning. This pioneering work by Yann LeCun was named LeNet-5 after many previous successful iterations dating back to 1988. At the time, the LeNet architecture was used mainly for character recognition tasks such as reading zip codes and digits. The bank check recognition system that LeCun helped develop was widely deployed by NCR and other companies, reading over 10% of all the checks in the US in the late 1990s and early 2000s. LeNet is an important architecture because, before it, character recognition had been done mostly with hand-engineered features, followed by a machine learning model trained to classify those features. LeNet made hand-engineering features redundant, because the network learns the best internal representation from raw images automatically.
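How each LeNet layer shrinks the spatial size can be traced with the standard output-size formula. A minimal sketch, assuming the classic 32×32 input, 5×5 valid convolutions, and 2×2 stride-2 subsampling:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output side length of a conv/pool layer on a square input."""
    return (size + 2 * pad - kernel) // stride + 1

# Classic LeNet-5 on a 32x32 grayscale digit:
size = 32
size = conv_out(size, kernel=5)             # C1: 5x5 conv      -> 28
size = conv_out(size, kernel=2, stride=2)   # S2: 2x2 subsample -> 14
size = conv_out(size, kernel=5)             # C3: 5x5 conv      -> 10
size = conv_out(size, kernel=2, stride=2)   # S4: 2x2 subsample -> 5
print(size)  # → 5
```

The resulting 5×5 maps (16 of them at S4) are what feed LeNet-5’s fully connected layers, where most of the classifier’s weights live.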
However, due to the unavailability of large datasets, no major neural network models were developed until the 2010s. Until then, most approaches to object recognition relied on classic machine learning methods. To improve their performance, one can collect larger datasets, learn more powerful models, or use better techniques for countering overfitting. As mentioned earlier, however, until 2010 datasets of labeled images were relatively small, on the order of tens of thousands of images, which could only support simple recognition tasks such as handwritten digit recognition.
AlexNet (2012) and the Rebirth of “Deep Learning”
ImageNet runs a yearly challenge (ILSVRC) for the classification of images across 1000 categories. In 2012, AlexNet easily outperformed all prior competitors and won the ImageNet challenge, reducing the top-5 error from 26% to 15.3%. The second-place top-5 error rate, from a non-CNN approach, was almost twice as high at 26.2%. AlexNet was designed by Alex Krizhevsky, Geoffrey Hinton, and Ilya Sutskever. Due to the hardware limitations of 2012 (far less powerful GPUs), the proposed architecture was split into two parallel convolutional pipelines, one per GPU. Present-day implementations of AlexNet combine the two halves of the network (top and bottom) into one. Overall, AlexNet had an architecture very similar to LeNet, but it was deeper, with more filters per layer and with stacked convolutional layers followed by max pooling; it also used dropout, data augmentation, ReLU activations, and stochastic gradient descent with momentum for updates. It attached ReLU activations after every convolutional and fully connected layer.
GoogLeNet / Inception V1 (2014)

The winner of the ILSVRC 2014 competition was GoogLeNet (Inception V1) from Google. It achieved a remarkable top-5 error rate of 6.67%. This network introduced an interesting feature dubbed the Inception module, in which the input is processed by several parallel branches whose outputs are concatenated. This was one of the first CNN architectures that really strayed from the general approach of simply stacking conv and pooling layers on top of each other in a sequential structure. It also used image distortions for data augmentation and RMSprop for parameter updates (batch normalization arrived with later Inception versions). Despite being 22 layers deep, this CNN reduced the number of parameters from 60 million in AlexNet to a mere 4 million.
The Inception Module
In the Inception module figure from the paper, the bottom green box is the input and the top one is the module’s output (rotating the figure 90 degrees lets you line it up with diagrams of the full network). Basically, at each layer of a traditional ConvNet, you have to make a choice of whether to have a pooling operation or a conv operation (there is also the choice of filter size). What an Inception module allows you to do is perform all of these operations in parallel. In fact, this was exactly the “naïve” idea that the authors came up with.
GoogLeNet was one of the first models to show that CNN layers don’t always have to be stacked up sequentially. With the Inception module, the authors demonstrated that a creative structuring of layers can lead to improved performance and computational efficiency. This paper set the stage for the remarkable architectures of the following years.
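The arithmetic behind the module can be made concrete. A sketch using the branch widths reported for GoogLeNet’s first Inception block (“3a”), with 192 input channels and bias terms ignored; the 1×1 “reduce” convolutions before the 3×3 and 5×5 branches are what keep the parameter count small:

```python
def inception_params(c_in, n1x1, n3x3r, n3x3, n5x5r, n5x5, pool_proj):
    """Weight count and output channels for one Inception module (biases ignored)."""
    b1 = c_in * n1x1                       # branch 1: 1x1 conv
    b2 = c_in * n3x3r + n3x3r * n3x3 * 9   # branch 2: 1x1 reduce, then 3x3 conv
    b3 = c_in * n5x5r + n5x5r * n5x5 * 25  # branch 3: 1x1 reduce, then 5x5 conv
    b4 = c_in * pool_proj                  # branch 4: 3x3 max-pool, 1x1 projection
    out_channels = n1x1 + n3x3 + n5x5 + pool_proj  # branch outputs concatenated
    return b1 + b2 + b3 + b4, out_channels

# Inception (3a): 192 input channels, branch widths 64 / 96->128 / 16->32 / 32
params, out_c = inception_params(192, 64, 96, 128, 16, 32, 32)
print(params, out_c)  # → 163328 256
```

Dropping the two reduce layers and convolving the full 192 channels directly would cost 192·128·9 + 192·32·25 weights for those branches alone, more than doubling the module, which is the efficiency argument the paper makes.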
The runner-up at ILSVRC 2014, dubbed VGGNet by the community, was developed by Karen Simonyan and Andrew Zisserman of the University of Oxford. Simplicity and depth are what this 2014 model exploited best, reaching a 7.3% top-5 error rate. VGGNet strictly used 3×3 filters with stride and padding of 1, along with 2×2 max-pooling layers with stride 2, and came in variants with 16 and 19 weight layers (the popular VGG-16 has 13 convolutional and 3 fully connected layers). Quite similar to AlexNet in spirit but with many more filters, it remains widely used to this day thanks to its very uniform structure. The weights of VGGNet are publicly available and have been used in many other applications and challenges as a baseline feature extractor. However, VGGNet has about 138 million parameters, which makes it a challenge to handle.
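The 138-million figure can be reproduced from VGG-16’s published configuration (thirteen 3×3 conv layers in five blocks, then three fully connected layers; bias terms included):

```python
# VGG-16 ("configuration D"): conv widths per block, 'M' marks a 2x2 max pool
cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
       512, 512, 512, 'M', 512, 512, 512, 'M']

params, c_in = 0, 3                     # start from a 3-channel RGB input
for v in cfg:
    if v == 'M':
        continue                        # pooling layers have no weights
    params += c_in * v * 3 * 3 + v      # 3x3 kernel weights + biases
    c_in = v

# Three fully connected layers on the final 7x7x512 feature volume
for n_in, n_out in [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]:
    params += n_in * n_out + n_out

print(params)  # → 138357544, the ~138 million parameters mentioned above
```

Note that roughly 124 million of those parameters sit in the fully connected layers, which is why later architectures replaced them with global pooling.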
Finally, at ILSVRC 2015, the Residual Neural Network (ResNet) by Kaiming He et al. at Microsoft introduced a novel architecture with “skip connections” and heavy use of batch normalization. This network was finally able to surpass human-level accuracy on the ImageNet classification task. The skip connections are identity shortcuts, reminiscent of the gated units of highway networks but with no gates to learn. Thanks to this technique, the authors were able to train a network with 152 layers while still having lower complexity than VGGNet.
The idea behind a residual block is that the input x goes through a conv-relu-conv series, giving some F(x). That result is then added back to the original input x; call the result H(x) = F(x) + x. In a traditional CNN, H(x) would just be equal to F(x). So instead of computing the whole transformation straight from x, the block computes only the term F(x) that has to be added to the input. In other words, the mini-module computes a “delta”, a slight change to the original input x that yields a slightly altered representation (in a traditional CNN, going from x to F(x) produces a completely new representation that keeps no explicit copy of the original x). The authors believe that “it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping”. ResNet achieved a top-5 error rate of 3.57%.
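The computation H(x) = F(x) + x is just an element-wise add of the branch output back onto its input. A scalar-level sketch (the toy two-step F below stands in for ResNet’s actual conv-relu-conv stack and is purely illustrative):

```python
def relu(v):
    return [max(0.0, x) for x in v]

def F(x, w1, w2):
    """Toy residual branch: scale -> ReLU -> scale (stands in for conv-relu-conv)."""
    h = relu([w1 * xi for xi in x])
    return [w2 * hi for hi in h]

def residual_block(x, w1, w2):
    fx = F(x, w1, w2)
    # The skip connection: add the input back onto the branch output
    return [fi + xi for fi, xi in zip(fx, x)]

x = [1.0, -2.0, 3.0]
print(residual_block(x, w1=0.0, w2=0.0))  # → [1.0, -2.0, 3.0]
```

With all weights at zero the block reduces to the identity, so an untrained (or unneeded) layer does no harm; each block only has to learn the delta away from identity, which is why 152-layer stacks remained optimizable.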