Deep learning review


In this post we will go over the review paper on deep learning by Hinton, LeCun and Bengio, all known for their seminal work in this space. The paper was published in 2015, so this review might feel a little dated, but I found a lot of value in it nonetheless. It does a great job of summarizing trends in deep learning as well as the challenges with current techniques. Unlike other papers I have covered in the past, this one doesn’t describe a specific system; it provides a high-level overview of the deep learning space. You will need some familiarity with machine learning for this to be of value.

Motivation for Deep learning

ML is used in a lot of applications these days, such as web search, content classification and filtering, and product recommendation for e-commerce. Traditional supervised learning techniques such as regression or SVMs have been effective, but they require a lot of engineering effort. To build a successful application, features need to be identified carefully with the help of a domain expert, and raw data needs to be cleaned and transformed into feature vectors, which the system can then use for classification.

Deep learning falls into the category of representation learning methods, in which raw data can be fed to a system that then discovers internal representations (features) automatically. Newer and more intricate representations get formed as more non-linear layers are added to the system. A good example of this is detecting objects in images. In a deep layered model, the first layers can identify edges, and later layers can identify motifs formed by those edges. The next set of layers can represent parts of objects, and the final layers detect the familiar objects. One of the great advantages of deep learning is that, at each layer, it doesn’t rely on a very precise feature representation: edges can be oriented differently or appear elsewhere in the picture, and the same object can appear at different places in the image. Deep learning models can start with raw pixels and build these intermediate representations using generic techniques, without a lot of manual feature tuning. Because of this, deep learning models can also extract more value as compute and data become more readily available.

Shallow vs Deep Learning

Let’s contrast these two in detail by first going over conventional learning methods.

Conventional shallow learning

In traditional shallow supervised learning methods, such as linear regression or a simple neural net, the model is shown labeled data, e.g. images labeled as containing a tumor or not. The model creator carefully engineers the input data, chooses features that could be interesting, and then feeds the curated feature vectors to the model, which combines them with a set of weights. The model’s output for these weights and features is compared against the desired output to compute an error. The objective of the algorithm is then to tune the weights so as to minimize that error.

The weights are adjusted using the stochastic gradient descent algorithm. Gradients (the partial derivatives of the error with respect to each weight) indicate how to change the weights to reduce the error locally. This is done iteratively until the error cannot be reduced any further. The technique has proved to be very effective at reducing the overall error across all samples in conventional machine learning. Linear classifiers divide the input space into simple regions separated by a hyperplane. To get more non-linearity, one can use non-linear kernels such as the Gaussian kernel, but they don’t always generalize very well.
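To make the weight-update loop concrete, here is a minimal sketch of stochastic gradient descent fitting a one-feature linear model. The function name `sgd_linear_fit`, the learning rate, the epoch count and the toy data are all assumptions chosen for illustration; they do not come from the paper.

```python
import random

def sgd_linear_fit(samples, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x + b with plain stochastic gradient descent."""
    rng = random.Random(seed)
    samples = list(samples)  # copy so the caller's list is not reordered
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(samples)  # "stochastic": visit the samples in random order
        for x, y in samples:
            error = (w * x + b) - y
            # Gradients of the squared error 0.5 * error**2 w.r.t. w and b.
            grad_w = error * x
            grad_b = error
            # Step against the gradient to reduce the error locally.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy, noise-free data generated from y = 2x + 1 (illustration only).
data = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
w, b = sgd_linear_fit(data)
print(round(w, 3), round(b, 3))
```

On this toy data the learned weight and bias should end up very close to the values used to generate it; on real, noisy data the error would only be reduced, not eliminated.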

Deep Learning advantages

Deep learning tries to overcome some of the issues mentioned in the last section. Deep learning models can be tuned to be insensitive to irrelevant variations in the input. In image recognition, it is common to have variations in location (e.g. a dog in the middle of the image vs. in a corner) and illumination (well lit vs. poorly lit). In speech recognition, variations in accent or pitch need to be handled. Deep learning is also able to handle the sensitivity tradeoff: there are cases where we needn’t be sensitive (similar dogs in different locations) and cases where we do need to be sensitive (similar-looking species of dogs in a picture). Traditional shallow classifiers are unable to make these distinctions effectively, since the raw pixel data looks very different in the former case and very similar in the latter. They can only do so when the features extracted from the raw pixels ignore the irrelevant variations and highlight the relevant ones. This is where deep learning comes in and avoids the need for heavy feature engineering: features are learned automatically via representation learning methods that are generic in nature.

Deep learning architectures consist of multiple layers of modules, each one representing some function that gets learned. These modules generally compute non-linear mappings. With 5 to 20 non-linear layers, a system can learn minute but significant differences (similar-looking species of dogs) and can also learn to ignore irrelevant differences such as changes in position, background or lighting.

History and rationale for Backprop

Multilayer networks were largely out of vogue for a long time. There was a lot of research on how to train multilayer networks in the 70s and 80s. The key breakthroughs, at least in understanding, were around using stochastic gradient descent on multilayer networks.

A multilayer neural network consists of an input layer, some hidden layers and an output layer. The input layer represents the input, and the output layer represents the outcome produced by the neural net, such as whether the given image contains a dog, a cat or a horse. Hidden layers apply non-linear functions such as ReLU (relu(z) = max(0, z)). These layers can transform the input into a space, often a higher-dimensional one, in which the data can be separated much more cleanly.

Non-linear cleaner separation achieved using non-linear functions
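As a small illustration of how a non-linear hidden layer can make data cleanly separable, the sketch below uses the classic XOR problem, which no single linear classifier can solve on the raw inputs. The weights are hand-picked rather than learned, and the function names are assumptions made for this example only.

```python
def relu(z):
    return max(0.0, z)

def hidden_layer(x1, x2):
    # Hand-picked (not learned) weights, chosen only for illustration.
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    return h1, h2

def output_layer(h1, h2):
    # A purely linear readout on top of the hidden units.
    return h1 - 2.0 * h2

xor_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
predictions = [output_layer(*hidden_layer(x1, x2)) for x1, x2 in xor_inputs]
print(predictions)  # [0.0, 1.0, 1.0, 0.0]
```

After the hidden layer, the four points become linearly separable, which is why a plain linear readout is enough to reproduce XOR here.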

Gradient descent in such multilayer networks works by starting from the output layer and computing the partial derivatives of the error with respect to the weights that connect it to the previous layer. The same step is then repeated backwards from each hidden layer to the layer before it, all the way to the input. This, it turns out, is nothing but an application of the chain rule of derivatives. This was one of the key advances in the understanding of neural nets. See the figure below:

X is the input layer, y is a single hidden unit and Z is the output layer. If we start from Z and backpropagate to X using the chain rule of derivatives, we can use the resulting gradients to adjust the weights so that the next feedforward pass produces a smaller error at Z.
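The chain-rule computation for a tiny x → y → z network like this can be sketched as follows. The scalar weights `w1` and `w2`, the squared-error loss, the ReLU hidden unit and the learning rate are all assumptions for illustration; the numerical check at the end is only there to confirm that the hand-derived gradients agree with finite differences.

```python
def relu(v):
    return max(0.0, v)

def forward(x, w1, w2):
    y = relu(w1 * x)  # single hidden unit
    z = w2 * y        # output unit
    return y, z

def loss(x, w1, w2, target):
    _, z = forward(x, w1, w2)
    return 0.5 * (z - target) ** 2

def backward(x, w1, w2, target):
    """Gradients of the loss w.r.t. w1 and w2 via the chain rule."""
    y, z = forward(x, w1, w2)
    dE_dz = z - target                    # from E = 0.5 * (z - target)**2
    dE_dw2 = dE_dz * y                    # because z = w2 * y
    dE_dy = dE_dz * w2                    # propagate back to the hidden unit
    dy_dpre = 1.0 if w1 * x > 0 else 0.0  # derivative of ReLU
    dE_dw1 = dE_dy * dy_dpre * x          # chain rule down to the first weight
    return dE_dw1, dE_dw2

x, target = 1.5, 2.0
w1, w2 = 0.8, -0.5
g1, g2 = backward(x, w1, w2, target)

# Finite-difference check of the analytic gradients.
eps = 1e-6
num_g1 = (loss(x, w1 + eps, w2, target) - loss(x, w1 - eps, w2, target)) / (2 * eps)
num_g2 = (loss(x, w1, w2 + eps, target) - loss(x, w1, w2 - eps, target)) / (2 * eps)

# One gradient step should lower the error on the next forward pass.
lr = 0.1
old_loss = loss(x, w1, w2, target)
new_loss = loss(x, w1 - lr * g1, w2 - lr * g2, target)
print(g1, num_g1, g2, num_g2, old_loss, new_loss)
```

Real networks apply the same idea to vectors and matrices of weights, one layer at a time, instead of two scalars.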

If you want to read more on this, I would highly recommend the Hacker’s guide to Neural Networks by Andrej Karpathy. It does a great job of explaining how partial derivatives and backprop lead to optimization of the objective function (reduction in error).

Historical issues with multilayer learning: One issue that was expected to be a problem for gradient descent was that it could get trapped in a poor local minimum. Empirical results and further research suggest that this is not such a big issue in practice, as these systems end up landing on good minima. Another issue was that training large multilayer networks with millions of weights would take a long time. The advent of GPUs and parallelized algorithms has helped in this regard.

To address the earlier issue of careful feature engineering, researchers in the 2000s developed unsupervised learning (pre-training) on unlabeled data that would act as a sensible feature extractor. These automatically extracted features could then be fed into traditional neural networks, which could be trained using standard backprop schemes. Detection of handwritten digits and speech processing benefited from this. One type of feedforward network that generalized really well is the Convolutional Neural Network (ConvNet), which we will cover in the next section.

Convolutional Neural Networks


ConvNet is a type of deep learning network that is especially effective at detecting patterns in images. The key ideas behind the architecture are that images consist of local patches whose values can be highly correlated, and that such patterns can appear independently of position. Deep neural nets are built on the intuition that images are formed by composition of smaller elements, i.e. edges forming motifs, motifs forming objects and so on. Similar compositional attributes are present in speech as well. Deep learning networks are good at exploiting this compositional nature by stacking multiple layers. A typical ConvNet architecture consists of the following layers:

  1. Convolution layer: This layer slides a small window over the image and connects each window position to a neuron. The output is the sum of the products of the weights and the corresponding inputs in the window. A ReLU layer then applies the non-linear function to it. The window can slide with a certain stride, such as 1 or 2, to capture the various patterns that may be present in the image.
  2. Pooling layer: The pooling layer then reduces the dimension of the data coming from the conv layer and creates a smaller sample. E.g. a pool layer can apply max pooling over a region of 2×2 pixels and pick only the max value as the representative for that region. So a 4×4 image can be reduced to 2×2 via such pooling.
  3. Fully connected layer: This is the same as a traditional fully connected layer; it converts these intermediate layers to the final output, such as the class scores in a classification problem.

A patch of an eye, across its RGB channels, undergoes convolution and ReLU first. The next layer performs max pooling. Typically 10–20 layers can be stacked, with millions of weights.
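The convolution, ReLU and max-pooling steps described above can be sketched in a few lines of plain Python. This is a simplified single-channel version under stated assumptions: a square image, a square kernel, no padding, and a hand-picked vertical-edge kernel rather than learned weights. The helper names are hypothetical.

```python
def conv2d(image, kernel, stride=1):
    """Slide a square kernel over a square image (no padding)."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0.0
            # Weighted sum of the inputs inside the current window.
            for di in range(k):
                for dj in range(k):
                    total += kernel[di][dj] * image[i * stride + di][j * stride + dj]
            row.append(total)
        out.append(row)
    return out

def relu_map(fmap):
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Keep only the largest value in each non-overlapping size x size region."""
    return [
        [max(fmap[i + di][j + dj] for di in range(size) for dj in range(size))
         for j in range(0, len(fmap[0]) - size + 1, size)]
        for i in range(0, len(fmap) - size + 1, size)
    ]

# 5x5 toy image with a bright vertical band; values are illustrative only.
image = [[0, 0, 1, 1, 0] for _ in range(5)]
# Hand-picked kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = conv2d(image, kernel)  # 4x4
activated = relu_map(feature_map)    # negative responses are clipped to 0
pooled = max_pool(activated)         # 4x4 -> 2x2
print(pooled)  # [[2.0, 0.0], [2.0, 0.0]]
```

The 4×4 feature map shrinks to 2×2 after pooling, matching the example in the list. A real ConvNet would learn many such kernels per layer and operate on all colour channels at once.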

Despite prior successes, ConvNets had been largely out of favor in the mainstream ML community. This began to change with the ImageNet competition in 2012. Researchers applied ConvNets to a dataset of about a million images spanning 1,000 object categories and almost halved the error rates of the best competing approaches at the time. Traditionally ConvNets were thought to be very expensive to train, but the advent of GPUs, ReLUs, efficient regularization techniques such as dropout, and data augmentation (deforming existing inputs to generate more training examples) has made training them far more practical. ConvNets also seem amenable to efficient hardware implementations such as FPGAs.

There is a lot of research in the industry on ConvNets now. Some of the classic applications of this technique include detecting faces in an image, detecting street signs, reading checks and detecting patterns in biological images.

Recurrent Neural Networks

RNNs are another type of deep network, useful for processing sequential inputs such as text or speech. Unlike feedforward networks, they process an input sequence one element at a time. One of the main applications of RNNs has been to predict the next character in a word or the next word in a sequence. Similar applications also exist in speech processing. One of the interesting applications of RNNs is generating captions or meaningful text for images by feeding the output of a ConvNet run on an image as input to an RNN.

Traditionally, it was hard to train RNNs with backprop because the gradients would either vanish or explode, but newer techniques have made more complex uses of RNNs possible. RNNs can now be used to extract the “thought” expressed by an image or a sentence, which can then be fed into another RNN for translation into a different language.

As the RNN model is unfolded, it becomes apparent how backprop can be used to train such a model. All the states s share the same weights, and each one captures some past state that is fed as input to the next state.

Hidden units in an RNN end up capturing the past state of the system implicitly. As RNNs are unfolded, they can be thought of as neural networks with a lot of layers, where all layers share the same weights. Research suggests that it is very hard for them to store state that reflects real long-term dependencies. That led to the advent of LSTM (long short-term memory) networks.
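A minimal sketch of an unfolded RNN makes the weight sharing visible: the same two weights are reused at every time step, and the state carries information forward. The scalar weights `w_in` and `w_rec`, the tanh non-linearity and the input sequence are assumptions chosen for illustration.

```python
import math

def rnn_step(x_t, s_prev, w_in, w_rec):
    # New state depends on the current input and the previous state.
    return math.tanh(w_in * x_t + w_rec * s_prev)

def run_rnn(inputs, w_in=0.5, w_rec=0.9, s0=0.0):
    """Unfold the RNN over the sequence; every step shares w_in and w_rec."""
    states = []
    s = s0
    for x_t in inputs:
        s = rnn_step(x_t, s, w_in, w_rec)
        states.append(s)
    return states

# A single "event" at the first step, followed by silence.
states = run_rnn([1.0, 0.0, 0.0, 0.0, 0.0])
print([round(s, 3) for s in states])
```

With these particular weights, the trace of the first input shrinks at every step, which hints at why storing long-term dependencies in the hidden state alone is difficult.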


Researchers have found LSTM networks to be more effective than conventional RNNs in applications like speech recognition. To address the memory issues of RNNs, LSTMs add explicit memory to the neural network. This is done by adding special hidden units that act as accumulators: each has a connection to itself with a weight of one, which copies its own value to the next time step. This self-connection is controlled by another unit, which learns when to clear the memory.
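The accumulator idea can be sketched as a single memory cell whose self-connection is scaled by a gate. This is a deliberately stripped-down illustration, not a full LSTM: the gate values are fixed by hand here, whereas in a real LSTM they are produced by learned units, and the `write_gate` is an extra assumption added so the cell has something to store.

```python
def memory_cell_step(cell, x_t, write_gate, keep_gate):
    # keep_gate = 1.0 copies the old value forward unchanged (self-weight of one);
    # keep_gate = 0.0 clears the memory.
    return keep_gate * cell + write_gate * x_t

def run_cell(inputs, write_gates, keep_gates, cell=0.0):
    history = []
    for x_t, w, k in zip(inputs, write_gates, keep_gates):
        cell = memory_cell_step(cell, x_t, w, k)
        history.append(cell)
    return history

history = run_cell(
    inputs=[3.0, 9.0, 9.0, 5.0],
    write_gates=[1.0, 0.0, 0.0, 1.0],  # store at step 1, ignore steps 2 and 3
    keep_gates=[1.0, 1.0, 1.0, 0.0],   # hold the value, then clear at step 4
)
print(history)  # [3.0, 3.0, 3.0, 5.0]
```

The stored value survives untouched across the middle steps, unlike the fading state of a plain RNN, and is replaced only when the gate clears it.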

Along the lines of adding memory modules, other researchers have proposed Neural Turing Machines, which add associative memory that the network can write to and read from. Such memory-based networks have performed well in question-answering applications, where the network is asked to remember a story and then answer a question about it. Another application of Neural Turing Machines has been learning algorithms: some have learned to sort a dataset when a priority is specified with each item.

Future of Deep learning as of 2015

While supervised learning has had really good success, the authors expect the future of deep learning to be tied to further advances in unsupervised learning. This is closer to how humans seem to learn new ideas and tasks: by observing, not by always being told what each thing is.

On the computer vision side, there seems to be potential in combining ConvNets, RNNs and reinforcement learning. The research was in its early stages at the time but was already showing impressive results. Language understanding can also benefit a lot from advances in RNNs.

The authors indicate that major advances in artificial intelligence will come from combining the representation learning techniques highlighted in this paper with complex reasoning.

Source: Deep Learning on Medium