Source: Deep Learning on Medium

*This is a reading note on the paper **Deep Learning**, which was published in Nature and written by three of the most prestigious scholars in this area: Geoffrey Hinton, Yann LeCun, and Yoshua Bengio.*

**So, What is deep learning anyway?**

The very essence of deep learning, or neural networks, is nothing but a structure with multiple levels of representation. Each layer is composed of many simple but non-linear modules that transform the representation coming from the input or from the lower layers’ output. By stacking enough of these transformations, a deep learning model can learn a very complex function.

However, that description might still not be clear enough for you to understand what a deep neural network is. To thoroughly comprehend this structure, you need to have a gist of what the different layers do.

In the paper, the authors use a very concise paragraph to deliver the insight into each layer’s function. I will quote it directly here, since I feel there is nothing I could change to make it more understandable.

*“…for example, comes in the form of an array of pixel values, and the learned features in the **first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers** would detect objects as combinations of these parts.”*

After reading this paragraph, it is not hard to see that a higher layer summarizes the knowledge learned by the lower layers. This is what we call the ability of ‘hierarchical learning’, and we will elaborate on this point later. The most important takeaway from this paragraph is that deep learning is deep because higher layers can give a more universal and invariant representation of the learning material, whether it is an image, a piece of text, or a voice spectrum sequence. To show this idea, the authors of the paper also take us back to the 1960s to look at the history of linear classifiers.

One of the best ways to understand linear classifiers is to understand how they divide the space of representations. Imagine a very simple task in which we are asked to use lines as boundaries to divide a 2D space so that the red dots and the black dots are fully separated. Learning a linear classifier is nothing more than carrying out this task. However, a real-life learning task is not going to be as friendly as this one, and it raises many problems for the linear classifier family. For instance, the data from different categories may not be linearly separable, or the differences between two classes may be too subtle for a linear classifier to pick up. We have a lot of real-life challenges like the latter, and almost all NLP tasks are very hard for a simple model to learn. This is why a shallow classifier needs a good feature extractor to solve the problem. As an improvement over the linear classifier, deep learning has two major advantages. First, we add non-linear activation functions to our models, so they can represent non-linear structure in the data. Second, a deep learning model can learn the features directly from the data, like the motifs in the images.
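To make the ‘not linearly separable’ point concrete, here is a minimal sketch of my own (it is not from the paper). It trains a classic perceptron, which is a linear classifier, on two toy datasets that I made up: AND, which a single line can separate, and XOR, which no line can. The function names and the training settings are assumptions chosen only for illustration.

```python
def train_perceptron(samples, epochs=100, lr=1.0):
    """Train a linear classifier step(w1*x1 + w2*x2 + b) with the perceptron rule."""
    w1, w2, b = 0.0, 0.0, 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            predicted = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
            error = target - predicted  # 0 when the point is already classified correctly
            w1 += lr * error * x1
            w2 += lr * error * x2
            b += lr * error
    return w1, w2, b


def accuracy(params, samples):
    """Fraction of samples that fall on the correct side of the learned line."""
    w1, w2, b = params
    correct = sum(
        (1 if w1 * x1 + w2 * x2 + b > 0 else 0) == target
        for (x1, x2), target in samples
    )
    return correct / len(samples)


# Toy datasets (hypothetical): AND is linearly separable, XOR is not.
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_accuracy = accuracy(train_perceptron(AND_DATA), AND_DATA)
xor_accuracy = accuracy(train_perceptron(XOR_DATA), XOR_DATA)
print(and_accuracy, xor_accuracy)
```

The perceptron separates AND perfectly, but however long it trains it cannot get every XOR point right, because no single line puts the two classes on opposite sides.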

**Supervised Learning**

**Weight, how do we learn again?**

Though you may now have an intuitive idea of how deep learning works, we still haven’t talked about how this kind of model is implemented in general. Before we move on to those details, I first need to introduce supervised learning. Supervised learning is not a concept that belongs only to deep learning; it is widely used across machine learning as a whole. Supervised learning, as opposed to unsupervised learning, works with labeled data, which means your classifier is told which data belongs to which category during training. If that is still not clear, think about a sack of ancient coins whose face values you don’t know. Let’s assume there are three kinds of coins in the sack: round coins, square coins, and rectangular coins. For supervised learning, I will tell you that round coins have a face value of 5 ancient dollars, square coins 10 ancient dollars, and rectangular coins 2 ancient dollars. You can then train a CNN model to learn their shapes and classify them into three classes: 2 dollars, 5 dollars, and 10 dollars. For unsupervised learning, I won’t tell you their face values at the beginning, but your model can still possibly learn to group them into three classes: round, square, and rectangular coins. In this article, we will focus only on supervised learning, because most of the models introduced here are supervised models.

In supervised learning tasks, learning features is the most important mission of our models, if not the only one, because the models use those features to decide the class of an incoming object. At this point, you might ask what exactly a feature is. I apologize: though we’ve been using the word ‘feature’ all along, I haven’t explained this piece of jargon. Features exist not only in data but also in most natural entities. Humans have a lot of features; for example, we have eyes and ears. If we use only ‘having eyes and ears’ as the feature that represents a human, is that enough for a classification task? The answer is that it depends. If we are using this feature to teach our model to tell humans from apples, then of course ‘having eyes and ears’ is enough. However, if the task requires us to tell humans from dogs, it is certainly not an ideal feature. So you can see how the useful features vary with the task.

In most cases, a single learning model learns multiple features, which means we cannot depend on just one feature to decide the class. That’s when the weights take effect. Let’s go back to the task of classifying humans and dogs, because I really enjoy it. Assume I’m a very bad engineer on such a lovely task, and I only provide you with two features: hair color and body temperature. Suppose I tell you that the next object has golden hair and a body temperature of 38.5 degrees Celsius. Is it a dog or a person? First, if you decide your answer solely on the hair color, how sure will you be? Since both humans and dogs can have golden hair, I’ll put only 20% of the weight on it. According to Wikipedia, a person’s normal body temperature lies between 36.5 and 37.5 degrees Celsius, while a dog’s normal body temperature is about 38.2 to 39 degrees Celsius. You might think that the body temperature alone is enough to give us an answer: since it is 38.5 degrees Celsius, it should be a dog. But the same question comes up again: how sure will you be? I do agree that, compared to hair color, this information helps more, but I don’t think I can decide my answer solely because of it. What if this human has a fever? What if he just came out of a sauna? So I only put 80% of the weight on it.

In general, since humans and dogs tend to have similar hair colors (grey, gold, brown, or black), we cannot rely on this feature too much. However, the body temperatures of humans and dogs normally differ and are not likely to overlap under normal conditions, so we give this feature more weight. The same principle holds for most classification problems: models need to learn how much weight each feature deserves in a given task. If you haven’t gotten lost in my dog and human example, you probably remember from the first section that deep learning models can learn the features by themselves. How do they do that? The answer is the forward pass and backpropagation.
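As a toy illustration of the 20% and 80% weights above, here is a small sketch of my own. The ‘evidence’ numbers are hypothetical values that I picked by hand: 0.5 for golden hair, because it tells us nothing either way, and 1.0 for a temperature of 38.5, because it sits inside the dog range. A real model would learn both the features and the weights from data.

```python
def dog_score(evidence, weights):
    """Weighted sum of per-feature evidence (each in [0, 1]) that the object is a dog."""
    return sum(weights[name] * evidence[name] for name in weights)


# Weights from the example in the text: 20% on hair color, 80% on body temperature.
weights = {"hair_color": 0.2, "body_temperature": 0.8}

# Hand-picked, hypothetical evidence values (assumptions, not learned from data).
evidence = {"hair_color": 0.5, "body_temperature": 1.0}

score = dog_score(evidence, weights)
label = "dog" if score > 0.5 else "human"
print(score, label)
```

The point is only that a heavily weighted feature dominates the decision, while a lightly weighted one barely moves it.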

**Neuron**

A single neuron in a neural network has two major equations.

z_l = sum_k(w_kl * y_k), where y_k are the outputs of the previous layer’s neurons and w_kl are the weights that connect them to neuron l. Each y_k has its own weight w_kl.

y_l = f(z_l), where f is an activation function. In a deep learning structure, the activation function is usually non-linear, such as sigmoid, tanh, or ReLU. Activation functions give the network its non-linearity; without them it is pointless to stack layers, because multiple layers of linear transformations are equivalent to a single linear transformation.

However, don’t forget that each layer has multiple neurons. So for every layer we need to calculate many y values, and each neuron in the next layer receives all of the y values we just calculated.
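The two neuron equations can be sketched for a whole layer at once. This is a minimal example of my own: the helper names (`layer_forward`, `matmul`) and the weight values are made up for illustration, and there is no bias term because the equations above do not use one. It also checks the claim that stacked linear layers collapse into a single linear layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    return z


def layer_forward(inputs, weights, activation):
    """weights[k][l] connects input k to neuron l; returns y_l = f(sum_k w_kl * y_k)."""
    n_out = len(weights[0])
    return [
        activation(sum(weights[k][l] * inputs[k] for k in range(len(inputs))))
        for l in range(n_out)
    ]


def matmul(a, b):
    """Plain matrix product, used to merge two linear layers into one."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


x = [1.0, 2.0]                             # two input values
w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, 1.0]]   # 2 inputs -> 3 hidden neurons
w2 = [[1.0], [0.5], [2.0]]                 # 3 hidden neurons -> 1 output neuron

# Two layers without a non-linearity give the same result as one merged layer.
stacked_linear = layer_forward(layer_forward(x, w1, identity), w2, identity)
collapsed_linear = layer_forward(x, matmul(w1, w2), identity)

# With a sigmoid in between, the stack can no longer be merged this way.
stacked_nonlinear = layer_forward(layer_forward(x, w1, sigmoid), w2, sigmoid)
print(stacked_linear, collapsed_linear, stacked_nonlinear)
```

Both linear versions produce the same number, which is why depth only pays off once an activation function sits between the layers.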

**Forward pass**

A neural network model has many neurons, and each of these neurons manages several weights. For a large neural network model, it is quite usual to have millions of weights. First, we initialize all the weights with small numbers (you will see why soon) and pass the input data through all the neurons. Then we calculate a final value for this data according to the weights, and we need to compare this value to the expected outcome. We use one-hot encoding to represent the class label. If we have 5 classes, we use a vector of length 5 to represent the space of classes. For a label of the 3rd class, we simply use [0,0,1,0,0]. After we one-hot encode the label, there is one more thing to do before we calculate the loss: apply softmax to the calculated value. The ‘value’ we calculated before is not a scalar but a vector. The size of the last layer of our model is exactly equal to the number of classes **m**; that is to say, we have **m** such neurons in the last layer. Softmax gives us nothing but a probability distribution over the classes, and we use the cross-entropy loss to measure the difference between our distribution and the real distribution.
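Here is a small sketch of the last steps of the forward pass: one-hot encoding the label, turning the output vector into probabilities with softmax, and measuring the cross-entropy loss. The raw scores are made-up numbers standing in for the output of the last layer, and the function names are my own.

```python
import math


def one_hot(class_index, num_classes):
    """Vector of zeros with a single 1.0 at the (0-based) class index."""
    return [1.0 if i == class_index else 0.0 for i in range(num_classes)]


def softmax(scores):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted):
    """Loss between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


label = one_hot(2, 5)                # the 3rd of 5 classes -> [0, 0, 1, 0, 0]
scores = [1.0, 2.0, 4.0, 0.5, 0.0]   # hypothetical outputs of the m = 5 last-layer neurons
probs = softmax(scores)
loss = cross_entropy(label, probs)
print(label, probs, loss)
```

The more probability the model puts on the correct class, the closer the loss gets to zero; a model that spreads its probability evenly over 5 classes gets a loss of log(5).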

**Backpropagation**

After we obtain the difference between our estimated distribution and the real class distribution, we use that value to adjust the weights of the preceding layers. The technique we apply to adjust the weights is called backpropagation. Backpropagation is so important that we can say it enabled the whole deep learning framework, yet it is nothing but the chain rule applied to partial derivatives. For a specific layer k, we call the layer before it, which is closer to the input, j, and the layer after it, which is closer to the output, l. Let y_l be the output of layer l and z_l the weighted sum before the activation function.

Then,

delta_y_k = sum_l(w_kl * delta_z_l), where delta_y_k is the derivative of the loss with respect to the output of layer k.

delta_z_k = delta_y_k * (dy_k/dz_k), where dy_k/dz_k is the derivative of the activation function.

We then use those gradients to update our weights: w_new = w_old - learning_rate * gradient_w. This method is called gradient descent. The negative gradient vector indicates the direction of steepest descent in the loss landscape, so if our model keeps moving in this direction, it can find a (local) minimum. In practice, we tend to use stochastic gradient descent: instead of computing the gradient over the whole training set, we take a small batch of data, calculate the average gradient over it, and use that average gradient to update our weights. Averaging over a batch smooths out the variance between individual training examples while keeping each update cheap.
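To show the update rule in action, here is a minimal sketch of mini-batch stochastic gradient descent on a deliberately tiny problem: fitting y = w*x + b to points that lie exactly on the line y = 2x + 1. The dataset, the learning rate, and the batch size are assumptions I picked for illustration, and the model is a single linear unit rather than a deep network.

```python
import random


def sgd_linear_fit(xs, ys, learning_rate=0.1, batch_size=4, steps=2000, seed=0):
    """Fit y = w*x + b by mini-batch SGD on the squared error 0.5 * (prediction - y)**2."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(steps):
        batch = rng.sample(indices, batch_size)   # a small random set of training examples
        grad_w, grad_b = 0.0, 0.0
        for i in batch:
            error = (w * xs[i] + b) - ys[i]
            grad_w += error * xs[i]               # d(loss)/dw for this example
            grad_b += error                       # d(loss)/db for this example
        grad_w /= batch_size                      # average gradient over the batch
        grad_b /= batch_size
        w -= learning_rate * grad_w               # w_new = w_old - learning_rate * gradient
        b -= learning_rate * grad_b
    return w, b


xs = [-1.0 + 2.0 * i / 19 for i in range(20)]     # 20 evenly spaced points in [-1, 1]
ys = [2.0 * x + 1.0 for x in xs]                  # noise-free targets from y = 2x + 1
w, b = sgd_linear_fit(xs, ys)
print(w, b)
```

Each step only looks at four examples, yet the averaged gradients still pull w and b toward 2 and 1.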

So, the insight of backpropagation is just passing the gradient of the loss backward through the whole network; there is nothing mystical or magical about it.

In the end, training finds a specific combination of values for all the millions of parameters (or weights) that gives the classifier its best performance.
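To convince yourself that backpropagation really is just the chain rule, you can compare its gradients with a brute-force numerical estimate. The sketch below is my own: a tiny network with 2 inputs, 2 sigmoid hidden neurons, 1 sigmoid output, no biases, and a squared-error loss, which I chose to keep the code short (the cross-entropy loss discussed earlier would work the same way). All weights and inputs are made-up numbers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Return the hidden outputs y_k and the final output of the network."""
    y_hidden = [
        sigmoid(sum(w_hidden[j][k] * x[j] for j in range(len(x))))
        for k in range(len(w_hidden[0]))
    ]
    y_out = sigmoid(sum(w_out[k] * y_hidden[k] for k in range(len(y_hidden))))
    return y_hidden, y_out


def loss(x, target, w_hidden, w_out):
    _, y_out = forward(x, w_hidden, w_out)
    return 0.5 * (y_out - target) ** 2


def backprop(x, target, w_hidden, w_out):
    """Gradients of the loss with respect to every weight, via the chain rule."""
    y_hidden, y_out = forward(x, w_hidden, w_out)
    delta_y_out = y_out - target                          # dLoss/dy at the output
    delta_z_out = delta_y_out * y_out * (1.0 - y_out)     # times the sigmoid derivative
    grad_w_out = [y_hidden[k] * delta_z_out for k in range(len(y_hidden))]
    # delta_y_k = sum_l(w_kl * delta_z_l); there is a single output neuron l here.
    delta_y_hidden = [w_out[k] * delta_z_out for k in range(len(y_hidden))]
    delta_z_hidden = [
        delta_y_hidden[k] * y_hidden[k] * (1.0 - y_hidden[k])
        for k in range(len(y_hidden))
    ]
    grad_w_hidden = [
        [x[j] * delta_z_hidden[k] for k in range(len(y_hidden))]
        for j in range(len(x))
    ]
    return grad_w_hidden, grad_w_out


def numerical_grad_hidden(x, target, w_hidden, w_out, j, k, eps=1e-6):
    """Central-difference estimate of dLoss/dw_hidden[j][k]."""
    plus = [row[:] for row in w_hidden]
    minus = [row[:] for row in w_hidden]
    plus[j][k] += eps
    minus[j][k] -= eps
    return (loss(x, target, plus, w_out) - loss(x, target, minus, w_out)) / (2 * eps)


x, target = [0.5, -1.0], 1.0
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [0.7, -0.5]

grad_w_hidden, grad_w_out = backprop(x, target, w_hidden, w_out)
max_diff = max(
    abs(grad_w_hidden[j][k] - numerical_grad_hidden(x, target, w_hidden, w_out, j, k))
    for j in range(2)
    for k in range(2)
)
print(max_diff)
```

The two sets of gradients agree up to tiny numerical error, but backpropagation gets all of them from one backward sweep, while the numerical estimate needs two extra forward passes per weight.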

**Local minimum**

One more thing to notice in this part is the local minimum. Many scientists used to believe that gradient descent was not a viable optimization method, because they expected it to get stuck in a poor local minimum. However, after a lot of research, scientists realized that we should not worry about the local minimum problem too much. Yes, gradient descent can get stuck, but the loss landscape is packed with a large number of points where the gradient is zero (the paper points to saddle points in particular), and most of them have about the same loss value. So it doesn’t really matter which one we get stuck at.

**Convolutional neural networks**

The convolutional neural network family is a specific deep learning structure used for image processing and classification. Though the convolution technique has been used to process images and sound for a long time, deep learning with convolution is something that has only been widely adopted in this decade.

Unlike a traditional deep learning model, a convolutional neural network, or CNN, has two major changes. The first is the convolutional layer, which produces feature maps, and the other is the pooling layer. CNNs can be used in different dimensions depending on the task. In general, 1D CNN models are used for processing signals and sequences, like language or voice signals. 2D CNN models are used for processing images or audio spectrograms. 3D CNN models are used for video or volumetric images.

Unlike other kinds of input data, the inputs for image classification are images, which have several channels. CNNs borrow that concept and extend it, so inside a CNN one image can be represented by many channels.

**Feature map and pooling**

So, to comprehend all of the new concepts, we need to start with the two most important ones: the feature map and the pooling layer. In a 2D CNN, a feature map is produced by nothing but a small weight matrix, usually called a filter or kernel. We slide this matrix over the input image and, at each position, calculate a weighted sum over a local patch of pixels; the grid of these sums is the feature map. One such matrix processes every area of a single input, which is to say that all positions share the same weights. It is quite important to understand why the same weights are shared over all parts of an image, so I will directly quote the reason from the paper.

*“The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array…”*

The pooling layers are used to merge semantically similar features. The major reason for pooling is to increase the invariance of your model. Invariance means that even if we shift or slightly distort the original picture (for example, tilt it a little), your model can still recognize it.
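Shared weights and pooling are easier to see on a tiny example. In this sketch of my own, one small weight matrix (a hand-written vertical-edge detector, where a real CNN would learn it) slides over a 4x5 toy image, and a 2x2 max pooling follows. Like most deep learning libraries, the ‘convolution’ here does not flip the kernel. The images and function names are assumptions for illustration.

```python
def convolve2d(image, kernel):
    """Slide one shared kernel over every valid position and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[a][b] * image[i + a][j + b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    return [
        [
            max(feature_map[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


edge_kernel = [[-1, 1]]                       # responds where a dark pixel is followed by a bright one

image = [[0, 0, 1, 1, 1] for _ in range(4)]   # vertical edge between columns 1 and 2
shifted = [[0, 1, 1, 1, 1] for _ in range(4)] # the same edge, moved one pixel to the left

feature_map = convolve2d(image, edge_kernel)
feature_map_shifted = convolve2d(shifted, edge_kernel)
pooled = max_pool2d(feature_map)
pooled_shifted = max_pool2d(feature_map_shifted)
print(feature_map[0], feature_map_shifted[0], pooled, pooled_shifted)
```

The edge shows up at different positions in the two feature maps, but after pooling both images give the same output, which is the small-shift invariance described above.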

**Hierarchical**

Another important take-home message from this part is that a deep learning network can learn hierarchical relations from the data. As we mentioned before, a deep learning network can learn the motifs and objects in an image. By applying both feature maps and pooling, the model is able to recognize the motifs even when their positions change the next time.

**Recurrent neural networks**

RNN is another important variation of deep learning. However, unlike CNNs and traditional deep learning models, an RNN has a sequential structure. A simple RNN is composed of a single cell that is applied repeatedly, and each repetition is what we call a time step. At every time step there is an input from sequence-structured data; for example, if our input data are sentences, at every time step we input one word from the same sentence. At each time step we also compute a hidden state, calculated as follows.

h_1 = f(W*h_0 + U*x_1 + bias), where W is the state matrix, U is the input matrix, h_0 is the hidden state from the previous step, x_1 is the current input, bias is a bias term, and f is a non-linear activation function such as tanh. W and U are both used throughout the whole sequence, which means they are the shared weights of the RNN.

By carrying the hidden state to the next step, we are able to use information from the past. Backpropagation for an RNN flows from the final step back to the first step, and we call this backpropagation through time.
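The recurrence can be sketched in a few lines. This is a minimal example of my own with made-up numbers: the hidden state has 2 units, each input has 1 value, and I apply a tanh non-linearity to the weighted sum, which is the usual choice but an assumption on my part. The same W, U, and bias are reused at every time step.

```python
import math


def rnn_forward(inputs, W, U, bias, h0):
    """Run a simple RNN over a sequence, reusing W, U, and bias at every time step."""
    h = h0
    states = []
    for x in inputs:
        h = [
            math.tanh(
                sum(W[i][j] * h[j] for j in range(len(h)))     # contribution of the previous state
                + sum(U[i][j] * x[j] for j in range(len(x)))   # contribution of the current input
                + bias[i]
            )
            for i in range(len(bias))
        ]
        states.append(h)
    return states


W = [[0.5, 0.0], [0.0, 0.5]]       # state matrix (hypothetical values)
U = [[1.0], [-1.0]]                # input matrix (hypothetical values)
bias = [0.0, 0.0]
h0 = [0.0, 0.0]
sequence = [[1.0], [0.0], [0.0]]   # three time steps with one input value each

states = rnn_forward(sequence, W, U, bias, h0)
for step, h in enumerate(states, start=1):
    print(step, h)
```

Only the first time step receives a non-zero input, yet the later hidden states are still non-zero: the information is carried forward through the hidden state.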

A very obvious shortcoming of RNNs is the vanishing gradient. This problem exists in all very deep models, but since an RNN is a sequential model that is effectively very deep when unrolled over time, it is more likely to happen here. A vanishing gradient means that the gradient of the loss, as it is passed back from the final step, becomes too small or even zero, so the weights that depend on early time steps barely get updated.
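You can watch the gradient vanish with a scalar toy RNN. In this sketch of my own, the hidden state is a single number updated as h_t = tanh(w * h_(t-1) + x_t), and the function multiplies up the chain-rule factors that link the last hidden state back to the first one. The weight 0.5 and the constant inputs are made-up values, and the tanh is an assumption.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """For a scalar RNN h_t = tanh(w * h_(t-1) + x_t), return d h_T / d h_0."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)   # chain rule factor: d h_t / d h_(t-1)
    return grad


short_grad = gradient_through_time(0.5, [0.1] * 5)    # 5 time steps
long_grad = gradient_through_time(0.5, [0.1] * 50)    # 50 time steps
print(short_grad, long_grad)
```

Every step multiplies the gradient by a factor smaller than 0.5 here, so after 50 steps almost nothing is left for the earliest time steps to learn from.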

So, scientists came up with LSTM, a variant of the RNN, to confront this problem. LSTM and similar designs introduce ‘gates’ that control what should be kept throughout the sequence and what should not. This alleviates the vanishing gradient problem because the gated memory cell can carry its state across many time steps largely unchanged, which gives the gradient a path along which it does not shrink at every step. This is just a very brief look at LSTM. If you want to read more about LSTM, here is the best article you should read.

One more thing I learned from this paper is the following paragraph:

*“The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference…”*

That is all of my summary of this paper. The paper contains so many good insights that will help you build the intuition you need for your deep learning models. I highly recommend rereading it occasionally, even if only to refresh your intuition for the different deep learning models.