Source: Deep Learning on Medium

Deep Learning

Yann LeCun, Yoshua Bengio & Geoffrey Hinton

Deep learning is a kind of representation learning, which allows a machine to be fed raw data and to automatically discover the representations needed for its task, such as classification. This eliminates the need for handcrafted features, which is one of the most important advantages of deep learning over conventional machine learning methods.

Conventional machine learning methods such as linear regression can only carve their input space into very simple regions. Complex tasks such as image recognition, however, require the input-output function to be insensitive to irrelevant variations while remaining very sensitive to relevant ones. A shallow classifier like linear regression could not possibly do that. That is why practical conventional methods with shallow classifiers have to rely on good handcrafted features, whereas deep learning provides a general-purpose learning procedure that learns good features automatically.

Supervised learning: To perform supervised learning, we need labeled data. We show the system the input and hope its prediction matches the label. This is unlikely to happen until we train the system. Training starts by computing an objective function that measures the error between the output and the desired output. The machine then adjusts its parameters, through methods such as backpropagation, to reduce this error.
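As a minimal sketch of that loop (the model, data, and learning rate here are made up for illustration): a single-weight model predicts `w * x`, the objective is the mean squared error against the labels, and the parameter is repeatedly adjusted in the direction that reduces the error.

```python
import numpy as np

# Toy supervised example: prediction = w * x, labels generated with w = 2.0.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # labels

w = 0.0                          # initial parameter
for _ in range(100):
    pred = w * x
    error = pred - y
    objective = np.mean(error ** 2)   # objective: mean squared error
    grad = np.mean(2 * error * x)     # d(objective)/dw
    w -= 0.1 * grad                   # adjust the parameter to reduce the error

print(round(w, 3))  # converges toward 2.0
```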

Backpropagation: a procedure for computing the gradient of an objective function with respect to the weights of a neural net. It is essentially an application of the chain rule of derivatives. The key insight is to start from the output and work backwards through the layers.
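A hand-rolled sketch of this idea (the network, sizes, and target are made up): a tiny two-layer net is run forward, then the chain rule is applied from the loss back toward the first layer, and one analytic gradient entry is checked against a numerical finite difference.

```python
import numpy as np

# Tiny net: h = relu(W1 @ x), out = W2 @ h, loss = 0.5 * (out - t)^2.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
t = 1.0
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))

# forward pass
z = W1 @ x
h = np.maximum(z, 0.0)               # relu
out = (W2 @ h)[0]
loss = 0.5 * (out - t) ** 2

# backward pass: start at the output and apply the chain rule backwards
d_out = out - t                      # dL/d(out)
dW2 = d_out * h[None, :]             # dL/dW2
d_h = d_out * W2[0]                  # chain rule into the hidden layer
d_z = d_h * (z > 0)                  # gradient through the relu
dW1 = np.outer(d_z, x)               # dL/dW1

# numerical check of one entry of dW1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
hp = np.maximum(W1p @ x, 0.0)
lp = 0.5 * ((W2 @ hp)[0] - t) ** 2
num = (lp - loss) / eps
print(abs(num - dW1[0, 0]) < 1e-4)   # analytic and numerical gradients agree
```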

Stochastic gradient descent (SGD): show the input vectors for a few examples, compute the outputs and errors, compute the average gradient over those examples, and adjust the weights accordingly. This process is repeated for many small sets of examples until the average of the objective function stops decreasing. Each small set of examples gives a noisy estimate of the average gradient over all examples.
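The procedure can be sketched on a toy linear model (data, batch size, and learning rate are all made up): each small batch yields a noisy gradient estimate, but the repeated updates still drive the weights toward the true solution.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 2))
true_w = np.array([3.0, -1.0])
y = X @ true_w                               # noise-free toy labels

w = np.zeros(2)
lr, batch = 0.1, 16
for epoch in range(50):
    idx = rng.permutation(len(X))            # shuffle examples each epoch
    for start in range(0, len(X), batch):
        b = idx[start:start + batch]
        err = X[b] @ w - y[b]                # errors on this small batch
        grad = X[b].T @ err / len(b)         # average gradient over the batch
        w -= lr * grad                       # weight update

print(np.round(w, 2))  # close to [ 3. -1.]
```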

Convolutional neural networks: consist of convolutional layers, which apply filters such that each unit is connected to a local patch of the previous layer, and pooling layers, which merge semantically similar features into one, typically by taking the maximum or average over a local patch. The reason for this architecture is twofold. First, in array data such as images, local groups of values are highly correlated, forming motifs that are easily detectable. Second, those motifs are invariant to location, i.e. the same motif can appear in other places. It therefore makes sense to use those motifs as base units rather than raw pixel data.
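A hand-rolled sketch of one convolution-plus-pooling stage (the image and filter values are arbitrary): each convolution output looks at a local 3x3 patch of the input, and max pooling then merges each 2x2 neighborhood into its maximum.

```python
import numpy as np

def conv2d(img, kernel):
    # each output unit sums over one local patch of the input
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def maxpool2d(fmap, size=2):
    # merge each size x size neighborhood into its maximum
    H, W = fmap.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i * size:(i + 1) * size,
                             j * size:(j + 1) * size].max()
    return out

img = np.arange(36, dtype=float).reshape(6, 6)
edge = np.array([[1.0, 0.0, -1.0]] * 3)   # a simple vertical-edge filter
fmap = conv2d(img, edge)                  # shape (4, 4)
pooled = maxpool2d(fmap)                  # shape (2, 2)
print(pooled.shape)
```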

Recurrent neural networks: use the hidden state from the previous time step as part of the input for the current time step, applying the same weight matrix W at every step. LSTMs, which add gating units, allow better modeling of long-range dependencies.
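The recurrence can be sketched as a plain RNN (not an LSTM; weights and inputs here are random placeholders): the same matrices W and U are reused at every time step, and each step feeds the previous hidden state back in.

```python
import numpy as np

def rnn_forward(xs, W, U):
    h = np.zeros(W.shape[0])           # initial hidden state
    for x in xs:                       # one iteration per time step
        h = np.tanh(W @ h + U @ x)     # same W and U at every step
    return h

rng = np.random.default_rng(2)
W = rng.normal(scale=0.5, size=(3, 3))   # hidden-to-hidden weights
U = rng.normal(scale=0.5, size=(3, 2))   # input-to-hidden weights
xs = rng.normal(size=(5, 2))             # a sequence of 5 input vectors
h_final = rng_out = rnn_forward(xs, W, U)
print(h_final.shape)
```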

Distributed representations: words are conventionally modeled as one-hot vectors. Word embeddings instead map words into dense vectors that preserve their semantic relationships.
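The contrast can be sketched with toy numbers (the embedding values below are invented, not learned): under one-hot vectors every pair of distinct words has zero similarity, while dense vectors can place related words close together.

```python
import numpy as np

vocab = ["king", "queen", "apple"]
one_hot = np.eye(len(vocab))         # one row per word, a single 1 each

# hypothetical 3-d embeddings; in practice these are learned from text
emb = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.85, 0.82, 0.12]),
    "apple": np.array([0.10, 0.05, 0.90]),
}

def cos(a, b):
    # cosine similarity between two vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# one-hot: all distinct words are equally dissimilar
print(cos(one_hot[0], one_hot[1]))   # 0.0
# dense: "king" is much closer to "queen" than to "apple"
print(cos(emb["king"], emb["queen"]) > cos(emb["king"], emb["apple"]))  # True
```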