Deep Learning glossary

Source: Deep Learning on Medium

The field is too vast to take in at once, so let’s make the first shallow steps


Deep learning neural networks are inspired by the structure of the cerebral cortex. The motivation was to create artificial neural networks that solve problems in the same way as the human brain. Deep learning is currently the most prominent and widely successful method in Artificial Intelligence (AI). But significant uncertainty remains in the technical literature as to why these networks perform so well, what their limits are and which ethical constraints we should impose upon their use.

I would like to introduce you to the basic terminology and the concepts you will want to know before diving deep into deep learning. And why dive deep into deep learning? Mainly because of these networks’ exciting ability to discover novel solutions directly from complex problem data.

Activation Functions

The activation function determines whether a given node should be “activated” or not, based on the weighted sum of the node’s inputs plus its bias. A node’s activation depends on its incoming weights and bias, so a node has learned a feature if its weights and bias cause it to activate when the feature is present in its input. Activation functions also introduce non-linear properties to the network.

The activation function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer. Common activation functions are:

  • Binary step function
  • Linear activation function
  • Sigmoid
  • TanH — Hyperbolic Tangent
  • Rectified Linear Unit (ReLU)
  • Softmax
  • Maxout
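
To make the list above concrete, here is a minimal pure-Python sketch of each function, plus a single neuron that applies one of them to its weighted sum and bias. The function names and the `neuron` helper are my own illustrative choices, not taken from any library.

```python
import math

def binary_step(z):
    """1 if the input is non-negative, otherwise 0."""
    return 1.0 if z >= 0 else 0.0

def linear(z, slope=1.0):
    """Output is proportional to the input (no non-linearity)."""
    return slope * z

def sigmoid(z):
    """Squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes any real number into the interval (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Passes positive inputs through and clips negative ones to 0."""
    return max(0.0, z)

def softmax(zs):
    """Turns a list of scores into probabilities that sum to 1."""
    m = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def maxout(pre_activations):
    """Returns the largest of several linear pre-activations."""
    return max(pre_activations)

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus bias, passed through the 'gate'."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)
```

For example, `neuron([1.0, 2.0], [0.5, -0.25], 0.0, relu)` has a weighted sum of exactly zero, so the node stays inactive.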

Back-propagation and Gradient Descent

Back-propagation, short for back-propagation of errors, has been the most popular learning algorithm since the 1980s and is at the heart of deep learning. Given an artificial neural network and a cost (or loss, or error) function, the method calculates the gradients of the error function with respect to the neural network’s weights.

Back-propagation is just a way of propagating the total loss back into the neural network to find out how much of the loss every node is responsible for. The weights are then updated in a way that reduces the loss: each weight is moved against its gradient, so the weights that contribute most to the error are adjusted the most.

Back-propagation learning

Optimization algorithms such as Stochastic Gradient Descent (SGD) use the gradient to drive learning. Other widely used optimization algorithms are RMSProp, AdaGrad, and SGD with momentum (state-of-the-art).

Your life journey is an optimization problem like gradient descent
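
As a rough illustration of how the two ideas fit together, the sketch below trains one sigmoid neuron. The `gradients` function applies the chain rule (back-propagation on the smallest possible network) and `train` uses those gradients for per-sample gradient descent. The toy data, the squared-error loss and the learning rate are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    """Prediction of a single sigmoid neuron."""
    return sigmoid(w * x + b)

def gradients(w, b, x, target):
    """Back-propagate the loss 0.5 * (y - target)**2 through the neuron."""
    y = forward(w, b, x)
    dloss_dy = y - target          # derivative of the loss w.r.t. the output
    dy_dz = y * (1.0 - y)          # derivative of the sigmoid
    dloss_dz = dloss_dy * dy_dz    # chain rule
    return dloss_dz * x, dloss_dz  # dL/dw, dL/db

def total_loss(w, b, data):
    return sum(0.5 * (forward(w, b, x) - t) ** 2 for x, t in data)

def train(data, lr=0.5, epochs=200):
    """Update after every sample (fixed order here, for reproducibility)."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in data:
            gw, gb = gradients(w, b, x, target)
            w -= lr * gw  # step against the gradient
            b -= lr * gb
    return w, b

# Toy data (hypothetical): negative inputs are class 0, positive are class 1.
data = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
w, b = train(data)
print(total_loss(0.0, 0.0, data), total_loss(w, b, data))
```

The loss after training is lower than the loss at the starting point, which is all gradient descent promises: a step-by-step walk downhill.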


Regularization

Regularization adds a penalty term to the objective function and uses that term to control the model complexity and so prevent over-fitting. The model then generalizes better, which improves its performance on unseen data.

Overfitting and different regularization methods
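
One common choice of penalty is the L2 norm of the weights (weight decay). The sketch below, for a simple linear model, shows the penalty being added to a mean squared error objective. The strength `lam` and the one-sample data set are assumptions for illustration only.

```python
def mse(weights, data):
    """Mean squared error of a linear model without bias."""
    total = 0.0
    for inputs, target in data:
        prediction = sum(w * x for w, x in zip(weights, inputs))
        total += (prediction - target) ** 2
    return total / len(data)

def l2_penalty(weights, lam):
    """Penalty term: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def regularized_loss(weights, data, lam):
    """Objective = data fit + complexity penalty."""
    return mse(weights, data) + l2_penalty(weights, lam)

# Hypothetical data: both weight vectors below fit this sample perfectly.
data = [([1.0, 2.0], 5.0)]
small = regularized_loss([1.0, 2.0], data, lam=0.1)
large = regularized_loss([3.0, 1.0], data, lam=0.1)
print(small, large)
```

Both weight vectors reproduce the target exactly, but the penalty makes the objective prefer the one with the smaller weights.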

Artificial Neural Networks (ANN)

Artificial Neural Networks (ANN) are the pieces of a computing system designed to simulate the way the human brain analyzes and processes information.

A neural network can be “shallow”, meaning it has an input layer of neurons, only one “hidden layer” that processes the inputs, and an output layer that provides the final output of the model. A Deep Neural Network (DNN) has additional hidden layers of neurons, commonly more than two.

An example of Artificial Neural Network

ANNs are often specialized, i.e. they use different activation functions and different numbers of nodes (neurons) per layer.
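
A minimal sketch of a forward pass through a “shallow” network, with one hidden layer of three ReLU neurons and a single sigmoid output neuron. The weights are made-up numbers; in a real network they would be learned.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer; weights[j] holds neuron j's incoming weights."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Feed the inputs through each layer in turn."""
    out = inputs
    for weights, biases, activation in layers:
        out = dense(out, weights, biases, activation)
    return out

# Each layer has its own size and activation (hypothetical values).
shallow_net = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0], relu),  # hidden layer
    ([[1.0, 1.0, 1.0]], [-1.0], sigmoid),                              # output layer
]
output = forward([2.0, 1.0], shallow_net)
print(output)
```

Making the network deep only means appending more `(weights, biases, activation)` entries to the list.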

Deep Convolutional Neural Networks

Convolutional Neural Networks (CNNs or ConvNets) are a specialized form of feed-forward neural networks, in that they use convolution in place of general matrix multiplication in at least one of their layers. They are most commonly applied to analyzing visual imagery or sound. ConvNets learn feature maps that express relationships within neighborhoods of data features, making them a good choice for datasets that exhibit a grid-like topology. They summarize regions by learning patterns.

Common properties are:

  • Kernel: a matrix that is slid across the input (for example an image) and multiplied element-wise with the region it covers.
  • Filter: multiple stacked kernels.
  • Pooling: provides a summary statistic for each group of neighboring outputs, and thus offers invariance to small translations of the input.
  • Stride: the step size of the convolution operation.
  • Padding: the process of adding layers of zeros around the input prior to the convolution.

An example of CNN architecture
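
The properties listed above can be sketched in a few lines of plain Python. This is an illustrative single-channel version, not a library implementation. Like most deep learning libraries it does not flip the kernel, so strictly speaking it computes a cross-correlation. The 4x4 image and 2x2 kernel are made-up values.

```python
def pad(image, p):
    """Surround a 2-D list with p layers of zeros."""
    width = len(image[0]) + 2 * p
    top = [[0.0] * width for _ in range(p)]
    bottom = [[0.0] * width for _ in range(p)]
    middle = [[0.0] * p + list(row) + [0.0] * p for row in image]
    return top + middle + bottom

def conv2d(image, kernel, stride=1, padding=0):
    """Slide the kernel over the (padded) image in steps of `stride`."""
    image = pad(image, padding)
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(0, len(image) - kh + 1, stride):
        row = []
        for j in range(0, len(image[0]) - kw + 1, stride):
            row.append(sum(kernel[a][b] * image[i + a][j + b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

def max_pool(feature_map, size=2):
    """Summarize each non-overlapping size x size block by its maximum."""
    return [
        [max(feature_map[i + a][j + b] for a in range(size) for b in range(size))
         for j in range(0, len(feature_map[0]) - size + 1, size)]
        for i in range(0, len(feature_map) - size + 1, size)
    ]

image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [0, 1, 2, 3]]
kernel = [[1, 0],
          [0, -1]]

print(conv2d(image, kernel))             # 3x3 feature map
print(conv2d(image, kernel, stride=2))   # 2x2 feature map
print(conv2d(image, kernel, padding=1))  # 5x5 feature map
print(max_pool(image))                   # 2x2 summary
```

A larger stride shrinks the output, padding grows it, and pooling keeps only one value per neighborhood.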

Case-study CNNs: VGGNet or ResNet

ReLU is the most common activation function for CNNs.


Autoencoders

An autoencoder is a feed-forward neural network that learns to reproduce an approximation of its input in an unsupervised manner. It learns to encode its input into a “hidden code”, hence the name autoencoder.

The aim of an autoencoder is to learn a representation for a set of data, typically for dimensionality reduction, clustering, generative modelling or manifold learning.

Autoencoder architecture

If we use a shallow, linear autoencoder that is trained to minimise the MSE between its input and its output, the result is approximately equivalent to Principal Component Analysis (PCA).
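
The link to PCA can be checked numerically on a toy problem. The sketch below makes several simplifying assumptions of my own: 2-D data that is already centered, a 1-D code, tied encoder and decoder weights, and plain batch gradient descent. It then compares the learned direction with the first principal direction found by power iteration.

```python
import math

# Hypothetical, already-centered 2-D data lying close to the line y = x.
data = [(-2.0, -2.1), (-1.0, -0.9), (0.0, 0.0), (1.0, 0.9), (2.0, 2.1)]

def train_linear_autoencoder(data, lr=0.02, epochs=500):
    """Tied-weight linear autoencoder: code z = w.x, reconstruction = w * z."""
    w = [0.5, 0.1]  # arbitrary starting point
    n = len(data)
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for x in data:
            z = w[0] * x[0] + w[1] * x[1]           # encode
            r = [x[0] - w[0] * z, x[1] - w[1] * z]  # reconstruction residual
            wr = w[0] * r[0] + w[1] * r[1]
            for k in range(2):
                # gradient of the mean squared reconstruction error
                grad[k] += -2.0 * (z * r[k] + wr * x[k]) / n
        w = [w[k] - lr * grad[k] for k in range(2)]
    return w

def principal_direction(data, iterations=100):
    """First principal direction of centered data via power iteration."""
    n = len(data)
    cxx = sum(x[0] * x[0] for x in data) / n
    cxy = sum(x[0] * x[1] for x in data) / n
    cyy = sum(x[1] * x[1] for x in data) / n
    v = [1.0, 0.0]
    for _ in range(iterations):
        v = [cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1]]
        norm = math.sqrt(v[0] ** 2 + v[1] ** 2)
        v = [v[0] / norm, v[1] / norm]
    return v

w = train_linear_autoencoder(data)
v = principal_direction(data)
alignment = abs(w[0] * v[0] + w[1] * v[1])  # close to 1 when the directions agree
print(w, v, alignment)
```

In this setting the learned weight vector ends up pointing along the first principal direction. With more code units, a linear autoencoder in general recovers the same subspace as PCA rather than the individual components.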

Recurrent Neural Networks (RNN)

Humans don’t start their thinking from scratch every second. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence. Feed-forward neural networks are not capable of doing this. Recurrent neural networks (RNNs) address this issue. They are networks with loops (cycles) in them, allowing information to persist.

The main difference from other neural networks is that RNNs take a sequence of data into account, often a sequence evolving over time, and thus exhibit temporal dynamic behavior. Most importantly, this sequence can be of arbitrary length. For example, when analyzing temporal data (time series), the network still holds in memory part or all of the observations that preceded the data point being analyzed.

RNN cells are the backbone of recurrent neural networks. An RNN cell, in the most abstract setting, is anything that has a state and performs some operation that takes an input together with the previous state and produces an output and an updated state.

Representation of an RNN cell, source: Stanford

The hyperbolic tangent (tanh) is the activation function most often used in RNNs.

Recurrent neural networks are at the heart of many substantial improvements in areas as diverse as speech recognition, automatic music composition, sentiment analysis, DNA sequence analysis and machine translation.

RNNs are good at handling sequential data, but they run into problems when the relevant context is far away. This is the problem of long-term dependencies, which cause the gradients propagated over many stages to explode or vanish. And even if the parameters are stable, long-term dependencies will have smaller weights than short-term dependencies. LSTMs were designed to address this problem.
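
A minimal sketch of a recurrent cell with a tanh activation, reduced to scalars so the loop is easy to follow. The weights `w_x`, `w_h` and `b` are hypothetical values. The same cell is applied at every time step, which is why the input sequence can have any length.

```python
import math

def rnn_cell(x, h_prev, w_x, w_h, b):
    """One time step: the new state mixes the current input with the old state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Apply the same cell to every element of the sequence, carrying the state."""
    h = 0.0  # initial state
    states = []
    for x in sequence:
        h = rnn_cell(x, h, w_x, w_h, b)
        states.append(h)
    return states

# A single "event" followed by silence: the state keeps a fading trace of it.
states = run_rnn([1.0, 0.0, 0.0, 0.0])
print(states)
```

The state stays positive after the input has gone back to zero, but it shrinks at every step. That fading memory is a small-scale picture of why plain RNNs struggle with distant context.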

Long Short-Term Memory (LSTM)

LSTMs are a very special kind of recurrent neural network which works, for many tasks, much, much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them.

LSTMs were developed to deal with the exploding and vanishing gradient problems that can be encountered when training traditional RNNs. The LSTM unit, also called a “memory cell”, is a gated unit (it contains gates). LSTM recurrent networks have “LSTM cells” that have an internal recurrence (a self loop), in addition to the outer recurrence of the RNN. LSTMs can remove information from or add information to the cell state, carefully regulated by gates. Gates are a way to optionally let information through. An LSTM has three of these gates, to protect and control the cell state.
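
The three gates can be sketched with scalars as follows. This follows the commonly used LSTM formulation (forget, input and output gates plus a tanh candidate). The parameter dictionary and its values are hypothetical and chosen only to make the gating visible.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps a name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))               # how much of the old cell state to keep
    i = sigmoid(pre("input"))                # how much new information to write
    o = sigmoid(pre("output"))               # how much of the cell state to expose
    candidate = math.tanh(pre("candidate"))  # proposed new content

    c = f * c_prev + i * candidate           # internal recurrence (self loop)
    h = o * math.tanh(c)                     # output / outer recurrent state
    return h, c

# Hypothetical parameters: forget gate wide open, input gate almost shut.
params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = lstm_step(5.0, 0.0, 0.7, params)
print(h, c)
```

With the forget gate open and the input gate closed, the cell state passes through the step almost unchanged, even though the input is large. This is the mechanism that lets information survive over many time steps.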

LSTM vs RNN Unit

A variant of LSTMs is the Gated Recurrent Unit (GRU).

An important limitation of LSTMs is their memory, which can be strained, for example when they are forced to remember a single observation across a very long sequence of time steps.

Generative Adversarial Networks (GAN)

GANs are made up of a system of two neural network models, the generator and the discriminator, which compete with each other (thus adversarial) and are able to analyze, capture and copy the variations within a dataset in an unsupervised manner.

One neural network, called the generator, generates new data instances, while the other, the discriminator, evaluates them for authenticity; i.e. the discriminator decides whether each instance of data that it reviews belongs to the actual training dataset or not. The goal of the generator is to generate content that mimics the real one, i.e. to lie without being caught. Both nets are trying to optimize a different and opposing objective function, or loss function, in a zero-sum game.
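
The opposing objectives can be written down compactly. The sketch below assumes the original minimax formulation, in which the discriminator maximizes a value V and the generator minimizes the same V. The networks themselves are left out: the functions only take the probabilities the discriminator assigned to real and generated samples, and those numbers are made up.

```python
import math

def gan_value(d_real, d_fake):
    """V = mean(log D(x)) + mean(log(1 - D(G(z))))."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def discriminator_loss(d_real, d_fake):
    """The discriminator wants V to be large, so it minimizes -V."""
    return -gan_value(d_real, d_fake)

def generator_loss(d_real, d_fake):
    """The generator wants V to be small, so it minimizes V."""
    return gan_value(d_real, d_fake)

d_real = [0.9, 0.8]        # discriminator's scores for real samples
d_fake_caught = [0.1, 0.2] # generated samples recognized as fake
d_fake_fooled = [0.6, 0.7] # generated samples mistaken for real

print(discriminator_loss(d_real, d_fake_caught) + generator_loss(d_real, d_fake_caught))
print(generator_loss(d_real, d_fake_caught), generator_loss(d_real, d_fake_fooled))
```

The two losses always sum to zero, and the generator’s loss drops when its samples fool the discriminator. In practice, many implementations replace the generator’s loss with a non-saturating variant, which is outside the scope of this sketch.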

Basic GAN architecture, source:

They are widely used for data augmentation and data synthesis in applications such as image, video and voice generation. They can also be used to generate fake media content, and are the technology underpinning DeepFakes.

Philosophy of Deep Learning

A few thoughts to share on Deep Learning

Deep Learning and Humans, source: Melanie Swan

  • Deep Learning comes to the fore in the presence of Big Data, where older learning algorithms no longer perform well
  • Deep Learning redefines human identity in the context of the machine age
  • Deep Learning is an advanced statistical method
  • Deep Learning is a smart network: a global computational infrastructure that operates autonomously