Source: Deep Learning on Medium

So what exactly is deep learning? How does it work? And most importantly, why should you even care?

**What is Machine Learning?**

Before we dive into deep learning, I want to take a step back and talk a little bit about the broader field of “machine learning” and what it means when we say that we’re programming machines to *learn*.

Sometimes we encounter problems for which it’s really hard to write a computer program to solve. For example, let’s say we wanted to program a computer to recognize hand-written digits:

You could imagine trying to devise a set of rules to distinguish each individual digit. Zeros, for instance, are basically one closed loop. But what if the person didn’t perfectly close the loop. Or what if the right top of the loop closes below where the left top of the loop starts?

In this case, we have difficulty differentiating zeros from sixes. We could establish some sort of cutoff, but how would you decide the cutoff in the first place? As you can see, it quickly becomes quite complicated to compile a list of heuristics (i.e., rules and guesses) that accurately classifies handwritten digits.

And there are so many more classes of problems that fall into this category. Recognizing objects, understanding concepts, comprehending speech. We don’t know what program to write because we still don’t know how it’s done by our own brains. And even if we did have a good idea about how to do it, the program might be horrendously complicated.

So instead of trying to write a program, we try to develop an algorithm that a computer can use to look at hundreds or thousands of examples (and the correct answers), and then the computer uses that experience to solve the same problem in new situations. Essentially, our goal is to teach the computer to solve by example, very similar to how we might teach a young child to distinguish a cat from a dog.

One of the big challenges with traditional machine learning models is a process called *feature extraction*. Specifically, the programmer needs to tell the computer what kinds of things it should be looking for that will be informative in making a decision. Feeding the algorithm raw data rarely ever works, so feature extraction is a critical part of the traditional machine learning workflow. This places a huge burden on the programmer, and the algorithm’s effectiveness relies heavily on how insightful the programmer is. For complex problems such as object recognition or handwriting recognition, this is a huge challenge.

*Deep learning is one of the only methods by which we can circumvent the challenges of feature extraction. This is because deep learning models are capable of learning to focus on the right features by themselves, requiring little guidance from the programmer. This makes deep learning an extremely powerful tool for modern machine learning.*

**Terms you should know and what they mean —**

**Neuron (Node)** — The basic unit of a neural network. It receives a certain number of inputs and a bias value. When a signal (value) arrives, it gets multiplied by a weight value. If a neuron has 4 inputs, it has 4 weight values, which can be adjusted during training.

**Connections** — A connection links a neuron in one layer to a neuron in another layer or in the same layer. Every connection has a weight value associated with it. The goal of training is to update these weight values to decrease the loss (error).

**Bias (Offset)** — An extra input to a neuron, with its own connection weight. It ensures that even when all the inputs are zero (all 0’s), there will still be an activation in the neuron.

The weight reflects the effectiveness of a particular input: the greater the weight of an input, the more impact it will have on the network.

Bias, on the other hand, is like the intercept added in a linear equation. It is an additional parameter in the neural network, used to adjust the output along with the weighted sum of the inputs to the neuron. Bias is therefore a constant that helps the model fit the given data as well as possible.
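Putting the last few definitions together, a single neuron can be sketched as a weighted sum plus a bias. All the numbers below are made up for illustration:

```python
# A neuron with 4 inputs has 4 adjustable weights plus a bias.
# The input, weight, and bias values here are purely illustrative.

def neuron_output(inputs, weights, bias):
    # weighted sum of the inputs, shifted by the bias (the "intercept")
    return sum(x * w for x, w in zip(inputs, weights)) + bias

inputs = [1.0, 2.0, 3.0, 0.5]
weights = [0.2, -0.1, 0.4, 0.3]
bias = 0.5

print(neuron_output(inputs, weights, bias))  # 0.2 - 0.2 + 1.2 + 0.15 + 0.5 = 1.85
```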

**Activation functions** — Activation functions are an extremely important feature of artificial neural networks. They decide whether a neuron should be activated or not, i.e., whether the information the neuron is receiving is relevant or should be ignored.

The activation function is the non-linear transformation that we apply to the input signal. This transformed output is then sent to the next layer of neurons as input.

Now the question arises: if the activation function adds so much complexity, can we do without one?

When we do not have an activation function, the weights and bias simply perform a linear transformation. A linear equation is simple to solve but limited in its capacity to handle complex problems: a neural network without an activation function is essentially just a linear regression model. The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks. We want our neural networks to work on complicated tasks like language translation and image classification, and linear transformations alone could never perform such tasks.
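To see why depth alone doesn’t help without activations, here is a quick numerical check that two stacked linear layers collapse into a single linear layer (the weights are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                       # a random input vector

W1 = rng.normal(size=(5, 4)); b1 = rng.normal(size=5)
W2 = rng.normal(size=(3, 5)); b2 = rng.normal(size=3)

# Two linear layers applied one after the other...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...are equivalent to a single linear layer with combined parameters.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True: stacking adds no expressive power
```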

Popular types of activation functions include sigmoid, tanh and ReLU.
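A minimal sketch of three common ones (sigmoid, tanh, ReLU), using their standard definitions:

```python
import math

def sigmoid(x):
    # squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # squashes input into (-1, 1), centred at zero
    return math.tanh(x)

def relu(x):
    # passes positive inputs through, zeroes out negative ones
    return max(0.0, x)

print(sigmoid(0.0))  # 0.5
print(relu(-2.0))    # 0.0
print(relu(3.0))     # 3.0
```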

**Input Layer** — This is the first layer in the neural network. It takes input signals (values) and passes them on to the next layer. It doesn’t apply any operations to the input signals and has no weights or bias values associated with it. In our network we have 4 input signals: x1, x2, x3, x4.

**Hidden Layers** — Hidden layers contain neurons (nodes) which apply different transformations to the input data. One hidden layer is a collection of neurons stacked vertically. In the image given below we have 5 hidden layers: the first hidden layer has 4 neurons (nodes), the 2nd has 5, the 3rd has 6, the 4th has 4 and the 5th has 3. The last hidden layer passes its values on to the output layer. Every neuron in a hidden layer is connected to each and every neuron in the next layer, so we have fully connected hidden layers.

**Output Layer** — This is the last layer in the network; it receives input from the last hidden layer. With this layer we can get the desired number of values in a desired range. In this network we have 3 neurons in the output layer, and it outputs y1, y2, y3.

**Input Shape** — The shape of the input matrix we pass to the input layer. Our network’s input layer has 4 neurons and expects 4 values per sample. The desired input shape for our network is (1, 4, 1) if we feed it one sample at a time; if we feed 100 samples, the input shape will be (100, 4, 1). Different libraries expect shapes in different formats.

**Weights (Parameters)** — A weight represents the strength of the connection between units. If the weight from node 1 to node 2 has a greater magnitude, it means that neuron 1 has greater influence over neuron 2. A weight scales the importance of an input value: weights near zero mean changing the input will not change the output, while negative weights mean increasing the input will decrease the output. In short, a weight decides how much influence an input will have on the output.

**Forward Propagation** — Forward propagation is the process of feeding input values to the neural network and getting an output, which we call the predicted value. Forward propagation is sometimes referred to as inference. When we feed the input values to the neural network’s first layer, they pass through without any operations. The second layer takes values from the first layer and applies multiplication, addition and activation operations, then passes the result to the next layer. The same process repeats for subsequent layers, and finally we get an output value from the last layer.
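The multiply–add–activate loop can be sketched with NumPy. This is a minimal illustration using ReLU throughout and random weights; the layer sizes are made up and are not the 5-hidden-layer network described above:

```python
import numpy as np

def relu(z):
    # activation applied after the multiply-and-add step
    return np.maximum(0.0, z)

def forward(x, layers):
    """Propagate input x through a list of (weights, bias) layers."""
    a = x
    for W, b in layers:
        a = relu(W @ a + b)   # multiplication, addition, then activation
    return a

rng = np.random.default_rng(1)
x = np.array([0.5, -1.0, 2.0, 0.1])                 # 4 input signals
layers = [
    (rng.normal(size=(5, 4)), rng.normal(size=5)),  # hidden layer: 4 -> 5
    (rng.normal(size=(3, 5)), rng.normal(size=3)),  # output layer: 5 -> 3
]
y_pred = forward(x, layers)
print(y_pred.shape)  # (3,) -- one predicted value per output neuron
```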

**Back-Propagation** — After forward propagation we get an output value, the *predicted value*. To calculate the error we compare the predicted value with the *actual output value*, using a *loss function* (described below) to compute the *error value*. Then we calculate the derivative of the error value with respect to each and every weight in the network. Back-propagation uses the chain rule of differential calculus: first we calculate the derivatives of the *error value* with respect to the *weight values* of the last layer. We call these derivatives *gradients*, and we use them to calculate the gradients of the second-to-last layer. We repeat this process until we have gradients for each and every weight in the network. Then we subtract these *gradient values* from the *weight values* to reduce the error. In this way we move closer (descend) to a *local minimum* (i.e., minimum loss).
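As a worked example of the chain rule, here is back-propagation through a single linear neuron with a squared-error loss. The training example and parameter values are made up:

```python
# Back-propagation through a single neuron y_hat = w*x + b
# with squared-error loss L = (y_hat - y)^2.  By the chain rule:
#   dL/dw = dL/dy_hat * dy_hat/dw = 2*(y_hat - y) * x
#   dL/db = dL/dy_hat * dy_hat/db = 2*(y_hat - y) * 1

x, y = 2.0, 1.0        # one training example: input and actual output
w, b = 0.5, 0.1        # current weight and bias

y_hat = w * x + b      # forward pass: predicted value, 1.1
error = y_hat - y      # 0.1

grad_w = 2 * error * x  # gradient w.r.t. the weight: 0.4
grad_b = 2 * error      # gradient w.r.t. the bias:   0.2

# Take a small step against each gradient (step size 0.1 here):
w, b = w - 0.1 * grad_w, b - 0.1 * grad_b
print(w * x + b)  # the new prediction is closer to the actual value 1.0
```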

**Learning rate** — When we train neural networks we usually use *gradient descent* to optimize the weights. At each iteration we use back-propagation to calculate the derivative of the loss function with respect to each weight, and subtract a fraction of it from that weight. The learning rate determines how quickly or slowly the *weight* (parameter) *values* are updated. It should be high enough that convergence doesn’t take ages, and low enough that the optimizer actually finds the local minimum.
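A toy illustration of how the learning rate affects convergence, minimising a simple quadratic in place of a real network loss (the function and step counts are made up):

```python
# Minimise f(w) = (w - 3)^2 by gradient descent; its gradient is 2*(w - 3).

def descend(lr, steps=100, w=0.0):
    for _ in range(steps):
        w -= lr * 2 * (w - 3)   # subtract learning_rate * gradient
    return w

print(round(descend(lr=0.1), 4))   # 3.0 -- small enough step: converges to the minimum
print(abs(descend(lr=1.1)) > 1e6)  # True -- step too large: the updates diverge
```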

**Accuracy** — The fraction of predictions the model gets right: the number of correct predictions divided by the total number of predictions.

**Precision** — The fraction of instances the model labelled as positive that are actually positive (true positives divided by all predicted positives).

**Recall (Sensitivity)** — The fraction of relevant instances that have been retrieved over the total number of relevant instances (true positives divided by all actual positives).
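These three metrics can be computed from the counts of true/false positives and negatives of a binary classifier. The counts below are made up:

```python
# Toy confusion counts for a binary classifier (made-up numbers):
tp, fp = 8, 2    # predicted positive: correctly / incorrectly
fn, tn = 4, 86   # predicted negative: incorrectly / correctly

accuracy = (tp + tn) / (tp + fp + fn + tn)  # fraction of all predictions that were right
precision = tp / (tp + fp)                  # of the predicted positives, how many were right
recall = tp / (tp + fn)                     # of the actual positives, how many were found

print(accuracy)   # 0.94
print(precision)  # 0.8
print(recall)     # 0.666...
```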

**Convergence** — Convergence is when, as the iterations proceed, the output gets closer and closer to a specific value.

**Regularization** — Used to overcome the over-fitting problem. In regularization we penalize our loss term by adding an L1 (LASSO) or an L2 (Ridge) norm of the weight vector *w* (the vector of learned parameters in the given algorithm).

L(loss function) + *λN(w)* — here *λ* is your **regularization term** and *N(w)* is the L1 or L2 norm.
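A minimal sketch of adding the penalty to a loss value. Here N(w) is taken as the sum of absolute weights for L1 and the sum of squared weights for L2, the forms commonly used in practice; the weight vector and λ are made up:

```python
import numpy as np

def regularized_loss(loss, w, lam, norm="l2"):
    """Penalised loss: L + lambda * N(w) for a weight vector w."""
    if norm == "l1":
        penalty = np.sum(np.abs(w))   # L1 (LASSO): sum of absolute weights
    else:
        penalty = np.sum(w ** 2)      # L2 (Ridge): sum of squared weights
    return loss + lam * penalty

w = np.array([0.5, -1.0, 2.0])                        # learned parameters (illustrative)
print(regularized_loss(1.0, w, lam=0.01))             # 1.0 + 0.01 * 5.25 = 1.0525
print(regularized_loss(1.0, w, lam=0.01, norm="l1"))  # 1.0 + 0.01 * 3.5  = 1.035
```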

**Normalization** — Data normalization is the process of rescaling one or more attributes to the range 0 to 1. Normalization is a good technique to use when you do not know the distribution of your data, or when you know the distribution is not Gaussian (a bell curve). It also helps speed up the learning process.
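A minimal min–max rescaling sketch (the attribute values are made up):

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([10, 20, 30, 40]))  # [0.0, 0.333..., 0.666..., 1.0]
```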

**Fully Connected Layers** — When the activations of all nodes in one layer go to each and every node in the next layer. When all the nodes in the Lth layer connect to all the nodes in the (L+1)th layer, we call these fully connected layers.

**Loss Function/Cost Function** — The loss function computes the error for a single training example. The cost function is the average of the loss functions over the entire training set.

- *mse*: mean squared error.
- *binary_crossentropy*: binary logarithmic loss (logloss).
- *categorical_crossentropy*: multi-class logarithmic loss (logloss).
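The loss/cost distinction can be shown in code, using squared error as the per-example loss (the predictions and targets are made up):

```python
def squared_error(y_pred, y_true):
    # loss: the error for a single training example
    return (y_pred - y_true) ** 2

def mse_cost(preds, targets):
    # cost: the average of the per-example losses over the whole training set
    return sum(squared_error(p, t) for p, t in zip(preds, targets)) / len(preds)

preds = [1.0, 2.5, 0.0]     # predicted values
targets = [1.0, 2.0, 1.0]   # actual values
print(mse_cost(preds, targets))  # (0 + 0.25 + 1) / 3 = 0.41666...
```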

**Model Optimizers** — The optimizer is a search technique used to update the weights in the model.

**Performance Metrics** — Performance metrics are used to measure the performance of the neural network. Accuracy, loss, validation accuracy, validation loss, mean absolute error, precision, recall and f1 score are some performance metrics.

**Batch Size** — The number of training examples in one forward/backward pass. The higher the batch size, the more memory space you’ll need.

**Training Epochs** — It is the number of times that the model is exposed to the training dataset.

One **epoch** = one forward pass and one backward pass of *all* the training examples.
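With mini-batches, one epoch takes several forward/backward passes. A small helper (illustrative, not from any particular library) to count them:

```python
import math

def batches_per_epoch(num_examples, batch_size):
    # one epoch shows the model every training example once; with
    # mini-batches that takes ceil(N / batch_size) forward/backward passes
    return math.ceil(num_examples / batch_size)

print(batches_per_epoch(1000, 32))  # 32 (the last batch holds only 8 examples)
```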

Below are a few useful topics that you need to drink in and digest before you start Deep Learning/Machine Learning :) —

- Linear algebra: vector and matrix operations.
- Probability and statistics.
- Calculus, especially differential calculus.
- Numerical optimization.