Original article was published by Jeheonpark on Deep Learning on Medium

When I was young, I really liked Lego. It was amazing because I could build anything with small blocks: dragons, castles, and trains. I am no kid anymore, but deep learning gives me almost the same feeling. There are basic building blocks, and you can build anything you want with them: autonomous driving systems, images, drug candidates. In this post, I will explain the basic blocks and the glue that sticks them together.

# Forward & Backpropagation

We need to know how a neural net calculates its output and its error. It is really easy. You feed in the input, and the input layer passes the result of its calculation to the next hidden layer. The calculation consists of a linear function and a non-linear function, the activation function. Each neuron represents one linear regression with an activation function at the end; without activation functions, the whole network would just be one big linear regression. We propagate the result from layer to layer until it reaches the output layer, where we evaluate the loss function. We have many parameters in each neuron, and we need to figure out how much each one contributes to the loss. Linear regression uses Gradient Descent to minimize its loss, and it is the same here. We will use Gradient Descent, but the special ingredient is the chain rule of differentiation.

We can apply the chain rule because the neural network is actually a function composition of linear functions and activation functions. We differentiate step by step backwards from the output, and we learn how much each node contributed to the error. Many problems were raised and solved in this part, for example the vanishing gradient, the exploding gradient, and the choice of activation function.
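The two passes can be sketched for a tiny network. This is an illustrative NumPy toy, not code from the article: one hidden layer with a sigmoid activation, a linear output, a squared-error loss, and the chain rule applied by hand.

```python
import numpy as np

# Toy 1 -> 2 -> 1 network: linear, sigmoid, linear, squared-error loss.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 1)), np.zeros((2, 1))
W2, b2 = rng.normal(size=(1, 2)), np.zeros((1, 1))
x, y = np.array([[0.5]]), np.array([[1.0]])

# Forward pass: each layer tosses its result to the next one.
z1 = W1 @ x + b1            # linear part of the hidden layer
a1 = sigmoid(z1)            # non-linear part (the activation)
z2 = W2 @ a1 + b2           # output layer (linear)
loss = 0.5 * (z2 - y) ** 2

# Backward pass: chain rule, from the loss back to each parameter.
dz2 = z2 - y                # dL/dz2
dW2 = dz2 @ a1.T            # dL/dW2
da1 = W2.T @ dz2            # dL/da1
dz1 = da1 * a1 * (1 - a1)   # sigmoid'(z1) = a1 * (1 - a1)
dW1 = dz1 @ x.T             # dL/dW1
```

Each gradient is one factor of the chain rule multiplied onto the gradients already computed for the layer above it.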

# Gradient Descent

Let’s say you are at the summit of a mountain and have no map. How do you get down? There are many answers, but deep learning chooses Gradient Descent: always walk down the steepest slope. It is a nice strategy. However, we are in the world of mathematics, and we cannot simply see where the cost function is steepest. Fortunately, Newton and Leibniz invented differentiation, so we can calculate the steepness from the derivative; with multiple variables, we use partial derivatives. Unfortunately, the mountain of the cost function is really complex. There are local minima, saddle points, and plateaus. We need to overcome those unexpected obstacles and get down the hill as fast as possible.

## Batch Gradient Descent

You can consider your training instances individually, or you can consider the whole training set when updating your position on the cost function. Batch Gradient Descent differentiates the cost over the whole training set with respect to each variable (partial derivatives) and applies the result in one step. So far we have decided the direction of the step; we can also determine its size. We call this the learning rate.

If the learning rate is too big, you can skip over the optimum point and just jump to another hill. If it is too small, training is slow and costs a lot of computation.
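A tiny experiment makes this concrete. The function, the starting point, and the three learning rates below are made up for illustration; gradient descent on f(w) = w² should reach the minimum at w = 0.

```python
import numpy as np

# Gradient descent on f(w) = w**2, whose gradient is 2w.
def descend(lr, steps=50, w0=5.0):
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w   # step against the gradient
    return w

small = descend(lr=0.01)  # converges, but slowly
good = descend(lr=0.1)    # converges quickly
big = descend(lr=1.1)     # overshoots: each step jumps past the minimum
```

With lr = 1.1 each update multiplies w by -1.2, so the iterate jumps back and forth across the minimum with growing amplitude instead of settling.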

## Stochastic Gradient Descent

Batch Gradient Descent has a big problem: it costs a lot of computation power and is slow. The solution is Stochastic Gradient Descent (SGD). SGD randomly picks one instance from the training set and calculates the gradient from it. Since we choose the instance arbitrarily, our direction bounces up and down. This can be an advantage, because it does not converge easily to a local optimum. It will still converge towards the optimal point, because the gradient estimate is unbiased. This means SGD has low bias but high variance, and we need to control the variance. One way is to control the step size: if we slowly reduce the learning rate, the variance near the optimal point can be controlled. This schedule is similar to simulated annealing.

## Mini-batch Gradient Descent

Can we combine both methods? Yes, we can. Mini-batch Gradient Descent computes the gradient on randomly picked subsets. It reduces the variance of SGD, and its computation costs less than Batch Gradient Descent.
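All three variants are the same loop with a different batch size. The sketch below fits one linear-regression weight on synthetic noise-free data (y = 3x, a made-up toy problem); the only thing that changes between the variants is how many instances feed each gradient step.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=200)
y = 3.0 * X                    # toy targets: the true weight is 3

def fit(batch_size, lr=0.1, epochs=20):
    w = 0.0
    for _ in range(epochs):
        idx = rng.permutation(len(X))           # shuffle each epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            grad = np.mean(2 * (w * X[b] - y[b]) * X[b])  # d/dw of MSE
            w -= lr * grad
    return w

w_batch = fit(batch_size=len(X))  # Batch GD: whole set per step
w_sgd = fit(batch_size=1)         # SGD: one instance per step
w_mini = fit(batch_size=32)       # Mini-batch: a compromise
```

All three end up near 3, but the batch version takes one smooth step per epoch while SGD takes 200 noisy ones.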

# Vanishing & Exploding Gradient

Now I have explained how training a neural network works. However, there are two problems with implementing this method directly: the gradients can shrink smaller and smaller, or grow larger and larger, as they propagate backwards. This either stops learning entirely or makes the loss go up instead of down. Those problems were addressed by Glorot and Bengio, who suggested changing the initializer and the activation function. What was the problem?

At that time, people used the logistic activation function and a plain normal-distribution initializer. Glorot and Bengio found that the variance of each layer’s outputs is greater than the variance of its inputs, so the values drift towards the saturated left and right ends of the logistic curve, where the gradient is close to zero.

## Initializer

Our problem was that the variance of the outputs is much greater than the variance of the inputs. Imagine a neuron summing 10 inputs drawn from a Gaussian distribution with mean 0 and variance 1: the variance of the sum is 10. This is what pushes the activation function into saturation. Glorot and Bengio suggested initializers that reduce the output variance by taking the number of inputs and neurons of a layer into account: the Glorot initializer uses variance 1/fan(avg) and the He initializer uses variance 2/fan(in), where fan(avg) is the average of the number of inputs and the number of neurons, and fan(in) is the number of inputs. Controlling the variance of the normal distribution solves the saturation problem.
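A minimal NumPy sketch of the two recipes (the function names are my own, not a library API): scale the standard deviation of the weight distribution by the fan sizes, and the layer output keeps roughly the same variance as its input.

```python
import numpy as np

# Glorot: variance 1/fan_avg. He: variance 2/fan_in.
def glorot_normal(fan_in, fan_out, rng):
    std = np.sqrt(1.0 / ((fan_in + fan_out) / 2))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out, rng):
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 100))   # inputs with variance ~1
W = glorot_normal(100, 100, rng)
out = x @ W.T
# out.var() stays near 1 instead of blowing up by a factor of fan_in.
```

With a naive unit-variance initializer the output variance would be about 100 here; with Glorot scaling it stays close to the input variance.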

## Activation Function

Now we need to change the activation function, because the logistic function saturates easily at both ends. The first suggestion was the ReLU activation function, which does not saturate in the positive range. However, it was not perfect: neurons can die during training, meaning their output goes to zero and stays there. This happens especially when you set a big learning rate, because the negative side of ReLU always gives a zero gradient.

To solve this problem, the Leaky ReLU was introduced. It has a small positive slope, alpha, in the negative range. The alpha is a hyperparameter, with a typical default of 0.01. This small slope prevents the neurons from dying.

Note: RReLU picks the alpha randomly. PReLU learns the alpha during training, so the alpha is no longer a hyperparameter.

ELU outperformed all variants of ReLU. The difference is that it uses an exponential function in the negative range, which we can also control with an alpha. Therefore its gradient is non-zero in the negative range, which avoids dying neurons. If alpha is 1, the function is differentiable everywhere.

SELU is a scaled version of ELU. Its outputs self-normalize to mean 0 and variance 1. However, there are a few conditions:

- The input must be standardized.
- The initializer must be the LeCun initializer.
- The network must be sequential; SELU cannot be used in an RNN.
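The ReLU family is easy to write down directly. Here are the three variants from this section in NumPy (alpha values as given in the text):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)              # zero gradient for z < 0

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)   # small slope keeps neurons alive

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))  # smooth, bounded below

z = np.array([-2.0, -0.5, 0.0, 1.0])
# ReLU zeroes the negative side, Leaky ReLU keeps the slope alpha,
# and ELU curves smoothly down towards -alpha.
```

SELU is the same shape as ELU with fixed scale and alpha constants chosen so that activations self-normalize under the conditions listed above.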

**Note:** In a normal situation, the activation functions can be ranked SELU > ELU > leaky ReLU > ReLU > tanh > logistic. As I mentioned, you should be careful about the conditions for SELU.

# Batch Normalization

Our training speed was improved by normalized values. So can we just insert another algorithm to normalize the outputs? Yes: that is Batch Normalization. It puts the normalization step inside the network itself.

The algorithm calculates the mean and the variance of the mini-batch, normalizes the values, and then scales and shifts them with learned parameters. The scale and shift are learned during training, while running estimates of the mean and the variance are the ones used after training. Batch Normalization also acts as a regularizer. A BN layer can make each training step slower, but convergence is faster because it reaches the optimal point in fewer steps.
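The forward pass of that algorithm fits in a few lines. This is a sketch of the training-time computation only; a real layer would additionally maintain running averages of the mean and variance for use at inference.

```python
import numpy as np

# Batch Normalization forward pass for one mini-batch.
# gamma (scale) and beta (shift) are the learned parameters.
def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta              # then scale and shift

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(64, 10))      # activations with mean 5, std 3
out = batch_norm(x, gamma=np.ones(10), beta=np.zeros(10))
# out now has mean ~0 and variance ~1 per feature.
```

With gamma = 1 and beta = 0 the layer is a pure normalizer; training moves those parameters wherever the network needs them.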

# Transfer Learning

Do we always have to retrain neural networks from scratch for every new purpose? Imagine you need to build a network for MRI images and your colleagues already have one for X-ray images. Can we recycle it? Yes, we can. If you think about the structure of a neural net, the last layers and the output layer play the most task-specific role, and the output layer varies with the task. Therefore, we replace the output layer of the pre-trained model and test how many of the deeper layers we can reuse. You freeze the reused layers, making them non-trainable, train on your data, and decide from the results which layers to keep frozen.
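A framework-free sketch of the idea, with entirely synthetic stand-ins for the pretrained weights and the new task: the "pretrained" hidden layer is frozen, and only a freshly attached output head is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
W_pre = rng.normal(size=(8, 4))   # frozen "pretrained" hidden layer
X = rng.normal(size=(100, 4))     # new task's inputs (toy data)
y = X @ np.ones(4)                # new task's targets (toy data)

features = np.tanh(X @ W_pre.T)   # reuse the frozen layers as a feature extractor
w_out = np.zeros(8)               # only the new head is trainable
for _ in range(2000):
    pred = features @ w_out
    grad = features.T @ (pred - y) / len(y)  # gradient w.r.t. the head only
    w_out -= 0.1 * grad           # W_pre is never updated
```

Only `w_out` receives gradient steps; in a framework this is what setting the reused layers non-trainable accomplishes.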

# Regularization

Neural networks have massive numbers of parameters. This gives high accuracy, but it also causes overfitting: low bias and high variance on the test data. We need to control the variance.

## l1 & l2 Regularization

I explained this regularization in this post

## Dropout

It can improve your model accuracy by up to 2%. That may not sound like much, but it is a lot of improvement; you will see once you stay in this field. The method is really easy: you assign a probability to every neuron except the output neurons. The probability says how likely the neuron is to be deactivated at each training step. The rule-of-thumb probability is 10–50%: 20–30% for RNNs and 40–50% for CNNs. It works because the neurons are trained to be tolerant of small changes; each neuron learns to handle different tasks in the absence of its neighbours.
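The mechanics are a random mask over the activations. This sketch uses the "inverted dropout" convention, where the surviving activations are scaled up during training so their expected value is unchanged:

```python
import numpy as np

# Inverted dropout: zero each activation with probability p,
# scale the survivors by 1/(1-p) so the expectation stays the same.
def dropout(a, p, rng):
    mask = rng.random(a.shape) >= p   # keep with probability 1 - p
    return a * mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones((4, 1000))
out = dropout(a, p=0.5, rng=rng)
# Roughly half the activations are zero; the rest are 2.0.
```

At test time the full network runs with no mask, and thanks to the scaling no extra correction is needed.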

## Monte Carlo Dropout

Dropout effectively trains many models and combines their answers into one model. MC Dropout takes this further and asks: how confident are you in your result? It keeps dropout turned on at test time and makes repeated predictions. Each pass gives a different result, because each pass is a different model. We average the predictions and use that as the answer; that is MC Dropout. The variance of the predictions also gives you an uncertainty estimate.
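A sketch of the test-time loop, with a made-up one-layer "network" standing in for a trained model: sample many stochastic forward passes, then report their mean and spread.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(1, 16))       # stand-in for trained output weights
h = rng.normal(size=(16,))         # stand-in for hidden activations

def stochastic_predict(p=0.2):
    # Dropout stays ON at test time: a fresh mask per forward pass.
    mask = rng.random(h.shape) >= p
    return float(W @ (h * mask / (1 - p)))

samples = np.array([stochastic_predict() for _ in range(200)])
mean_pred = samples.mean()         # the MC Dropout prediction
uncertainty = samples.std()        # spread across the implicit models
```

Each masked pass is one member of the implicit ensemble; averaging 200 of them approximates the ensemble prediction, and the standard deviation says how much the members disagree.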

## Max Norm Regularization

You can also regularize the weights themselves: Max-Norm Regularization constrains the l2 norm of each neuron’s weight vector to stay below a hyperparameter.

# Optimization

Until now, we have not touched the cost function or the gradient descent update itself. Now we can manipulate them for faster training.

## Momentum

Have you ever ridden a bike down a hill? Once you pick up speed, you cannot stop until you reach flat ground or stumble. Can we accelerate training the same way? Yes, we can. We do not use the gradient as the speed; we use it as an acceleration. The gradient updates a velocity vector v, called the momentum, and the weights move by v. The hyperparameter rho, between 0 and 1, is the fraction of momentum kept from the previous step.
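In code, the update is two lines; the function f(w) = w² and the hyperparameter values are illustrative choices:

```python
import numpy as np

# Momentum on f(w) = w**2: v accumulates the gradient (acceleration),
# and the weight moves by v, not by the raw gradient.
def momentum_descent(lr=0.1, rho=0.9, steps=100):
    w, v = 5.0, 0.0
    for _ in range(steps):
        grad = 2 * w             # gradient of w**2
        v = rho * v - lr * grad  # keep a fraction rho of the old momentum
        w = w + v                # move by the momentum
    return w

w_final = momentum_descent()
```

With rho = 0 this reduces to plain gradient descent; with rho near 1 the iterate coasts through shallow regions instead of crawling.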

## Nesterov Accelerated Gradient

If you had a drone that could scout the terrain of the hill you are about to descend, you would definitely use it. Nesterov Accelerated Gradient does exactly that: the gradient is calculated not at the present point but at the estimated future point, one momentum step ahead. It trains faster than plain momentum in almost all cases.

## AdaGrad

We can also change the learning rate during training. AdaGrad is rarely used for neural nets anymore, but it influenced many later methods. The other methods bounce a lot and are often moving too fast to stop at the global minimum; AdaGrad is designed to slow down near the optimum. It accumulates the squared gradients and divides the learning rate by the root of that accumulation, so the effective step size shrinks during training. Its major problem is that it slows down a bit too fast.

**Note:** RMSprop solved this problem by keeping only a decaying average of the squared gradients instead of the full sum.
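Side by side, the two accumulators differ by one line. This is a toy comparison on f(w) = w² with made-up hyperparameters:

```python
import numpy as np

# AdaGrad: divide the step by the root of ALL accumulated squared gradients.
def adagrad_step(w, s, grad, lr=0.5, eps=1e-8):
    s = s + grad ** 2
    return w - lr * grad / np.sqrt(s + eps), s

# RMSprop: keep only a decaying average, so the step does not vanish.
def rmsprop_step(w, s, grad, lr=0.5, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * grad ** 2
    return w - lr * grad / np.sqrt(s + eps), s

w_a = w_r = 5.0
s_a = s_r = 0.0
for _ in range(100):
    w_a, s_a = adagrad_step(w_a, s_a, 2 * w_a)
    w_r, s_r = rmsprop_step(w_r, s_r, 2 * w_r)
```

AdaGrad's accumulator `s_a` only ever grows, which is exactly why its steps eventually become too small; RMSprop's `s_r` forgets old gradients and keeps moving.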

## Adam & Nadam

Adam combines Momentum with the adaptive learning rates of RMSprop (which grew out of AdaGrad). Nadam is Adam with Nesterov Accelerated Gradient.
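The combination is visible in the update: a momentum-style first moment m, an RMSprop-style second moment v, plus a bias correction for the early steps. Again f(w) = w² is just a toy objective:

```python
import numpy as np

def adam(lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    w, m, v = 5.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2 * w
        m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
        v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (scale)
        m_hat = m / (1 - beta1 ** t)              # bias correction: early m and v
        v_hat = v / (1 - beta2 ** t)              # start at 0 and would be too small
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w_final = adam()
```

The bias correction matters only for the first few dozen steps, after which the 1 - betaᵗ terms are effectively 1.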

If you need more explanation or more optimizer, you can check this: https://ruder.io/optimizing-gradient-descent/index.html#nadam

**This post is published on 9/27/2020**