Original article can be found here (source): Deep Learning on Medium
[DL] 7. Optimization Techniques on Gradient Descent and Learning Rate
(1) Learning vs Optimization
Optimization simply aims to reduce the training error, whereas learning aims to reduce the generalization error while keeping the training error small.
(2) Cost Function
The empirical risk is defined over the entire training set of m samples. In most cases, however, we do not use the whole dataset to make a single weight update, so we do not compute this expected value exactly. Furthermore, evaluating the empirical risk over the whole dataset can be prohibitively expensive, as modern vision datasets contain millions of training images.
As a result, it is more common to evaluate the model and update the parameters (weights) using a subset of the dataset, a technique called mini-batch training. In the majority of cases, learning algorithms converge faster this way, since we use rapid approximations of the gradients rather than the slower exact gradients.
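As a rough sketch of the idea (using a hypothetical toy linear-regression loss, NumPy only), each update estimates the gradient from a randomly sampled mini-batch instead of the full training set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: linear regression with loss L(w) = mean((Xw - y)^2)
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w

def minibatch_grad(w, batch_size=32):
    """Estimate the gradient of the loss on a random mini-batch."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return (2.0 / batch_size) * Xb.T @ (Xb @ w - yb)

w = np.zeros(5)
lr = 0.05
for _ in range(500):
    w -= lr * minibatch_grad(w)  # one cheap, approximate update per mini-batch
```

Each step uses only 32 of the 1,000 samples, so a single update is far cheaper than a full-batch gradient while still pointing, on average, in the right direction.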
(4) Exploding gradients
The cost function of a neural network with many layers and non-linear activations often has very steep regions in the parameter (weight) space. Such cliffs and sharp regions result from multiplying large weights together across many layers.
The problem with these cliffs arises when the parameters (weights) get close to such sharp regions, since the cliffs return significantly large gradients. Because the step size of a parameter update made by gradient descent depends on the norm of the gradient with respect to that parameter, a large gradient can push the updated parameter unexpectedly far away from the minima. This often destroys the progress of the previous optimization work on the parameters.
- How to solve the gradient exploding issue?
This issue can be resolved using gradient clipping, which simply disallows any weight update larger than a predetermined threshold. More intuitively, gradient clipping caps the step taken in parameter space to prevent the weights from moving too far away.
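A minimal sketch of clip-by-norm (NumPy; the threshold value here is an arbitrary choice for illustration):

```python
import numpy as np

def clip_gradient(grad, threshold=1.0):
    """If the gradient's L2 norm exceeds the threshold, rescale it so the
    norm equals the threshold; the direction is preserved, only the step
    size is capped."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

# A gradient from a "cliff" region: huge norm, direction kept after clipping.
g = np.array([300.0, -400.0])  # norm = 500
clipped = clip_gradient(g)     # norm is now capped at 1.0
```

Note that clipping rescales the whole vector rather than clamping each component, so the update still points in the direction the gradient suggested.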
For more information on SGD, please take a look at the following link: SGD
(6) Learning Rate
The learning rate 𝛜 is one of the most crucial hyperparameters of a neural network. It defines how large or small the step size in parameter space is: the larger the learning rate, the farther the parameters move per update. In practice, it is common to decay the learning rate over iterations.
This is reasonable because in the early stage of learning the weights are likely to be far from the minima, so a large step size is desirable, whereas in the final phase of learning the weights are already close to the minima, so making small updates on the weights (parameters) increases the chance of reaching them. We will cover some techniques for changing the learning rate over iterations in a later part of this chapter.
Figure 5 illustrates the behavior of the loss function over epochs for different values of the learning rate. With an appropriately set learning rate, the loss shows a descending trend: a dramatic decrease in the first few epochs, followed by slow convergence in the final phase of training.
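One common way to implement this decay is a linear schedule that shrinks the learning rate over the first tau iterations and then holds it constant (the function and parameter names below are my own; the specific values are placeholders):

```python
def linear_decay_lr(k, lr0=0.1, lr_tau=0.001, tau=100):
    """Decay the learning rate linearly from lr0 at iteration 0 to lr_tau
    at iteration tau, then hold it constant at lr_tau."""
    if k >= tau:
        return lr_tau
    alpha = k / tau
    return (1 - alpha) * lr0 + alpha * lr_tau

print(linear_decay_lr(0))             # 0.1    -- large steps early in training
print(round(linear_decay_lr(50), 4))  # 0.0505 -- halfway through the schedule
print(linear_decay_lr(500))           # 0.001  -- small steps near the minima
```

This matches the intuition above: big steps while the weights are far from the minima, small careful steps once they are close.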
2. SGD with momentum
SGD is great, but it is often not as fast as we want. To speed up training, we can apply the concept of momentum.
The momentum 𝛎 accumulates the gradients from the first iteration up to the current step, weighted by an exponential decay term 𝛂. What this accumulation in 𝛎 means is that if the directional derivative g computed in the current iteration points in the same direction as 𝛎, then we make the weight update in the direction of g but with a larger step size; that is, we move the weights farther in parameter space than we would without the momentum 𝛎. As a result, SGD with momentum 𝛎 eventually reaches the destination faster than SGD without it.
Let’s take a closer look at its weight-update rule and graphs.
The exponential decay term 𝛂 ∈ [0, 1) determines how quickly the contributions of previously accumulated gradients decay.
As shown in the update rule, if we always get the same gradient g, then 𝛎 accelerates in the direction of -g. For example, suppose the initial value of 𝛎 is zero and we obtain the gradient g in every iteration. Then the momentum in the second iteration is 𝛎 = 𝛂𝛎 − 𝛜g = −𝛜g, and in the third iteration 𝛎 = 𝛂𝛎 − 𝛜g = 𝛂(−𝛜g) − 𝛜g = −𝛜g(1 + 𝛂), which is larger than −𝛜g. Therefore, SGD with momentum makes a larger update on the weights 𝚯 in parameter space when the same gradient is repeatedly obtained, and this eventually leads SGD to reach the minima faster. Figure 6 is an illustration of the above explanation.
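The arithmetic above can be checked with a small sketch of the update rule 𝛎 ← 𝛂𝛎 − 𝛜g, 𝚯 ← 𝚯 + 𝛎 (plain Python; the constant gradient is a toy assumption):

```python
def sgd_momentum_step(theta, v, grad, lr=0.1, alpha=0.9):
    """One SGD-with-momentum update: v <- alpha*v - lr*grad; theta <- theta + v."""
    v = alpha * v - lr * grad
    return theta + v, v

# With the same gradient g = 1 at every iteration, the step grows:
# -lr*g, then -lr*g*(1 + alpha), approaching -lr*g / (1 - alpha) in the limit.
theta, v = 0.0, 0.0
steps = []
for _ in range(3):
    theta, v = sgd_momentum_step(theta, v, grad=1.0)
    steps.append(round(v, 3))
print(steps)  # [-0.1, -0.19, -0.271]
```

With 𝛂 = 0.9 the limiting step is ten times the plain-SGD step, which is why momentum is often described as multiplying the effective learning rate by 1/(1 − 𝛂).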
There is a variation of the momentum method called Nesterov momentum.
Again, let’s take a look at the update-rule of Nesterov Momentum and find the difference.
The only difference from the previously mentioned momentum method is that the gradient is evaluated on the loss function after applying the current velocity to the weights.
With this change, Nesterov momentum tries to apply a correction factor to standard momentum, in hopes of speeding up convergence. If the cost function is convex and the gradients are computed over the entire training set, Nesterov momentum can indeed speed up convergence.
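A minimal sketch of the difference (the 1-D quadratic loss here is a hypothetical example): the gradient is taken at the look-ahead point 𝚯 + 𝛂𝛎 rather than at 𝚯 itself:

```python
def nesterov_step(theta, v, grad_fn, lr=0.1, alpha=0.9):
    """Nesterov momentum: evaluate the gradient at theta + alpha*v (the point
    the velocity is about to carry us to), then update as usual."""
    g = grad_fn(theta + alpha * v)  # the only change vs. standard momentum
    v = alpha * v - lr * g
    return theta + v, v

# Convex 1-D example: L(theta) = theta^2, so grad(theta) = 2*theta.
grad_fn = lambda th: 2.0 * th
theta, v = 5.0, 0.0
for _ in range(100):
    theta, v = nesterov_step(theta, v, grad_fn)
# theta is now very close to the minimum at 0
```

Looking ahead before computing the gradient lets the method start correcting the step while the velocity is still carrying the weights past the minimum, instead of one iteration later.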
3. Adaptive Learning Rate
As I’ve mentioned in section ‘1.(6): Learning Rate’, a large learning rate is desirable at the beginning of learning, whereas a small one is appropriate in the later phase. This means we need to decrease the learning rate as training proceeds. There are three representative methods that adaptively change the learning rate over the epochs.
AdaGrad: divides a fixed learning rate by the square root of the sum of all previously obtained squared gradients.
RMSProp: replaces the gradient accumulation of AdaGrad with an exponentially weighted moving average.
Adam: adds momentum to RMSProp.
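The three update rules can be sketched as follows (NumPy; the hyperparameter defaults are common choices, not prescriptions, and the small constant EPS guards against division by zero):

```python
import numpy as np

EPS = 1e-8  # small constant for numerical stability

def adagrad_step(theta, r, g, lr=0.1):
    """AdaGrad: divide the learning rate by the square root of the running
    sum of squared gradients, accumulated in r."""
    r = r + g * g
    return theta - lr * g / (np.sqrt(r) + EPS), r

def rmsprop_step(theta, r, g, lr=0.01, rho=0.9):
    """RMSProp: the same scaling, but r is an exponentially weighted moving
    average instead of an ever-growing sum."""
    r = rho * r + (1 - rho) * g * g
    return theta - lr * g / (np.sqrt(r) + EPS), r

def adam_step(theta, s, r, g, t, lr=0.01, beta1=0.9, beta2=0.999):
    """Adam: RMSProp-style scaling plus a momentum (first-moment) estimate s,
    with bias correction for the zero-initialized moments (t starts at 1)."""
    s = beta1 * s + (1 - beta1) * g
    r = beta2 * r + (1 - beta2) * g * g
    s_hat = s / (1 - beta1 ** t)  # bias correction
    r_hat = r / (1 - beta2 ** t)
    return theta - lr * s_hat / (np.sqrt(r_hat) + EPS), s, r

# Bias correction makes Adam's first step roughly lr, whatever the gradient scale:
theta, s, r = adam_step(0.0, 0.0, 0.0, g=1000.0, t=1)
print(round(theta, 4))  # -0.01
```

Because each parameter is scaled by its own gradient history, parameters with consistently large gradients take smaller effective steps, which is the adaptive behavior these methods share.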
Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Any corrections, suggestions, and comments are welcome.
The contents of this article are reproduced based on Bishop and Goodfellow.