Overview of optimizers for DNN: when and how to choose which optimizer.

Part 1: Review of optimization methods of DNN

Section 1–1. Intuitive perspective of optimization

The goal of optimizing a DNN is to find the best parameters w that minimize the loss function f(w, x, y) given the data x and the labels y. Gradient descent (GD) is the most frequently used method in machine learning. This method requires another parameter, the learning rate α.

To sum up:

  1. w, the parameters to be optimized
  2. the loss function f(w, x, y), written as f(w) below for simplicity
  3. the learning rate α

Now we start gradient descent, at each batch/step t:
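As a minimal sketch, here is one common way to write what happens at each step t; the notation is an assumption on my part, chosen to be consistent with the steps 1, 2 and 4 referred to later in this article.

```latex
% Generic per-step framework (a sketch, one common convention).
% For plain SGD, m_t = g_t and V_t is identically 1.
\begin{align*}
&\text{Step 1 (gradient):}         & g_t     &= \nabla_w f(w_t) \\
&\text{Step 2 (momentum):}         & m_t     &= \phi(g_1, \dots, g_t), \quad V_t = \psi(g_1, \dots, g_t) \\
&\text{Step 3 (update value):}     & \eta_t  &= \alpha \, m_t / \sqrt{V_t} \\
&\text{Step 4 (parameter update):} & w_{t+1} &= w_t - \eta_t
\end{align*}
```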

Let’s review all the optimization methods from this perspective in Section 1–2. The main differences between them lie in steps 1 and 2.

Section 1–2. Development of optimization

Batch gradient descent (BGD)

BGD computes the gradient in step 1 over the entire training dataset for a single update, so it can be very slow to converge.

If the training dataset is too large to fit into memory, BGD becomes intractable. Moreover, BGD does not allow us to update the model online, for example with new examples arriving on the fly.

Stochastic gradient descent (SGD)

SGD is the complete opposite of BGD: it computes step 1 with only one sample. Each update becomes much faster, but the descent fluctuates heavily, because the gradient direction estimated from a single example is noisy. It can even point in the opposite direction, which amounts to gradient ascent.

Mini-batch gradient descent

Mini-batch gradient descent is the trade-off between SGD and BGD. It uses a subset of the data (n > 1 samples) to compute the gradient in step 1 and update the parameters w. This is the most popular setup in modern machine learning. Nowadays, when people say SGD they usually mean mini-batch SGD, so in the rest of this story SGD refers to SGD on mini-batches.
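As a minimal sketch in plain NumPy (loss_grad here is a hypothetical helper standing in for backpropagation, not a library function), one mini-batch step looks like this:

```python
import numpy as np

def sgd_minibatch_step(w, X, y, loss_grad, alpha=0.01, batch_size=32,
                       rng=np.random.default_rng()):
    """One mini-batch SGD step.

    loss_grad(w, X_batch, y_batch) is assumed to return the gradient of the
    loss f(w, x, y) averaged over the batch.
    """
    # Sample a mini-batch of size n > 1 (but much smaller than the full dataset).
    idx = rng.choice(len(X), size=batch_size, replace=False)
    # Step 1: compute the gradient on the mini-batch only.
    g = loss_grad(w, X[idx], y[idx])
    # Step 4: move against the gradient, scaled by the learning rate alpha.
    return w - alpha * g
```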

SGD with momentum

Motivation: SGD has trouble navigating ravine-like loss surfaces and is easily stuck at saddle points.

(Figure from [1])

The left image above shows the optimization trace: the vertical steps are large while the horizontal steps are small. What we actually need is the opposite, a large step along the horizontal direction and a small step along the vertical direction, as depicted in the right image.

Solution: compute the update value for w with momentum. The implementation uses an exponential average of past gradients; the exponential rate β is usually set to 0.9.
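A minimal sketch of this update, assuming the standard exponential-average formulation (some implementations drop the (1 − β) factor):

```latex
% SGD with momentum (exponential average of past gradients; beta is usually 0.9)
\begin{align*}
m_t     &= \beta \, m_{t-1} + (1 - \beta) \, g_t \\
w_{t+1} &= w_t - \alpha \, m_t
\end{align*}
```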

SGD with Nesterov acceleration

Motivation: a deep learning model has hundreds of thousands of parameters or more. When we optimize in such a high-dimensional space, it is easy to fall into a local minimum. If we give the optimizer the ability to look ahead, it has a better chance of stepping out of a local minimum. Let’s look at this from another perspective: in step 4, we update the parameters w with the gradient. SGD with Nesterov acceleration approximates the next w by adding the previous update value, computes the gradient at that predicted position, and then treats the resulting update as a correction of the previous update applied to the current w.

More explanation can be found in [5] and [6].

Solution:
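A minimal sketch of one common look-ahead formulation (conventions differ slightly between papers and libraries): the gradient is evaluated at the position predicted by the previous update, and the new momentum then corrects it.

```latex
% SGD with Nesterov acceleration (look-ahead variant of the momentum sketch above)
\begin{align*}
\tilde{w}_t &= w_t - \alpha \, \beta \, m_{t-1}
  && \text{predicted next position, using the previous update value} \\
m_t         &= \beta \, m_{t-1} + (1 - \beta) \, \nabla_w f(\tilde{w}_t)
  && \text{gradient evaluated at the look-ahead point} \\
w_{t+1}     &= w_t - \alpha \, m_t
\end{align*}
```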

AdaGrad

Motivation: momentum gives SGD the ability to adapt the update value according to the gradient history, but the learning rate is still the same for every parameter in w. Can we update different parameters with different learning rates depending on their importance? Imagine that, during training, we update slowly the parameters associated with frequent features and quickly those associated with infrequent features, so that all parameters converge at a similar rhythm.

Solution:
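A minimal sketch of the AdaGrad update (all operations are element-wise, so each parameter effectively gets its own learning rate; ε is a small constant for numerical stability):

```latex
% AdaGrad: accumulate squared gradients, then scale the learning rate per parameter
\begin{align*}
V_t     &= V_{t-1} + g_t^2 \;=\; \sum_{\tau=1}^{t} g_\tau^2 \\
w_{t+1} &= w_t - \frac{\alpha}{\sqrt{V_t} + \epsilon} \, g_t
\end{align*}
```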

AdaDelta/RMSprop

Motivation: the second-order momentum computed in AdaGrad accumulates the entire gradient history, so it grows without bound when training is long, driving the update value arbitrarily close to 0. AdaDelta/RMSprop add an exponential average to the calculation of the second-order momentum to overcome this problem.

Solution:
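A minimal sketch of the RMSprop version, with β2 typically around 0.9 (AdaDelta additionally rescales by an average of past squared updates, which is omitted here):

```latex
% RMSprop: exponential average of squared gradients instead of the full sum
\begin{align*}
V_t     &= \beta_2 \, V_{t-1} + (1 - \beta_2) \, g_t^2 \\
w_{t+1} &= w_t - \frac{\alpha}{\sqrt{V_t} + \epsilon} \, g_t
\end{align*}
```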

Adam

Motivation: Adam is the most frequently used optimizer, since it combines SGD with momentum and RMSprop: a first-order momentum of the gradients and a second-order momentum of their squares.

Solution:
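A minimal sketch of the Adam update, with the usual bias correction; typical defaults are β1 = 0.9 and β2 = 0.999:

```latex
% Adam: first-order momentum (as in SGD with momentum) + second-order momentum (as in RMSprop)
\begin{align*}
m_t &= \beta_1 \, m_{t-1} + (1 - \beta_1) \, g_t,
  & \hat{m}_t &= m_t / (1 - \beta_1^t) \\
V_t &= \beta_2 \, V_{t-1} + (1 - \beta_2) \, g_t^2,
  & \hat{V}_t &= V_t / (1 - \beta_2^t) \\
w_{t+1} &= w_t - \frac{\alpha}{\sqrt{\hat{V}_t} + \epsilon} \, \hat{m}_t
\end{align*}
```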

Nadam

Motivation: does Adam combine all the methods discussed so far? Not quite, we forgot Nesterov. Integrating Nesterov acceleration into Adam gives Nadam.

Solution:
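A minimal sketch, following the commonly cited formulation and reusing the bias-corrected terms and g_t from the Adam sketch above; the Nesterov-style look-ahead appears in blending the corrected momentum with the current gradient:

```latex
% Nadam: Adam's update with a Nesterov-style look-ahead on the momentum term
\[
w_{t+1} = w_t - \frac{\alpha}{\sqrt{\hat{V}_t} + \epsilon}
          \left( \beta_1 \, \hat{m}_t + \frac{(1 - \beta_1) \, g_t}{1 - \beta_1^t} \right)
\]
```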

So far, we have reviewed most DNN optimizers from this intuitive perspective.

Some people think that methods like Adam and SGD with momentum are just SGD with a learning-rate scheduler. My answer is yes and no: a scheduled learning rate can have an effect similar to SGD with momentum, but it cannot give each parameter its own learning rate.