Effect of Gradient Descent Optimizers on Neural Net Training

We observe cyclic oscillations in the training loss, due to the cyclic changes in the learning rate. We also see these oscillations, to a lesser extent, in the validation loss.

Best CLR training and validation loss

  • Best validation loss: 0.2318
  • Associated training loss: 0.2267
  • Epochs to converge to minimum: 280
  • Params: Used the settings mentioned above. However, we may be able to obtain better performance by tuning the cycle policy (e.g. by allowing the max and min bounds to decay) or by tuning the max and min bounds themselves (a minimal schedule sketch follows this list). Note that this tuning may offset the time savings that CLR purports to offer.
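To make the cycle policy concrete, here is a minimal sketch of a triangular CLR schedule (following Smith [4]) implemented as a Keras callback. The bounds, step size, and the optional bound-decay flag are illustrative assumptions, not the exact settings used in these experiments.

    import numpy as np
    import tensorflow as tf

    # Triangular CLR schedule; all values below are illustrative, not tuned.
    BASE_LR, MAX_LR = 1e-4, 1e-2  # min/max bounds of the cycle (assumed)
    STEP_SIZE = 10                # epochs per half-cycle (assumed)

    def clr_schedule(epoch, lr=None, decay_bounds=False):
        cycle = np.floor(1 + epoch / (2 * STEP_SIZE))
        x = np.abs(epoch / STEP_SIZE - 2 * cycle + 1)
        amplitude = (MAX_LR - BASE_LR) * max(0.0, 1.0 - x)
        if decay_bounds:  # "triangular2"-style policy: halve the range each cycle
            amplitude /= 2.0 ** (cycle - 1)
        return float(BASE_LR + amplitude)

    clr_callback = tf.keras.callbacks.LearningRateScheduler(clr_schedule)
    # model.fit(..., epochs=300, callbacks=[clr_callback])

Passing functools.partial(clr_schedule, decay_bounds=True) to the callback is one simple way to implement the decaying bounds mentioned above.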

CLR takeaways

  • CLR varies the learning rate cyclically between a min and max bound.
  • CLR is claimed to eliminate the need to tune the learning rate while attaining comparable performance; in our experiments, however, it did not match the best hand-tuned runs.

Comparison

So, after all the experiments above, which optimizer ended up working the best? Let’s take the best run from each optimizer, i.e. the one with the lowest validation loss:

Figure 45: Best validation loss achieved by each optimizer.

Surprisingly, SGD achieves the best validation loss, and by a significant margin. Then, we have SGD with Nesterov momentum, Adam, SGD with momentum, and RMSprop, which all perform similarly to one another. Finally, Adagrad and CLR come in last, with losses significantly higher than the others.

What about training loss? Let’s plot the training loss for the runs selected above:

Figure 46: Training loss achieved by each optimizer for best runs selected above.

Here, we see some correlation with the validation loss, but Adagrad and CLR perform better than their validation losses would imply.

What about convergence? Let’s first take a look at how many epochs it takes each optimizer to converge to its minimum validation loss:

Figure 47: Number of epochs to converge to the minimum validation loss.

Adam is clearly the fastest, while SGD is the slowest.

However, this may not be a fair comparison, since the minimum validation loss for each optimizer is different. How about measuring how many epochs it takes each optimizer to reach a fixed validation loss? Let’s take the worst minimum validation loss of 0.2318 (the one achieved by CLR), and compute how many epochs it takes each optimizer to reach that loss.
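This metric is straightforward to compute from a training history. The sketch below assumes the dict stored on a Keras History object (history.history) and returns None when the threshold is never reached.

    def epochs_to_reach(history, threshold=0.2318, key="val_loss"):
        """First (1-indexed) epoch at which the monitored loss drops to or
        below `threshold`, or None if it never does."""
        for epoch, loss in enumerate(history[key], start=1):
            if loss <= threshold:
                return epoch
        return None

    # e.g., given a dict of each optimizer's best run (hypothetical variable):
    # {name: epochs_to_reach(h.history) for name, h in best_runs.items()}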

Figure 48: Number of epochs to converge to worst minimum validation loss (0.2318, achieved by CLR).

Again, we can see that Adam does converge more quickly to the given loss than any other optimizer, which is one of its purported advantages. Surprisingly, SGD with momentum seems to converge more slowly than vanilla SGD! This is because the learning rate used by the best SGD with momentum run is lower than that used by the best vanilla SGD run. If we hold the learning rate constant, we see that momentum does in fact speed up convergence:

Figure 49: Comparing SGD and SGD with momentum.

As seen above, the best vanilla SGD run (blue) converges more quickly than the best SGD with momentum run (orange), since its learning rate is higher (0.03 vs. 0.01). However, when we hold the learning rate constant by comparing against vanilla SGD at learning rate 0.01 (green), we see that adding momentum does indeed speed up convergence.
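For reference, the three runs in Figure 49 differ only in their SGD settings; a sketch of how they could be set up in Keras is below. The learning rates (0.03 and 0.01) come from the runs above, while the momentum value, the build_model helper, and the loss are assumptions for illustration.

    from tensorflow.keras.optimizers import SGD

    # The three configurations compared in Figure 49. build_model() is a
    # hypothetical helper returning a freshly initialized copy of the network.
    configs = {
        "sgd lr=0.03":          SGD(learning_rate=0.03),                # best vanilla SGD run
        "sgd+momentum lr=0.01": SGD(learning_rate=0.01, momentum=0.9),  # best momentum run (momentum value assumed)
        "sgd lr=0.01":          SGD(learning_rate=0.01),                # same lr without momentum, for a fair comparison
    }

    histories = {}
    for name, opt in configs.items():
        model = build_model()
        model.compile(optimizer=opt, loss="binary_crossentropy")  # loss is task-dependent (assumed here)
        histories[name] = model.fit(x_train, y_train,
                                    validation_data=(x_val, y_val),
                                    epochs=300, verbose=0)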

Why does Adam fail to beat vanilla SGD?

As mentioned in the Adam section, others have also noticed that Adam sometimes works worse than SGD with momentum or other optimization algorithms [2]. To quote Vitaly Bushaev’s article on Adam, “after a while people started noticing that despite superior training time, Adam in some areas does not converge to an optimal solution, so for some tasks (such as image classification on popular CIFAR datasets) state-of-the-art results are still only achieved by applying SGD with momentum.” [2] Though the exact reasons are beyond the scope of this article, others have shown that Adam may converge to sub-optimal solutions, even on convex functions.

Conclusions

Overall, we can conclude that:

  • You should tune your learning rate — it makes a large difference in your model’s performance, even more so than the choice of optimizer.
  • On our data, vanilla SGD performed the best, but Adam achieved performance that was almost as good, while converging more quickly.
  • It is worth trying out different values for rho in RMSprop and the beta values in Adam, even though Keras recommends using the default params (a sketch of where these parameters live follows below).
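For the last point, these hyperparameters are exposed directly on the Keras optimizer constructors. The alternative values below are only examples of a small grid to sweep, not settings validated by our experiments.

    from tensorflow.keras.optimizers import RMSprop, Adam

    # Keras defaults: RMSprop(rho=0.9), Adam(beta_1=0.9, beta_2=0.999).
    candidate_optimizers = [
        RMSprop(learning_rate=1e-3, rho=0.9),                  # default rho
        RMSprop(learning_rate=1e-3, rho=0.99),                 # slower-moving squared-gradient average
        Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.999),    # defaults
        Adam(learning_rate=1e-3, beta_1=0.85, beta_2=0.99),    # shorter-memory example values
    ]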

References

[0] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, Chapter 8: Optimization for Training Deep Models. https://www.deeplearningbook.org/contents/optimization.html

[1] Diederik P. Kingma and Jimmy Lei Ba. Adam: A Method for Stochastic Optimization. 2014. arXiv:1412.6980v9

[2] Vitaly Bushaev. Adam: latest trends in deep learning optimization. https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c

[3] Sebastian Ruder. An overview of gradient descent optimization algorithms. https://ruder.io/optimizing-gradient-descent/index.html#adagrad

[4] Leslie N. Smith. Cyclical Learning Rates for Training Neural Networks. 2015. arXiv:1506.01186. https://arxiv.org/pdf/1506.01186.pdf