Adam optimizer? Yes, but rectified please!

Source: Deep Learning on Medium

Liu, Jiang, He et al. describe a new variant of the well-known Adam optimizer for training neural networks in their recent paper “On the Variance of the Adaptive Learning Rate and Beyond” (published on 8 Aug 2019).

They report their findings on the effects of variance and momentum during training with the Adam optimizer. RAdam (Rectified Adam) introduces a new technique for adapting the learning rate through automated, dynamic adjustment. According to the paper, training with RAdam should be more robust than training with the plain vanilla Adam optimizer.

The main goal of RAdam is more stable training, with convergence that is less sensitive to the choice of the learning rate. A major problem with adaptive learning rate optimizers such as Adam and RMSProp is that they can get stuck in local minima instead of converging to the global minimum. More modern frameworks (like fastai) include a warm-up phase in their training methods. During this phase, the learning rate differs from the one used for the rest of training.

The authors of the paper examined the poorly understood topic of warm-up heuristics. They found that the adaptive learning rate of such optimizers tends to have an excessively large variance. This especially affects the early stages of training and causes the optimizer to make big jumps. These jumps can lead the optimizer to bad decisions and leave it stuck in local optima instead of finding the global minimum.
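The effect can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes gradients drawn from a standard normal distribution and the common Adam default beta2 = 0.999, and the function name `adaptive_lr_samples` is made up for illustration. It compares how much the adaptive scaling factor 1/sqrt(v_hat) varies across independent runs after 2 steps and after 200 steps.

```python
import math
import random
import statistics


def adaptive_lr_samples(step, trials=2000, beta2=0.999, seed=0):
    """Sample Adam's scaling factor 1/sqrt(v_hat) after `step` updates.

    Assumption for illustration: gradients are i.i.d. standard normal.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        v = 0.0
        for _ in range(step):
            g = rng.gauss(0.0, 1.0)
            # Exponential moving average of the squared gradient.
            v = beta2 * v + (1.0 - beta2) * g * g
        # Bias-corrected second moment estimate, as in Adam.
        v_hat = v / (1.0 - beta2 ** step)
        samples.append(1.0 / (math.sqrt(v_hat) + 1e-8))
    return samples


early_var = statistics.variance(adaptive_lr_samples(step=2))
late_var = statistics.variance(adaptive_lr_samples(step=200))
print("variance after 2 steps:  ", early_var)
print("variance after 200 steps:", late_var)
```

Under these assumptions the variance after 2 steps should come out far larger than after 200 steps, which matches the intuition that the scaling factor is unreliable while only a few gradients have been seen.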

To avoid this issue, warm-up phases are added to the optimizers, during which a much lower learning rate is used. This visualization shows the internals of the Adam optimizer, with and without a warm-up phase, during ten iterations.
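One common form of such a heuristic is a linear warm-up. Here is a minimal sketch, assuming a linear ramp over a hand-picked number of steps. The function name and the default values are illustrative and are not taken from the paper or from any framework.

```python
def linear_warmup_lr(step, base_lr=1e-3, warmup_steps=2000):
    """Return the learning rate for a 0-based `step` under linear warm-up.

    The rate ramps linearly up to `base_lr` over `warmup_steps` steps
    and stays at `base_lr` afterwards.
    """
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, (step + 1) / warmup_steps)


# Print the schedule at a few points of a 2000-step warm-up.
for step in (0, 499, 999, 1999, 5000):
    print(step, linear_warmup_lr(step))
```

The drawback is visible in the signature: `warmup_steps` is one more hyperparameter that has to be chosen by hand for each problem.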

We can see that the plain vanilla implementation of the Adam optimizer may make bad decisions in early training phases because too little data has been seen at that point.

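The authors of the paper obtained an effect similar to warm-up with a variant of Adam that uses no warm-up: during the first 2k iterations only the statistics for the adaptive learning rate are collected, while the momentum and the parameters are left unchanged.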

This led to the conclusion that warm-up acts as a kind of variance reduction for the Adam optimizer. The reduced variance can prevent Adam from getting stuck in local optima.

Let’s rectify Adam!

But one major problem remains: we do not know the degree of warm-up that is required, because it varies from dataset to dataset. The authors therefore designed a mathematical algorithm that manages the variance dynamically. This is exactly what “rectified” refers to. The rectifier term scales down the adaptive learning rate early in training and lets it slowly and continuously reach its full effect as the variance estimate becomes reliable.
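As a sketch of what this looks like in practice, the rectification term defined in the paper can be computed from the step count and beta2 alone. The function name below is hypothetical, and beta2 = 0.999 is assumed as the usual Adam default. The function returns None for steps where the paper treats the variance as intractable (rho_t <= 4) and applies an update without the adaptive term.

```python
import math


def rectification_term(step, beta2=0.999):
    """Rectification term r_t for a 1-based `step`, following the paper.

    Returns None while rho_t <= 4, i.e. while the variance of the
    adaptive learning rate is considered intractable.
    """
    # Maximum length of the approximated simple moving average.
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    # Length of the approximated simple moving average at this step.
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        return None
    numerator = (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
    denominator = (rho_inf - 4.0) * (rho_inf - 2.0) * rho_t
    return math.sqrt(numerator / denominator)


for step in (1, 4, 5, 100, 1000, 10000):
    print(step, rectification_term(step))
```

With beta2 = 0.999 the term is undefined for the first four steps, starts at a small value on step 5, and then grows toward 1, so the adaptive learning rate is phased in gradually without a hand-tuned schedule.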

RAdam is able to manage the adaptive learning rate dynamically. It does so based on the underlying divergence of the variance. As a result, RAdam gets the effect of a dynamic warm-up phase without the need to tune any warm-up parameters. The figure below shows that RAdam outperforms Adam with conventional warm-up tuning.
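To make the update rule concrete, here is a minimal single-parameter sketch of a RAdam-style optimization loop, written after the algorithm in the paper. It is an illustration under stated assumptions, not the authors' implementation: the function name `radam_minimize`, the toy objective (x - 3)^2, the learning rate, and the small `eps` added for numerical safety are all choices made for this example.

```python
import math


def radam_minimize(grad_fn, x0, lr=0.05, beta1=0.9, beta2=0.999,
                   eps=1e-8, steps=2000):
    """Minimize a 1-D function with a RAdam-style update (sketch)."""
    x, m, v = x0, 0.0, 0.0
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        # First and second moment estimates, as in Adam.
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)
        beta2_t = beta2 ** t
        rho_t = rho_inf - 2.0 * t * beta2_t / (1.0 - beta2_t)
        if rho_t > 4.0:
            # Variance is tractable: rectified adaptive update.
            v_hat = v / (1.0 - beta2_t)
            r_t = math.sqrt(
                ((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
                / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
            )
            x -= lr * r_t * m_hat / (math.sqrt(v_hat) + eps)
        else:
            # Too few samples: plain momentum update, no adaptive term.
            x -= lr * m_hat
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3).
x_final = radam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_final)
```

On this toy problem the result should end up close to the minimum at x = 3. Note that no warm-up length appears among the parameters: the switch between the two branches and the size of the rectification are derived from the step count and beta2.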