Meet DiffGrad: New Deep Learning Optimizer that solves Adam’s ‘overshoot’ issue



Example of short-term gradient changes on the way to the global optimum (center). Image from the paper.

DiffGrad, a new optimizer introduced in the paper “diffGrad: An Optimization Method for Convolutional Neural Networks” by Dubey et al., builds on the proven Adam optimizer by adding an adaptive ‘friction clamp’: it monitors the local change in gradients in order to automatically lock in optimal parameter values that Adam can skip over.

Comparison of results, 300 epochs (from the paper). Note the especially large improvement on CIFAR-100 vs. Adam and SGD with momentum (red column).

When local gradient changes begin to shrink during training, it is often indicative of a nearby minimum, potentially the global one. DiffGrad applies an adaptive clamping effect to lock parameters into that minimum, whereas momentum-driven optimizers like Adam can get close but often fly right past because they cannot decelerate quickly enough. The result is out-performance versus Adam and SGD with momentum, as shown in the test results above.

Training fast, but with some regret: Adam and other ‘adaptive’ optimizers rely on an exponential moving average of the gradients, which lets them take much larger steps (with greater velocity) wherever the gradients are relatively consistent, versus the fixed, plodding steps of SGD.

On the positive side, Adam can therefore move toward convergence much faster and sooner than SGD. That’s why Adam is the usual default optimizer for most deep learning work: it gets you to a reasonable solution fairly quickly.

However, the downside of this acceleration is the risk of going right over the ideal global minimum, or true optimal solution, because an exponential moving average is inherently unable to slow down rapidly when needed. With Adam’s default beta1 of 0.9, each step is based only about 10% on the current gradient and 90% on the accumulated previous gradients.
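To make that concrete, here is a tiny illustration in plain Python (my own sketch, using the default beta1 = 0.9) of how Adam’s first-moment moving average keeps pushing in the old direction for several steps after the gradient flips sign, as it would when a minimum has just been passed:

# Adam's first-moment EMA with the default beta1 = 0.9.
# The gradient is +1.0 for 10 steps, then flips to -1.0 (as if we just passed a minimum).
beta1 = 0.9
m = 0.0
gradients = [1.0] * 10 + [-1.0] * 5
for t, g in enumerate(gradients, start=1):
    m = beta1 * m + (1 - beta1) * g   # exponential moving average of the gradients
    print(f"step {t:2d}: gradient {g:+.1f}  moving average {m:+.3f}")
# After the flip the moving average stays positive for several steps,
# so Adam keeps stepping in the old direction: the 'overshoot' described above.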

This is also why, in some cases, SGD, while slow, can end up with a better final result: it plods along, but once it reaches a good minimum it does not jump back out of it (it just takes a long time to get there).

Locking in to optimal minima with ‘friction clamping’: By contrast, diffGrad monitors the immediate change of the current gradient versus the previous step, and applies an adaptive ‘friction clamp’ that rapidly decelerates the update when the gradient change is small, since a small change implies an optimal solution may be close by.

diffGrad’s friction clamp, version 0 — gradient changes between -5 and +5 are rapidly decelerated. Larger changes are left essentially untouched and proceed at the same speed as regular Adam.
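To see what that curve looks like numerically, here is a quick illustration (my own sketch, assuming version 0’s friction coefficient is the sigmoid of the absolute gradient change, as described in the paper): it sits at 0.5 when the gradient is not changing at all and saturates toward 1.0 once the change exceeds roughly 5.

import math

for dg in [0, 0.5, 1, 2, 3, 5, 7, 10]:
    xi = 1.0 / (1.0 + math.exp(-abs(dg)))   # friction coefficient for a gradient change of dg
    print(f"gradient change {dg:4.1f}  ->  friction coefficient {xi:.3f}")

# Approximate output: 0.0 -> 0.500 (the step is halved, i.e. strongly 'clamped'),
# 2.0 -> 0.881, 5.0 -> 0.993 and beyond (essentially a full Adam-sized step).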

By rapidly and adaptively shrinking the effective step size, diffGrad helps parameters lock into good minima and reduces the very real issue of ramping right over them due to Adam’s (and similar optimizers’) inability to decelerate. (Note: this is one reason learning rates are traditionally decayed over epochs, to allow parameters to ‘settle in’ over time.)
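For reference, here is a minimal sketch of the update rule as I read the paper: the usual Adam step is simply multiplied by the friction coefficient, computed from the difference between the previous and current gradient (variable names here are mine, and this is not the official implementation):

import numpy as np

def diffgrad_step(param, grad, prev_grad, m, v, t,
                  lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One diffGrad-style update for a single parameter (scalar or array).
    m = beta1 * m + (1 - beta1) * grad          # first moment, exactly as in Adam
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment, exactly as in Adam
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Friction coefficient: sigmoid of the absolute gradient change.
    # Small change -> xi near 0.5 (the step is damped, 'clamped').
    # Large change -> xi near 1.0 (behaves like plain Adam).
    xi = 1.0 / (1.0 + np.exp(-np.abs(prev_grad - grad)))
    param = param - lr * xi * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v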

Adam vs diffGrad on synthetic landscapes: Running both optimizers on three different synthetic functions shows how diffGrad is better able to lock parameters into more optimal results.

Synthetic function test — diffGrad is able to lock in to the global minimum with ideal loss. Adam skips over it into a higher local minimum because it is unable to decelerate in time. (image from the paper, annotations added)
Additional test — diffGrad locks into the global minimum, while Adam skips past it and ends up in a local minimum.

In the examples above you can see multiple loss landscapes on which both Adam and diffGrad are run. In both cases, Adam is unable to decelerate in time, moves past the optimal solution, and settles into a less optimal minimum.

These examples show the advantage of diffGrad’s monitoring of the immediate gradient landscape: by applying its friction clamp and rapidly decelerating, it avoids overshooting an optimal solution.

Parameters thus become locked in to better weights, which translates into higher net accuracy for the network.

Escaping local minima and saddle points: The friction clamping is smooth, as shown in the function plot above. The idea is that it provides enough deceleration to stick if the optimizer is in an optimal minimum, while still preserving enough velocity to escape if it is only a local minimum or a saddle point (a flat region whose gradients fail to provide a clear direction).

Several other papers have proposed solutions to this known overshoot issue, but diffGrad addresses it elegantly and robustly.

DiffGrad variants: Note that the paper also delves into several other variants of how the clamping, or friction coefficient, is applied. The code in the paper’s official GitHub repository only offers version 0, which is the version used above.

In testing on FastAI datasets, I found that version 1 (which allows the friction clamp to tighten down further) outperformed version 0, so I have added a version flag to my unofficial implementation to let you test both on your own datasets.

version flag added in my unofficial implementation. I find version 1 performs better on the datasets I tested.

Tips for use: I have tested a flat learning rate, FastAI’s triangular schedule (fit_one_cycle), and the flat + anneal schedule we used to beat the previous FastAI leaderboard records; so far, flat + anneal works best. (You can use fit_fc() if you are on the latest version of FastAI, or I have added our flattenAnneal function to the diffGrad playground notebook in my repository; links below.)
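For reference, a flat-then-anneal schedule can be sketched in fastai v1 roughly like this (my sketch of the idea, built on fastai’s GeneralScheduler; the flattenAnneal function in the notebook may differ in its details):

from fastai.vision import *
from fastai.callback import annealing_cos
from fastai.callbacks.general_sched import GeneralScheduler, TrainingPhase

def flattenAnneal(learn, lr, n_epochs, start_pct):
    # Hold a flat lr for the first start_pct of the iterations,
    # then cosine-anneal it toward zero for the remainder.
    n_iter = len(learn.data.train_dl) * n_epochs
    anneal_start = int(n_iter * start_pct)
    phases = [TrainingPhase(anneal_start).schedule_hp('lr', lr),
              TrainingPhase(n_iter - anneal_start).schedule_hp('lr', lr, anneal=annealing_cos)]
    learn.callbacks.append(GeneralScheduler(learn, phases))
    learn.fit(n_epochs)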

Using diffGrad v1 (the default is v0), I was quickly able to get within 1% of the FastAI leaderboard record for 20 epochs. Considering the amount of learning-rate tuning done for Ranger versus no tuning at all for diffGrad, I’m impressed:

20 epoch run — reused the Ranger learning rate with diffGrad and got within 1% of the global leaderboard results, with no other tuning.

Source code links:

1 — Official repository: https://github.com/shivram1987/diffGrad

2 — Unofficial diffGrad with v1 option and FastAI usage notebook:

https://github.com/lessw2020/Best-Deep-Learning-Optimizers/tree/master/diffgrad

Example usage:
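Roughly, usage looks like this (a sketch only; I am assuming the class exported by the unofficial repo is named DiffGrad and takes a version argument, so check the repo for the exact import path and names):

from functools import partial
from fastai.vision import *
from diffgrad import DiffGrad   # from the unofficial repo; adjust the module path if needed

# Small demo dataset so the snippet runs end to end.
path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)

# Plain PyTorch: DiffGrad is constructed like any other torch optimizer, e.g.
#   optimizer = DiffGrad(model.parameters(), lr=1e-3, version=1)

# FastAI v1: pass it in as the opt_func when building the Learner.
opt_func = partial(DiffGrad, version=1)   # version=1 = the tighter friction clamp
learn = cnn_learner(data, models.resnet18, opt_func=opt_func, metrics=accuracy)
learn.fit_fc(5, 1e-3)   # flat + anneal (fastai >= 1.0.57), or use the flattenAnneal sketch above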

Note that I’ll likely make a short video showing how to use diffGrad as well, for those who would like a more hands-on tutorial.

Summary: DiffGrad provides an innovative solution to a known weakness of adaptive optimizers like Adam, namely the risk of accelerating right past optimal minima.

By monitoring the immediate gradient landscape, diffGrad adaptively decelerates the optimizer to help lock in good minima and thus allow faster, better training for your deep learning networks!