Tumbling down the SGD rabbit hole — part 2

In the first part of this post, we studied the Stochastic Gradient Descent optimizer and discussed five problems that could hamper training of Deep Learning models:

1- Local minima,

2- Slow convergence,

3- Different slopes,

4- Saddle points,

5- Gradient size & distributed training.

In this post, we’ll discuss solutions to these problems and how to apply them.

You know you’re having a bad day when the light at the end of the tunnel is actually a black hole…

1- Local minima

The problem posed by local minima in deep neural networks has been debated for a long time. However, intuition tells us that they should be rare. Indeed, the number of parameters in such networks is very large (millions at least): for a point to be a local minimum, all dimensions should have a positive slope. The probability of this happening would be the inverse of 2 to the power of the number of dimension. The inverse of 2^1,000,000? Of 2^10,000,000? Pretty slim chance…

Based on this, one could also conclude that using larger-than-needed models would actually be a good idea: more parameters would mean lower probability for local minima, right? Indeed, a 2015 paper by Itay Safran and Ohad Shamir (“On the Quality of the Initial Basin in Overspecified Neural Networks”) shows that this is the case: “higher dimensions also means more potential directions of descent, so perhaps the gradient descent procedures used in practice are more unlikely to get stuck in poor local minima and plateaus”.

Another paper published in 2014 paper by Ian J. Goodfellow, Oriol Vinyals and Andrew M. Sax (“Qualitatively characterizing neural network optimization problems”) concludes empirically that local minima are not a problem when training deep neural networks. Yes, they may exist but in practice they’re hardly ever encountered during SGD. Even when they are, it does just fine escaping them, thank you.

When all is said and done, please remember one thing: the purpose of the training process is not to find *the* global mimimum — this is a NP-hard problem, so forget about it. It is to find *a* minimum that generalizes well enough, yielding a test accuracy compatible with our business problem.

Stop worrying and learn to love SGD!

2 & 3 – Slow convergence and different slopes

These problems are related: for very high dimension problems such as deep neural networks, it will take a long time to reach an acceptable minimum if you use a small fixed learning rate.

Over the years, a number of improvements were designed to speed up SGD (Robbins and Monro, 1951). Techniques like Momentum (Polyak, 1964) and Nesterov Accelerated Gradient (Nesterov, 1983) were designed to accelerate progress in the direction of steepest descent.

As Deep Learning gained popularity and as networks grow larger, researchers realized that some dimensions had steep slopes while some exhibited vast plateaus. Thus, they worked on adapt to these different conditions by not only modifying the learning rate but also by using different learning rates for different dimensions.

Enter the Ada* family!

“We’re a happy family, we’re a happy family, hey Mom and Daddy”

Adaptive optimizers

AdaGrad (2011, PDF) came first. For a given parameter, it increases the learning rate if it receives small updates (i.e. speed up on plateaus) and decrease it if it receives larges updates (i.e. slow down on steep slopes). This is achieved by dividing the learning rate by the sum of all squared past gradient updates (aka the l2 norm): larger gradient updates will reduce the learning rate while tiny gradient updates will increase it.

AdaGrad has a problem, however: as the sum grows monotonically, the learning rates will end up converging to zero for large models and long trainings, effectively preventing any further progress.

RMSProp (2012) and AdaDelta (2012) solve this problem by decaying the accumulated sum before adding the new gradient. This prevents the sum from exploding and the learning rates from going to zero.

Adam (2015) adds momentum to AdaDelta, further focusing progress in the direction of steepest descent.

Here’s a great visualization. The adaptive optimizers race to the minimum, momentum and NAG go for a walk in the park before heading out to the right spot… and Grand Pa SGD gets there too but really slowly.

Source: Alec Radford

All these optimizers are available in your favorite Deep Learning library. Adam is a popular choice as it seems to work well in most situations. Does it mean that it can’t be improved? Of course not: research never sleeps and this nice post by Sebastian Ruder lists the latest developments.

SGD strikes back?

One of these developments is surprising. Some researchers now question that adaptive optimizers are the best choice for deep neural networks, as illustrated by “The Marginal Value of Adaptive Gradient Methods in Machine Learning” (Ashia Wilson et al., 2017).

This paper show that well-tuned SGD with learning rate decay at specific epochs ends up outperforming all adaptive optimizers: “Despite the fact that our experimental evidence demonstrates that adaptive methods are not advantageous for machine learning, the Adam algorithm remains incredibly popular. We are not sure exactly as to why, but hope that our step-size tuning suggestions make it easier for practitioners to use standard stochastic gradient methods in their research”.

Now, of course, figuring out the learning rate decay schedule is another complex problem in itself, so I wouldn’t bury Adam just yet 🙂

And now for something completely different…

The FTML optimizer

So far, we’ve only studied SGD and its variants. Of course, there are other techniques out there. One of them is the Follow the Moving Leader algorithm aka FTML (“Follow the Moving Leader in Deep Learning” by Shuai Zheng and James T. Kwok, 2017, PDF).

I won’t go into details in this already long post, but let’s take a quick look at some of the results: learning CIFAR-10 with LeNet and ResNet-110.

LeNet CNN training on CIFAR-10
ResNet-110 training CIFAR-10

Is there a new sheriff in Optimizer City? You’ll find more results in the research paper (tl;dr: FTML matches or surpasses Adam on CIFAR-100 and on LSTM tasks).

FTML is available in Apache MXNet 1.1. Using it is as simple as:


4- Saddle points

Noisy SGD

In 2015, Rong Ge et al. proposed a noisy version of SGD (“Escaping From Saddle Points — Online Stochastic Gradient for Tensor Decomposition”): when updating parameters, adding a tiny amount of noise on top of the gradient is enough to nudge SGD to a slightly lower point instead of getting stuck at the saddle point. Very cool idea 🙂

Off the convex path” — Rong Ge’s blog — is an excellent resource. Check it out.

Random initialization

In 2016, Lee et al. (“Gradient Descent Converges to Minimizers”) showed that even without adding noise, random initialization of parameters lets SGD evade first-order saddle points: “gradient descent with a random initialization and sufficiently small constant step size converges to a local minimizer or negative infinity almost surely”.

All Deep Learning libraries will let you do this. For example, you could use either of these in Apache MXNet:


Breaking out of high-order saddle points

Noisy SGD and random initialization save us from computing the Hessian, which is a costly operation slowing down the training process. Still, in the previous post, I gave you an example of a third-order saddle point defeating the Hessian. Is there hope for these?

In 2016, Anima Anandkumar (a Principal Scientist at AWS) and Rong Ge proposed the first method to solve third-order saddle points: “Efficient approaches for escaping higher order saddle points in non-convex optimization”. I’m not aware of any Deep Learning library implementing this technique, but please feel free to get in touch if there’s one 🙂

5- Gradient size & distributed training

Last month, Jeremy Bernstein et al. proposed two variant of SGD called SignSGD and Signum (adding momentum) : “SIGNSGD: compressed optimisation for non-convex problems”. They solve the gradient size problem by sending only the sign of the gradient, not its 32-bit value! Thus, this reduces the amount of data to send by a factor of 32. The network says thanks 😉

Amazingly, these algorithms converge at the same rate or even faster than Adam, only losing out to highly-tuned SGD (hmmm, again). Here are some results:

Learning ImageNet with Signum

SignSGD is available in Apache MXNet 1.1. Using it is as simple as:


Another interesting paper was published in late 2017 by Yujun Lin et al. : “Deep Gradient Compression: reducing the communication bandwidth for distributed training”. This is based on a complete different technique: gradients are only sent when updates reach a certain threshold value. Good read.


We covered a lot of ground again. Let’s try to sum things up.

  • Local minima are not a problem in very large networks. A word of warning: larger networks will be more expensive to train and will overfit the training set faster.
  • Adaptive optimizers solve the headache of picking a fixed learning and they converge faster.
  • Adam is a popular choice but FTML is the new kid on the block and it means business. Add it to your lineup.
  • SGD can still deliver the best accuracy, at the expense of a lot of complicated tuning. Hopefully hyper-parameter optimization (available in preview in SageMaker) will solve this for us.
  • Saddle points is the real problem we didn’t know we had :-/
  • Combining random weight initialization and a very small learning rate is a proven technique to avoid first-order saddle points.
  • Modern optimizers should be able to break out of first-order and second-order saddle points. It doesn’t look like we have an off-the-shelf solution for higher-order ones at the moment.
  • Sharing gradients during distributed training can become a severe performance bottleneck: the SignSGD optimizer solves the issue with no loss of accuracy.

That’s it for today. I hope that you learned a lot and that these techniques will help you build better models.

As always, thanks for reading! Please feel free to reach out on Twitter.

This should be my intro tape at AWS events 😀

Source: Deep Learning on Medium