“Training takes time” ← That’s just the way it is, and the way it’s always been. The thing is, it’s not just the time you spend waiting for that bloody model to train, it’s the knock-on effects thereof.
It is exceedingly common to spin off multiple workflows at a time, each with its own tweak, and that leads to no end of chaos when it comes to keeping track of what exactly changed. Reproducibility is, well, a bit of a PITA in this field 🙄. In fact, there's tooling out there whose whole job is to work around the chaos of tracking all these pipelined workflows. Ouch!
The underlying issue here is that typically training is done in one of two ways:
1) Throw everything at a single server/GPU (what you do if that’s what you’ve got)
2) Throw it at a server farm (AWS, GCE, whatever) with their concomitant GPUs/TPUs/whatever
Take ResNet-50. You’d typically break the images up into “minibatches” of 256 images each, and then have at it, iterating through your training set. This takes time, though. Oh, less time than it would if you ran everything on just one GPU, but it’s still a lot (around 29 hours with 8 Tesla P100 GPUs).
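To make the setup concrete, here’s a minimal sketch of what one epoch of minibatch training looks like. This is illustrative only — `train_step` is a hypothetical stand-in for the forward/backward pass plus SGD update, not anything from the paper:

```python
import random

def minibatches(dataset, batch_size=256):
    """Yield successive minibatches from a shuffled copy of the dataset."""
    data = list(dataset)
    random.shuffle(data)  # fresh shuffle each epoch
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

def run_epoch(dataset, train_step, batch_size=256):
    """One epoch: iterate the whole training set, one minibatch at a time."""
    for batch in minibatches(dataset, batch_size):
        train_step(batch)  # forward pass, backprop, SGD update
```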
You could increase the minibatch size, but as Krizhevsky¹ shows, this inevitably results in much greater error rates. Which kinda defeats the purpose 😡.
Or, rather, it used to result in much greater error rates. In a recent paper², Goyal et al. show that you can actually raise the minibatch size to 8192 images without compromising the results. The upshot: a dramatic reduction in training time, from 29 hours down to 1 hour (albeit on 256 Tesla P100 GPUs rather than 8)!
The best part being that, when scaled up to 8192, they
observed no generalization issues when transferring across datasets (from ImageNet to COCO) and across tasks (from classification to detection/segmentation)
In short, all the good stuff, and none of the bad stuff.
The key, as it turns out, was twofold:
- Linear scaling: when the minibatch size is multiplied by k, multiply the learning rate by k. (This holds up to a minibatch size of 8192; beyond that, accuracy goes down the tubes.)
- Gradual warmup: the learning rate gets ramped up to its final value over the first 5 epochs. Turns out that this promotes healthy convergence, and prevents the model from lurching all over the place.
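Both rules are easy to sketch in code. A minimal Python version — with the caveat that this ramps at per-epoch granularity for readability (the paper actually ramps per iteration), and the function names are mine, not the paper’s:

```python
def scaled_lr(base_lr, k):
    """Linear scaling rule: when the minibatch size is multiplied by k,
    multiply the learning rate by k."""
    return base_lr * k

def warmup_lr(target_lr, epoch, warmup_epochs=5):
    """Gradual warmup: ramp linearly up to target_lr over the first
    `warmup_epochs` epochs, then hold it there."""
    if epoch >= warmup_epochs:
        return target_lr
    return target_lr * (epoch + 1) / warmup_epochs
```

So with a base learning rate of 0.1 for a 256-image minibatch, going to 8192 images (k = 32) means a target rate of 3.2, reached gradually over the first 5 epochs.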
And that’s pretty much it. 30x improvement FTW. Simple no?
Ok, maybe it’s not quite so simple. It turns out that there are a couple of things that you really need to pay attention to when you’re going down this road.
- Momentum SGD: if you’re using it, you need to apply the momentum correction after changing the learning rate, otherwise… instability.
- Data Shuffling: Use a single random shuffling of the training data (per epoch) that is divided amongst all k workers. Otherwise, well, your workers are all training against different things!
- Weight Decay: Scaling the cross-entropy loss is not equivalent to scaling the learning rate. Be careful here!
- Gradient Aggregation: normalize the per-worker loss by the total minibatch size k·n, not the per-worker size n.
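That last point is worth a tiny sketch. Here plain Python lists stand in for gradient tensors, and each worker is assumed to have *summed* (not averaged) the per-example gradients over its own n examples — an illustrative toy, not the authors’ implementation:

```python
def aggregate_gradients(per_worker_grads, k, n):
    """Combine gradients from k workers, each holding a *sum* of
    per-example gradients over its n examples. Normalize by the total
    minibatch size k*n -- NOT by the per-worker size n."""
    total = [0.0] * len(per_worker_grads[0])
    for grads in per_worker_grads:       # elementwise sum across workers
        for i, g in enumerate(grads):
            total[i] += g
    return [g / (k * n) for g in total]  # divide by k*n, not n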
And yes, communication does become an issue with these larger batch sizes, especially during gradient aggregation. The authors do some fairly neat stuff here using NCCL plus a couple of neat algorithmic tricks (recursive halving and doubling, combined with bucketing) to make the whole process efficient. The code for this is available on GitHub as Gloo. And, of course, given that this is Facebook, everything is based on Caffe2 😀
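The halving-and-doubling trick is easier to see in a toy simulation than in prose. Below is an illustrative sketch — emphatically not Gloo’s actual implementation — that assumes a power-of-two worker count, a buffer length divisible by it, and ignores bucketing and pipelining. It does a reduce-scatter by recursive halving (each step, partners split their shared region and each sums one half), then an allgather by recursive doubling (partners exchange their fully reduced chunks, merging regions until everyone has everything):

```python
def allreduce_halving_doubling(vectors):
    """Simulate a recursive halving-and-doubling allreduce: `vectors`
    holds one gradient buffer per worker; afterwards every worker's
    buffer holds the elementwise sum across all workers."""
    p, n = len(vectors), len(vectors[0])
    assert p & (p - 1) == 0 and n % p == 0
    buf = [list(v) for v in vectors]
    lo = [0] * p    # start of the region each rank is still reducing
    size = [n] * p  # size of that region

    # Phase 1 -- reduce-scatter via recursive halving.
    d = p // 2
    while d >= 1:
        for r in range(p):
            partner = r ^ d
            if r < partner:  # handle each pair once
                half = size[r] // 2
                for i in range(lo[r], lo[r] + half):
                    buf[r][i] += buf[partner][i]      # lower rank keeps lower half
                for i in range(lo[r] + half, lo[r] + size[r]):
                    buf[partner][i] += buf[r][i]      # partner keeps upper half
                lo[partner] = lo[r] + half
                size[r] = size[partner] = half
        d //= 2

    # Phase 2 -- allgather via recursive doubling.
    d = 1
    while d < p:
        for r in range(p):
            partner = r ^ d
            if r < partner:
                for i in range(lo[partner], lo[partner] + size[partner]):
                    buf[r][i] = buf[partner][i]       # pull partner's chunk
                for i in range(lo[r], lo[r] + size[r]):
                    buf[partner][i] = buf[r][i]       # push ours to partner
                lo[r] = lo[partner] = min(lo[r], lo[partner])
                size[r] = size[partner] = size[r] + size[partner]
        d *= 2
    return buf
```

The appeal: each rank sends roughly 2·(p−1)/p vector-lengths of data in total, regardless of how many workers there are, instead of the naive everyone-sends-everything-to-everyone.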
This is definitely good news, and I expect to see this percolate into training everywhere, to the point that it’ll just be par for the course in a wee bit!
¹ “One weird trick for parallelizing convolutional neural networks”, by Alex Krizhevsky
² “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, by Goyal et al.
Source: Deep Learning on Medium