Source: Deep Learning on Medium
There are many ways you can change your code to improve training performance. However, we’re going to talk about performance gains related to your testbed’s configuration (both hardware and software).
In our system, performance improved by 2.5X across two checkpoints 9 months apart.
- We didn’t change the hardware at all.
- We didn’t change the scripts we were running.
- We did upgrade our software stack and hardware drivers.
It’s like we built a car, and we’re still driving the same car, but the roads are better.
Lesson # 1: Know your bottleneck
Every training job is a pipeline of tasks, and a pipeline needs to be balanced: if one stage is slower than the rest, it drags down your overall throughput. If you don’t segment the pipeline and test individual components separately, you won’t know which piece is lagging.
One way to start breaking down the pipeline is to measure how far you are from ideal performance. For example, we ran our script with synthetic data (“fake” data generated directly on the GPU, so there is no input pipeline at all) and throughput came quite close to that ideal. Because the run without any data loading work performed so much better, we knew that data loading must be the bottleneck.
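A minimal sketch of that kind of A/B test (this is an illustration with `time.sleep` stand-ins, not our actual TensorFlow benchmark): run the same training step against a synthetic in-memory data source and against a loader with I/O overhead, then compare steps per second.

```python
import time

def benchmark(step_fn, data_iter, num_steps=50):
    """Measure training throughput (steps/sec) for a given data source."""
    start = time.perf_counter()
    for _ in range(num_steps):
        batch = next(data_iter)
        step_fn(batch)
    elapsed = time.perf_counter() - start
    return num_steps / elapsed

def train_step(batch):
    time.sleep(0.002)  # stand-in for the real forward/backward pass

def synthetic_data():
    # "fake" data generated in memory -- no I/O, no decoding
    while True:
        yield [0.0] * 1024

def disk_data():
    # stand-in for a real input pipeline with loading/decoding overhead
    while True:
        time.sleep(0.01)  # simulated I/O latency
        yield [0.0] * 1024

synthetic_sps = benchmark(train_step, synthetic_data())
real_sps = benchmark(train_step, disk_data())
print(f"synthetic: {synthetic_sps:.1f} steps/s, real: {real_sps:.1f} steps/s")
```

If the synthetic run is far faster, the GPU is being starved: the data pipeline, not the math, is your bottleneck.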
Now that we had identified a problem area, we were able to tune some TensorFlow parameters that affect data processing tasks specifically (like prefetch queue and thread pool settings).
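In current TensorFlow these knobs live on the `tf.data` API; a minimal sketch (the dataset contents and map function here are placeholders for real record parsing):

```python
import tensorflow as tf

# Parallelize per-record work and keep a prefetch buffer filled
# so the accelerator never waits on the input pipeline.
ds = tf.data.Dataset.range(1_000)
ds = ds.map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.batch(32)
ds = ds.prefetch(tf.data.AUTOTUNE)
```

`AUTOTUNE` lets the runtime pick the thread-pool and buffer sizes dynamically, which usually beats hand-tuned constants.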
Lesson # 2: Consume new software
The software stack is improving all the time. Great work is being done by dev teams and the community. Take advantage of that work!
For example, with earlier TensorFlow versions (e.g. 1.4.0), distortions had a significant performance impact. With newer TensorFlow versions (e.g. 1.10.0 and beyond), distortion tasks are near-invisible and hardly impact training throughput.
You can consume these performance improvements for free by upgrading your libraries!
Lesson # 3: Ask more from libraries over time
Things that weren’t possible before may be possible now.
For example —
If you use a larger batch size, you can complete an epoch of training sooner.
So why doesn’t everybody use a larger batch size all the time? Because they may run into memory issues.
Luckily, the software stack is evolving to support larger batch sizes. Back in February 2018, we weren’t even able to run with batch size 256 due to OOMs 🙁
You should try things now that may have failed 6 months ago on older software stacks.
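One cheap way to re-test this periodically is a largest-first probe that backs off when it hits an out-of-memory error. This is a sketch; `fake_step` is a hypothetical stand-in for a real training step against a real memory limit:

```python
def find_max_batch_size(train_step, candidates=(256, 128, 64, 32)):
    """Try batch sizes largest-first, falling back on out-of-memory errors."""
    for bs in candidates:
        try:
            train_step(bs)
            return bs
        except MemoryError:
            continue  # OOM -- try the next smaller size
    raise RuntimeError("no candidate batch size fits in memory")

# Hypothetical step that OOMs above 128, standing in for a real GPU limit.
def fake_step(batch_size):
    if batch_size > 128:
        raise MemoryError("out of memory")

max_bs = find_max_batch_size(fake_step)
print(max_bs)
```

Run the probe again after a stack upgrade: the batch size that OOMed six months ago may fit today.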
Lesson # 4: Stay aware of major developments
For example, the switch from FP32 (slower) to FP16 (faster) is one that many development teams can make, especially if training data collected by sensors or devices is low-precision to begin with.
Lower precision = less data = faster jobs.
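The arithmetic behind that equation is simple: FP16 stores each value in 2 bytes instead of 4, so the same tensors take half the memory and bandwidth. A quick illustration with NumPy (the batch shape is made up):

```python
import numpy as np

batch = np.zeros((256, 224, 224, 3))   # hypothetical image batch
fp32 = batch.astype(np.float32)
fp16 = batch.astype(np.float16)
print(fp32.nbytes // 2**20, "MiB ->", fp16.nbytes // 2**20, "MiB")  # 147 MiB -> 73 MiB
```

In practice, frameworks expose this as “mixed precision” (e.g. `tf.keras.mixed_precision`), which keeps FP32 master copies of the weights for numerical stability while doing the heavy math in FP16.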
Lesson # 5: Consider the entire stack
You may think your system looks like this: [diagram omitted]
But it’s also… [diagram omitted]
And it’s also… [diagram omitted]
→ Every layer is an ingredient to overall training performance.
*Bonus Lesson* (learn from Emily’s mistakes)
If you’re not paying attention to networking, you may accidentally run your training job over the 1 GbE management network path instead of the 100 GbE data network path. (“They didn’t teach me about networking in my DL courses!”)
1/100th the training performance is not very appealing.
Hopefully that mistake (one I’ve made more than once :/) just helps show that the entire end-to-end system can impact performance.
When’s the last time you refreshed your software stack? 3 months ago? 6 months ago?
Remember that there are performance gains to be made with fairly little effort.
Tools like the NVIDIA GPU Cloud (NGC) container registry can be a one-stop shop where data scientists obtain a single Docker image with the latest libraries and tools pre-integrated. It’s a brainless, painless way to stay up to date with the latest libraries and drivers. You should use it.