Source: Deep Learning on Medium
Sane Trials of Machine Learning on a Virtual Shoestring (3/3)
Usability: effectiveness, efficiency, and satisfaction.
TL;DR, including what you may not see in this post:
- Know (what are) self and others (part-1);
- Control PRNG states, not just their seed numbers (part-2);
- Uncover side effects, e.g.:
  - Mixed precision: mind the gap of the dynamic loss scale;
  - cuDNN: avoid using its built-in dropout;
  - Non-determinism/instability of multi-core/process/thread usage.
This series of posts shares my thoughts on how to get deterministic outcomes from machine learning experiments at an affordable cost. As usual, most of the content is only my murmuring. Feel free to skip it and check the Jupyter notebook for this series directly. It tries Universal Language Model Fine-tuning for Text Classification with a more recent version of the fastai library. Another friendly reminder: technical jargon is just a set of fancy names for references.
If we are already on a shoestring, what properties of a sane ML trial can ensure that we live happily ever after?
For me, knowing what is yet unknown is a start. My inner security comes from embracing the known, intriguing unknowns (science!) without fear of the unknown, undesired unknowns (who-must-not-be-named?). There’s gotta be a way to avoid the latter. When doing trial and error on a virtual shoestring, however, leaving no stone unturned is bound to be labor-intensive. Therefore, it is only natural to invent smart algorithms with lower time/space complexity. As with all magic, though, always remember to negotiate the price in advance.
Not unlike the residents of the Enchanted Forest, machine learning practitioners can’t afford to miss any fine print, a.k.a. side effects, in an agreement.
One of my favorite tricks is gradient checkpointing, because it cares about which nodes are the same so that it can decide when (not) to recompute them. One of Hugging Face’s articles explains it well, along with related tactics for efficiency. But what I want to point out is about checkpoints in a more general sense, at least in software engineering. A checkpoint trades processing time and programming effort for memory space and fault tolerance. In other words, it checks and balances something critical, and for PyTorch that usually means the chain-rule calculation of gradients.
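To make the trade concrete, here is a toy sketch in plain Python rather than PyTorch. The chain of sin layers, the function names grad_full and grad_checkpointed, and the every parameter are all made up for illustration; a real framework does the same bookkeeping on tensors and graphs. The first function keeps every layer input, the second keeps only a few and recomputes the rest during the backward pass.

```python
import math

def grad_full(x, n):
    """d/dx of sin applied n times, keeping every layer input (O(n) memory)."""
    inputs = []
    for _ in range(n):
        inputs.append(x)
        x = math.sin(x)
    grad = 1.0
    for a in reversed(inputs):
        grad *= math.cos(a)  # chain rule, last layer first
    return grad

def grad_checkpointed(x, n, every):
    """Same gradient, but only every `every`-th layer input is kept."""
    checkpoints = {}
    for i in range(n):
        if i % every == 0:
            checkpoints[i] = x  # the only values that stay in memory
        x = math.sin(x)
    grad, recomputed = 1.0, 0
    # Walk the segments backwards, rebuilding each one from its checkpoint.
    for start in sorted(checkpoints, reverse=True):
        a, segment = checkpoints[start], []
        for _ in range(start, min(start + every, n)):
            segment.append(a)
            a = math.sin(a)
            recomputed += 1  # the price: extra forward work
        for a in reversed(segment):
            grad *= math.cos(a)
    return grad, len(checkpoints), recomputed

full = grad_full(0.5, 10)
ckpt, stored, recomputed = grad_checkpointed(0.5, 10, every=3)
print(full, ckpt, stored, recomputed)
```

The two gradients agree only because the recomputed segment reproduces exactly what the first forward pass saw. In PyTorch, torch.utils.checkpoint.checkpoint has a preserve_rng_state argument for the same reason: a recomputed segment that uses randomness, such as dropout, matches the original pass only if the PRNG state is restored as well.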
So we have a chain of responsible nodes that are obvious subjects for checks and balances. Yet sometimes we might sign up for more magic to get a better-looking balance sheet. Since gradient checkpointing can slow things down, why not use mixed precision to save time and space at once? Here’s the fine print: if you want to apply dynamic loss scaling, own the side effect. Take a look at fastai’s implementation of the MixedPrecision class, and you may find relevant snippets like:
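# initial scale: large when dynamic, otherwise a fixed 512
self.loss_scale = ifnone(loss_scale, 2**16 if dynamic else 512)
# scale the loss up
ret_loss = last_loss * self.loss_scale
# halve the scale when the scaled gradients overflow
if self.dynamic and grad_overflow(self.model_params) and self.loss_scale > 1:
    self.loss_scale /= 2
# double it again after enough steps without a skip, up to max_scale
if self.noskip >= self.max_noskip and self.loss_scale < self.max_scale:
    self.loss_scale *= 2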
It exposes none of these changes to the outside world. If not everything goes as planned, we will lose track of self.loss_scale. Once we realize that, the price becomes the work of keeping track of it: yet another coding trade-off between time and effort.
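One way to pay that price is to record every change of the scale. The sketch below is plain Python and not fastai code: TrackedLossScaler, its update method, and the history list are hypothetical names, the default values for max_noskip and max_scale are assumptions, and so is resetting noskip after each change. It only mirrors the halve-on-overflow and double-after-quiet-steps rules from the snippets above.

```python
class TrackedLossScaler:
    """Hypothetical stand-in that mirrors the scaling rules and logs each change."""

    def __init__(self, loss_scale=None, dynamic=True,
                 max_noskip=1000, max_scale=2**24):  # defaults are assumptions
        self.dynamic = dynamic
        if loss_scale is None:
            loss_scale = 2**16 if dynamic else 512
        self.loss_scale = loss_scale
        self.max_noskip, self.max_scale = max_noskip, max_scale
        self.noskip = 0
        self.step = 0
        self.history = [(0, self.loss_scale)]  # (step, scale) pairs

    def update(self, overflow):
        """Apply one step's overflow flag and record the scale if it moved."""
        self.step += 1
        if self.dynamic and overflow and self.loss_scale > 1:
            self.loss_scale /= 2
            self.noskip = 0  # assumption: the counter restarts after a skip
        elif self.dynamic:
            self.noskip += 1
            if self.noskip >= self.max_noskip and self.loss_scale < self.max_scale:
                self.loss_scale *= 2
                self.noskip = 0
        if self.loss_scale != self.history[-1][1]:
            self.history.append((self.step, self.loss_scale))
        return self.loss_scale

scaler = TrackedLossScaler(max_noskip=2)
for flag in [True, False, False, False, True]:
    scaler.update(flag)
print(scaler.history)
```

Saving such a history next to the model weights means a resumed or repeated run can start from the same scale instead of silently falling back to the initial one.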
Speaking of coding effort, using cuDNN APIs seems a no-brainer. Unfortunately, using them blindly can cause a major setback if you want a sane trial. One can easily find many testimonies against cuDNN’s built-in dropout (especially when it is used with bidirectional models) on Stack Overflow and GitHub. “Don’t use it,” says the jury.
But why? Because multi-something-like-core implementations are again a coding burden. Although cuDNN APIs provide deterministic counterparts of typical operations, there are too many switches to flip consistently. To the best of my knowledge, TensorFlow does not care about them. While PyTorch supports a welcome cudnn.deterministic flag, it has no effect on the cuDNN built-in dropout either.
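A minimal sketch of what following the jury's advice could look like in PyTorch: stack two single-layer LSTMs and put an ordinary nn.Dropout between them, instead of passing dropout=... to a multi-layer nn.LSTM. The class name StackedLSTM, the helper run, and all sizes are made up for illustration, and the check below runs on CPU only, so it demonstrates the wiring rather than proving anything about GPU behavior.

```python
import torch
from torch import nn

# The switches mentioned above; both are plain attributes in PyTorch.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

class StackedLSTM(nn.Module):
    """Two LSTM layers with explicit dropout between them (hypothetical)."""

    def __init__(self, input_size, hidden_size, p):
        super().__init__()
        self.first = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.drop = nn.Dropout(p)  # dropout applied outside the RNN kernels
        self.second = nn.LSTM(hidden_size, hidden_size, batch_first=True)

    def forward(self, x):
        out, _ = self.first(x)
        out, _ = self.second(self.drop(out))
        return out

def run(seed):
    """Build the model and push one random batch through it in train mode."""
    torch.manual_seed(seed)
    model = StackedLSTM(input_size=4, hidden_size=8, p=0.5)
    model.train()  # keep dropout active
    x = torch.randn(2, 5, 4)
    return model(x)

a, b = run(0), run(0)
print(torch.equal(a, b))
```

Because the dropout mask here is drawn from PyTorch's own generator, it follows torch.manual_seed like any other random draw in the run.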
Part of the concern about identity lies in the definition of effectiveness; more specifically, in what the expected outcome is. When only one core/process/thread is available, life is simple. Living in a world full of the magic of parallel/distributed computing, we have to be careful with shared odds and ends. For example, we want a consistent order of shuffled instances from a multi-process dataset reader, but how random numbers behave across processes can surprise us.
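The surprise can be imitated without spawning any process. In the sketch below the workers are simulated in one process, and the order list stands in for whatever scheduling the operating system happens to choose. The function names, the shard layout, and the way a per-worker seed is derived from a string are all assumptions made for illustration.

```python
import random

def shuffled(shard, rng):
    """Return a shuffled copy of one worker's shard."""
    items = list(shard)
    rng.shuffle(items)
    return items

def run_shared(shards, order, seed):
    """All workers draw from one generator, so results depend on who goes first."""
    rng = random.Random(seed)
    return {w: shuffled(shards[w], rng) for w in order}

def run_per_worker(shards, order, seed):
    """Each worker owns a generator derived from the base seed and its id."""
    return {w: shuffled(shards[w], random.Random(f"{seed}-{w}")) for w in order}

shards = {0: range(8), 1: range(8, 16), 2: range(16, 24)}
schedule_a, schedule_b = [0, 1, 2], [2, 0, 1]

# Same seed, different scheduling: compare the two strategies.
print(run_shared(shards, schedule_a, 42) == run_shared(shards, schedule_b, 42))
print(run_per_worker(shards, schedule_a, 42) == run_per_worker(shards, schedule_b, 42))
```

With a generator per worker, each shard's order depends only on the base seed and the worker id, not on which worker happened to run first. A real multi-process reader needs the same property from whatever worker id the framework hands out.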
Usable v. Useful
A classic paper in human-computer interaction argued, “What is beautiful is usable.” Please note that it deliberately said “usable” instead of “useful.” Machine learning and deep learning algorithms often come with pretty diagrams and elegant formulae. Many of them are indeed usable. Yet usability requires effectiveness, efficiency, and satisfaction. Or, at the very least for one’s sanity, the key question IMAO is: what makes one stir of a big pile of linear algebra the same as another?