Source: Deep Learning on Medium

**2 — Mixed Precision Training**

Larger models usually require more compute and memory resources to train. These requirements can be lowered by using reduced precision representation and arithmetic.

Performance (speed) of any program, including neural network training and inference, is limited by one of three factors: arithmetic bandwidth, memory bandwidth, or latency. Reduced precision addresses two of these limiters. Memory bandwidth pressure is lowered by using fewer bits to store the same number of values. Arithmetic time can also be lowered on processors that offer higher throughput for reduced precision math. For example, half-precision math throughput in recent GPUs is 2× to 8× higher than for single-precision. In addition to speed improvements, reduced precision formats also reduce the amount of memory required for training.

Modern deep learning training systems use a single-precision (FP32) format. In their paper “Mixed Precision Training,” researchers from NVIDIA and Baidu addressed training with reduced precision while maintaining model accuracy.

Specifically, they trained various neural networks using the IEEE half-precision format (FP16). Since FP16 format has a narrower dynamic range than FP32, they introduced three techniques to prevent model accuracy loss: maintaining a master copy of weights in FP32, loss-scaling that minimizes gradient values becoming zeros, and FP16 arithmetic with accumulation in FP32.

Using these techniques, they demonstrated that a wide variety of network architectures and applications can be trained to match the accuracy of FP32 training. Experimental results include convolutional and recurrent network architectures, trained for classification, regression, and generative tasks.

Applications include image classification, image generation, object detection, language modeling, machine translation, and speech recognition. The proposed methodology requires no changes to models or training hyperparameters.

**3 — Model Distillation**

Model distillation refers to the idea of model compression by teaching a smaller network exactly what to do, step-by-step, using a bigger, already-trained network. The ‘soft labels’ refer to the output feature maps by the bigger network after every convolution layer. The smaller network is then trained to learn the exact behavior of the bigger network by trying to replicate its outputs at every level (not just the final loss).

The method was first proposed by Bucila et al., 2006 and generalized by Hinton et al., 2015. In distillation, knowledge is transferred from the teacher model to the student by minimizing a loss function in which the target is the distribution of class probabilities predicted by the teacher model. That is — the output of a softmax function on the teacher model’s logits.

So how do teacher-student networks exactly work?

- The highly-complex teacher network is first trained separately using the complete dataset. This step requires high computational performance and thus can only be done offline (on high-performing GPUs).
- While designing a student network, correspondence needs to be established between intermediate outputs of the student network and the teacher network. This correspondence can involve directly passing the output of a layer in the teacher network to the student network, or performing some data augmentation before passing it to the student network.
- Next, the data are forward-passed through the teacher network to get all intermediate outputs, and then data augmentation (if any) is applied to the same.
- Finally, the outputs from the teacher network are back-propagated through the student network so that the student network can learn to replicate the behavior of the teacher network.