Scaling DeepSpeech using Mixed Precision and KubeFlow

Source: Deep Learning on Medium


By Sean Naren

Over the past few years at Digital Reasoning we have been developing audio analytics software to be highly effective at processing the noisy, domain-specific voice data that we typically encounter within the trading operations of major banks. Within the Audio Research Team, rapid research cycles drive continual refinements to our audio technology. The faster we can iterate, the better the quality of the solutions we deliver to our customers.

Our audio development pipeline contains various deep learning models trained on large volumes of audio and text datasets, to feed features to our downstream NLP models. A crucial portion of our audio pipeline is our automatic speech recognition (ASR) model.

Once trained, our ASR models can serve as out-of-the-box models for customer deployments, or be fine-tuned to a particular environment or set of speakers. End-to-end deep learning-based ASR models regularly outperform traditional methods, but training involves massive amounts of computation, training data and time. The ability to experiment and iterate quickly to produce the highest quality models is one of our key focuses. This article covers that in more depth, focusing on our performance improvements and showing how they have significantly reduced training times.

We favour utilising open source software and contribute back as much as possible. Many great libraries and projects have been released allowing us to easily improve iteration times and quality. The changes discussed in this article have been contributed to deepspeech.pytorch, an openly available PyTorch implementation of DS2.

End-to-End Speech Recognition

One of the most successful end-to-end architectures is DeepSpeech 2 (DS2), popularised by Baidu as an end-to-end Deep Learning model for ASR that requires no pre-training or alignment between labelled audio and text. DS2 is notoriously expensive to train due to its hybrid convolutional/RNN architecture, which contains 50+ million parameters, not to mention the large quantities of data required. In return, an end-to-end Deep Learning architecture provides a significant benefit in fine-tuning/transfer learning, and overall robustness across various environments.

Since the release of the original DS2 paper in late 2015, there have been many new advancements: for instance, improvements in GPU throughput and utilisation have resulted in dramatically faster training times. This means faster iteration times and an increase in the number of experiments that can be evaluated/tested, in turn leading to faster research.

Mixed Precision

NVIDIA’s recent breakthroughs with the Volta architecture offer the potential for up to an 8x speed-up in model training using mixed precision, as opposed to FP32. Mixed precision allows us to store our model’s weights in FP16 (thus reducing memory consumption), whilst handling gradient weight updates in FP32, which has been shown to improve stability during training¹. Mixed precision for DeepSpeech was introduced by Baidu in a blog post released in 2017, and since then engineering improvements have made mixed precision more accessible through PyTorch and available cloud hardware.
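To make the role of the FP32 weight updates concrete, here is a minimal sketch in plain Python. It is not the Apex or deepspeech.pytorch implementation: it uses the standard library's half-precision `struct` format to emulate rounding, and the helper names, the weight of 1.0 and the update of 1e-4 are all hypothetical values chosen for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def train_fp16_only(weight: float, update: float, steps: int) -> float:
    """Apply every update directly to an FP16 weight."""
    w = to_fp16(weight)
    for _ in range(steps):
        w = to_fp16(w + to_fp16(update))
    return w


def train_with_fp32_master(weight: float, update: float, steps: int) -> float:
    """Accumulate updates in an FP32 master copy, then store the result in FP16."""
    master = to_fp32(weight)
    for _ in range(steps):
        master = to_fp32(master + update)
    return to_fp16(master)


# Illustrative values only: near 1.0, adjacent FP16 values are 2**-10 apart,
# so a single update of 1e-4 is rounded away when applied in FP16.
fp16_only = train_fp16_only(1.0, 1e-4, steps=10)
mixed = train_with_fp32_master(1.0, 1e-4, steps=10)
print(fp16_only, mixed)
```

In this toy setting the FP16-only weight never moves, while the FP32 master copy accumulates the ten small updates and the stored FP16 weight advances by one representable step.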

Matrix multiplications (GEMMs) take up a significant portion of the computation time when training a neural network, so parallelising these calculations can result in large speed-ups. Traditionally, these calculations run in single precision (FP32), which is the lowest level of numerical precision we use in our computations. We can compute faster at lower levels of precision, such as FP16 or INT8; however, the loss of precision can make training unstable, resulting in exploding gradients and model divergence.

We’ve found numerous benefits to using mixed precision within our training pipeline. One key benefit is better utilisation of the Tensor Cores found on NVIDIA V100 cards. Because Tensor Cores specialise in parallelising GEMMs and convolutions in mixed precision, they significantly reduce our computation time when training our model. DS2 also has high GPU memory consumption, so the fact that mixed precision halves memory consumption means we can use larger batch sizes and utilise the available GPU memory more efficiently.

The main caveat of using mixed precision is the reduced precision in our calculations. Batch Normalization is a particular case where the loss of precision can have unfortunate side effects. FP16 calculations of the mean and variance can lead to instability, and Batch Normalization is used heavily within the DeepSpeech 2 architecture, so it is vital that the computation of these layers is kept in FP32. To facilitate this, NVIDIA have provided Apex for PyTorch, which supports Automatic Mixed Precision (AMP) to handle these cases and to ensure that the gradients do not cause instability in FP16. More information around the caveats of mixed precision training can be read here.
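As a rough illustration of why the mean/variance statistics are sensitive, the sketch below computes a variance with every intermediate result rounded to FP16 or to FP32. It is a plain Python emulation, not how Apex or cuDNN compute Batch Normalization; the `variance` helper and the activation values are hypothetical and chosen so that the squared deviations exceed the largest finite FP16 value (65504).

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round to the nearest half-precision value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def to_fp32(x: float) -> float:
    """Round to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def variance(values, cast):
    """Population variance with every intermediate result rounded by `cast`."""
    n = len(values)
    total = 0.0
    for v in values:
        total = cast(total + cast(v))
    mean = cast(total / n)
    squared_total = 0.0
    for v in values:
        diff = cast(cast(v) - mean)
        squared_total = cast(squared_total + cast(diff * diff))
    return cast(squared_total / n)


# Hypothetical activations: 300**2 = 90000 is larger than the FP16 maximum.
activations = [-300.0, 300.0] * 4

var_fp16 = variance(activations, to_fp16)
var_fp32 = variance(activations, to_fp32)
print(var_fp16, var_fp32)
```

With these inputs the FP16 variance overflows to infinity while the FP32 variance is exact, which is the kind of failure that keeping normalization layers in FP32 is meant to avoid.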

The baseline is a single NVIDIA K80 GPU running our custom end-to-end architecture. Our benchmarking script has been contributed to deepspeech.pytorch.

As mentioned above, the reduction in memory consumption has allowed us to fit larger batch sizes into memory, hence improving the memory saturation of our GPUs. Increasing our batch size does change convergence behaviour; however, our production runs have indicated that with correctly tuned hyper-parameters, convergence is similar to that of smaller batch sizes.

A technical point to note: When scaling our production system, dynamic loss scaling was crucial to ensure that our gradients do not explode during training when using mixed precision. This became more apparent when using larger mini-batches and more GPUs. Appropriate hyper-parameter tuning was also required in order to ensure hyper-parameters scale up from smaller experiments. We’ve also noted a bug when using the Warp-CTC loss function at scale, and updated an issue to track this.
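The idea behind dynamic loss scaling can be sketched in a few lines of plain Python. This is an assumption-laden illustration, not the Apex implementation: `DynamicLossScaler`, `training_step`, the initial scale of 2**15, the growth interval of 3 steps and the gradient values are all hypothetical. The loss is multiplied by a scale so that small gradients survive FP16 rounding; if the scaled gradients overflow, the step is skipped and the scale is halved, and after a run of clean steps the scale is doubled again.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round to the nearest half-precision value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


class DynamicLossScaler:
    """Hypothetical helper: halve the scale on overflow, grow it after clean steps."""

    def __init__(self, init_scale: float = 2.0 ** 15, growth_interval: int = 3):
        self.scale = init_scale
        self.growth_interval = growth_interval  # illustrative, not from the article
        self.clean_steps = 0

    def unscale(self, scaled_grads):
        """Return the true gradients, or None if any scaled value is inf/nan."""
        if not all(math.isfinite(g) for g in scaled_grads):
            return None
        return [g / self.scale for g in scaled_grads]

    def update(self, overflowed: bool) -> None:
        if overflowed:
            self.scale /= 2.0
            self.clean_steps = 0
            return
        self.clean_steps += 1
        if self.clean_steps == self.growth_interval:
            self.scale *= 2.0
            self.clean_steps = 0


def training_step(scaler: DynamicLossScaler, true_grads):
    """Simulate one step in which scaled gradients are produced in FP16."""
    scaled = [to_fp16(g * scaler.scale) for g in true_grads]
    grads = scaler.unscale(scaled)
    scaler.update(overflowed=grads is None)
    return grads  # None means the optimizer step is skipped


scaler = DynamicLossScaler()
# One large and one tiny gradient: the tiny one would round to zero in unscaled FP16.
history = [training_step(scaler, [4.0, 1e-8]) for _ in range(3)]
print(history, scaler.scale)
```

In this toy run the first two steps overflow and are skipped while the scale backs off, and the third step recovers both the large gradient and the tiny one that unscaled FP16 would have flushed to zero.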

Multi-machine Training using KubeFlow

Kubernetes is a flexible and powerful container orchestration system that helps tremendously when deploying and scaling containerised solutions, including Machine Learning workloads. Furthermore, the addition of GPUs as a Kubernetes resource has opened the door for writing scalable orchestration applications with specific hardware (in our case NVIDIA GPUs). More specifically, we can use Docker images of our training environment to train a model on multiple GPU-enabled nodes within a cluster.

To facilitate this process, we employ KubeFlow², which vastly simplifies and streamlines the deployment and scaling of machine learning models using Kubernetes. KubeFlow is a very lightweight layer on top of Kubernetes, which makes it easy to use and modify. With KubeFlow we’ve been able to abstract the orchestration of our training nodes, allowing us to scale onto multiple machines seamlessly without having to worry about the underlying distribution. Additionally, KubeFlow 0.4 added PyTorch 1.0 beta support, which made it seamless to take our PyTorch-based training containers into our Google Cloud Kubernetes cluster. We use auto-provisioning for GPU nodes to scale depending on demand³. Instructions on how to set up KubeFlow for deepspeech.pytorch are here.
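For readers unfamiliar with how a multi-node PyTorch job is described to KubeFlow, the sketch below builds a PyTorchJob manifest as a Python dictionary and serialises it to JSON, which `kubectl` accepts alongside YAML. It is not the configuration used in the article: the job name, image name and replica counts are hypothetical, and the `apiVersion` string is an assumption that depends on the installed KubeFlow release.

```python
import json


def pytorch_job(name, image, workers, gpus_per_node, api_version="kubeflow.org/v1"):
    """Build a PyTorchJob manifest with one master and `workers` worker replicas."""

    def replica(count):
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        {
                            # The PyTorch operator looks for a container named "pytorch".
                            "name": "pytorch",
                            "image": image,
                            # GPUs are requested as an extended Kubernetes resource.
                            "resources": {"limits": {"nvidia.com/gpu": gpus_per_node}},
                        }
                    ]
                }
            },
        }

    return {
        "apiVersion": api_version,  # assumption: check the version your cluster serves
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


# Hypothetical values: 1 master + 3 workers with 8 GPUs each.
manifest = pytorch_job(
    name="deepspeech-train",
    image="example.com/deepspeech-pytorch:latest",
    workers=3,
    gpus_per_node=8,
)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

The resulting JSON could be saved to a file and submitted with `kubectl apply -f`, after which the operator is responsible for starting the replicas and wiring up the distributed environment.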

Efficient Scaling

Thanks to NVIDIA’s work in scaling model training, we’ve seen near linear scaling on multiple GPUs and across multiple machines. NVLINK offers 10x the bandwidth of PCIe when communicating between Volta GPUs by providing a direct GPU-to-GPU communication fabric. NVIDIA also provides NCCL, a multi-node package that optimizes communication across multiple nodes. Combining NVLINK and NCCL allows us to reach near linear scaling across our local node GPUs as well as between nodes in our cluster. Both have been integrated transparently into the multi-GPU wrapper found in the NVIDIA Apex package, so they are easy to add to our existing training pipeline. Additionally, Google Cloud offers NVLINK between their V100 GPUs, giving us scaling benefits and lower latency overhead.

Measurements have been normalized to 1 NVIDIA V100 card using average epoch times across LibriSpeech.

We see near linear scaling going from 1 to 64 GPUs, showing good utilisation of all available GPUs. This is a massive improvement in our overall training times, potentially taking them from weeks to days.
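"Near linear scaling" can be quantified as scaling efficiency: the speed-up over a single GPU divided by the number of GPUs, where 1.0 is perfectly linear. The sketch below shows the arithmetic using average epoch times; the timings are made-up placeholders for illustration, not measurements from this article.

```python
def speedup(single_gpu_seconds: float, multi_gpu_seconds: float) -> float:
    """Speed-up relative to the single-GPU average epoch time."""
    return single_gpu_seconds / multi_gpu_seconds


def scaling_efficiency(single_gpu_seconds: float, multi_gpu_seconds: float, gpus: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    return speedup(single_gpu_seconds, multi_gpu_seconds) / gpus


# Placeholder average epoch times in seconds, keyed by GPU count (not real data).
epoch_seconds = {1: 6400.0, 8: 1000.0, 64: 125.0}

baseline = epoch_seconds[1]
report = {
    gpus: (speedup(baseline, seconds), scaling_efficiency(baseline, seconds, gpus))
    for gpus, seconds in epoch_seconds.items()
}
for gpus, (s, e) in report.items():
    print(f"{gpus:>2} GPUs: speed-up {s:.1f}x, efficiency {e:.0%}")
```

Tracking efficiency rather than raw speed-up makes it easier to spot where communication overhead starts to dominate as more nodes are added.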

Fast Research

Speedup over baseline when training on LibriSpeech with our improvements and scaling

Models trained across thousands of hours of audio now take days to train, thanks to advances in scalability and GPU utilisation. This allows us to iterate on models at scale and obtain results faster in our active research, so we can experiment at production scale and beyond. We hope that by contributing to deepspeech.pytorch we help others iterate faster and aid in improving the library.

Join our team

We here at Digital Reasoning’s Audio Research Team would like to thank NVIDIA and PyTorch engineers for their contributions. If you’re interested in our research and would like to work with us, go here for more information.

Many of the changes discussed have been contributed to deepspeech.pytorch.