Intel MLSL Makes Distributed Training with MXNet Faster

Source: Deep Learning on Medium

Intel MLSL Makes Distributed Training with MXNet Faster

Written by Lin Yuan @Amazon; Wuxun Zhang, Smorkalov Mikhail, Guokai Ma, Patric Zhao, Jason Ye @Intel

Apache MXNet distributed training using Horovod with Intel MLSL achieves 98% scaling efficiency while OpenMPI reaches only 45.5% scaling efficiency over 25GbE network.


The exponential growth in the use of large deep neural networks (DNNs) has increased the need for training these networks quickly. This can only be achieved through scalable and efficient distributed training since a single node cannot satisfy the computation, memory, and I/O requirements of today’s state-of-the-art deep neural networks.

Apache MXNet is an open-source deep learning framework used to build, train and deploy deep neural networks. It supports training on a single node as well as distributed training across a cluster of nodes. There are two ways to launch distributed training in MXNet: (1) parameter-server based approach implemented using ps-lite and (2) collective communication based approach implemented using Horovod, an open-source distributed training framework created by Uber engineers. Horovod implements collective communication methods such as broadcast, allreduce, allgather, etc. through the Message Passing Interface (MPI) and efficient communication libraries such as NCCL for Nvidia GPUs and OpenMPI for CPUs. Earlier work has shown that distributed training using Horovod can achieve better scalability for large dense neural networks such as ResNet-50. However, using an out-of-the-box Horovod implementation for training on CPUs, scaling efficiency may decrease rapidly as the number of nodes is scaled to a few dozen and beyond.

The Intel® Machine Learning Scaling Library (Intel MLSL) [arXiv, GitHub] is a communication middleware that specifically targets the distributed training using CPUs. It is designed to provide an abstraction layer over communication patterns commonly used in deep learning domains and enables the portable performance of communication operations across various deep learning frameworks.

Apache MXNet is the first DNN framework to leverage Intel MLSL to make distributed training more flexible and efficient.

Intel MLSL

To improve scalability for multi-node training of deep learning models, the Intel MLSL has several key features:

  • Choosing the correct partitioning strategy

There are two common parallelization techniques for multi-node training: data parallelism and model parallelism. In order to obtain optimal parallel performance, Intel MLSL adopts a novel partitioning strategy, called hybrid parallelism, and introduces the concept of node groups. Model parallelism is used for nodes within the same group, while data parallelism is used across groups.

  • Overlapping communication and computation

Besides providing an efficient implementation of communication patterns, Intel MLSL runtime enables additional optimizations such as using asynchronous progress to maximize compute-communication overlap and dedicating one or more CPU cores to drive the communication in an optimal manner.

Intel MLSL also exposes controls over thread pinning to ensure minimal interference between computing and communication threads. Even though the background thread of Horovod wakes up in only 5 ms by default, we observed that this may still negatively impact overall compute efficiency if the background thread shares the CPU core with a compute thread. Therefore, the MLSL-based backend of Horovod allows users to explicitly pin Horovod’s background thread and exclude them from CPU mask passed to the application.

Figure 1: Intel MLSL uses dedicated communication threads for communication and minimizes interference over computation
  • Reducing communication volume

When the network bandwidth is limited, scaling efficiency can be further improved by reducing the volume of communicated data. Using lower precision data types during training (e.g. float16) reduces the amount of communication required and has had a positive impact on scaling. Intel MLSL provides support for parameter quantization for this reason.


When training a ResNet50 model using Apache MXNet with Intel MLSL, we achieved an overall throughput speedup of 31.4x (theoretically 32x) for 16 nodes of c5.18xlarge instances each equipped with Intel® Xeon processor on Amazon Elastic Compute Cloud. After running 90 epochs, the Top-1 accuracy on the ImageNet dataset can converge to 76.22%.

We also demonstrated the throughput speedup and scaling efficiency of distributed MXNet training using Horovod with Open-MPI and Horovod with Intel MLSL. In this experiment, we chose to run one process on each socket to fully utilize hardware resources. All results are shown in Table 1 and Figure 2. When the node count is less than two, the actual throughput speedup is close to the theoretical speedup regardless of whether Open MPI or Intel MLSL is used. At this time, the scaling efficiency corresponding to Open MPI and Intel MLSL is 87.53% and 98.98%, respectively. But as the number of nodes scales up to 16, Intel MLSL can maintain a similar throughput speedup to the theoretical maximum, while Open MPI drops dramatically. At this point, using Open MPI can only reach 45.5% scaling efficiency, which is far lower than the 98.2% obtained by Intel MLSL. In other words, large-scale MXNet distributed training can be significantly boosted by Intel MLSL.

Table 1: The overall training throughput speedup and scaling efficiency of distributed ResNet50 training in the case of different nodes used.
Figure 2: Comparison of training throughput speedup between Open MPI and Intel MLSL. The y-axis represents the overall throughput speedup. The x-axis represents the number of nodes.


We are excited to introduce a new software library, Intel MLSL, for improving the overall scaling performance of MXNet distributed training using Horovod. With optimizations from Intel MLSL enabled, we can achieve more than 98% of scaling efficiency for training a ResNet50 model using MXNet+Horovod on 16 nodes of c5.18xlarge.

Appendix: Steps to Reproduce

This part will introduce the detailed steps to reproduce the results above. First set up your training cluster locally or on a cloud. In our experiment, we used 16 c5.18xlarge instances on the AWS cloud.

Step 1: Install Intel MLSL

You can install MLSL by downloading binary releases (see release page). The README will help you install MLSL using RPM package manager or tar file step by step.

Also, it is very easy to build MLSL from the source. Once you have cloned this GitHub repository into your workspace, run the below commands:

make all[MLSL_INSTALL_PATH=/path] make install

By default, MLSL_INSTALL_PATH will be set to $(pwd)/_install.

Step 2: Prepare MPI Library

Intel’s MPI library is a multi-fabric message-passing library that can provide higher performance on clusters based on Intel processors. It is also required by Intel MLSL to get better scaling performance. Now, you can easily get the Intel MPI library from either tar file or apt repo.

Open MPI library is an open-source message passing interface implementation and also provide binary releases and source code. You can quickly install it by referring to this README.

Step 3: Prepare MXNet and Horovod

Once you have prepared the above libraries, it is time to install Horovod. Firstly, source to start using Intel MLSL. Two modes are available: process (default) and thread. Use the thread mode if you are going to set more than zero MLSL servers via MLSL_NUM_SERVERS environment variable.

source <installdir_MLSL>/intel64/bin/ threadsource <installdir_MPI>/intel64/bin/ release_mt

Install MXNet and Horovod via pip.

pip install mxnet-mkl==1.5.0pip install horovod==0.18.0

Now, you should have Horovod installed on your instances. You can try the below command to check if Horovod is properly installed.

>>> import horovod.mxnet as hvd

Step 4: Start training

The training script we used is available here. Hyperparameters are listed below. It should be noted that we should linearly scale the actual learning rate by the number of processes (i.e. the actual learning rate equal to the product of base learning rate and the number of processes).

  • base learning rate: 0.1
  • batch size: 128
  • learning rate scheduler: cosine
  • momentum: 0.9
  • weight decay: 0.0001

For Intel MLSL, you can run below command to launch multiple processes:

mpirun -n 32 \
-ppn 2 \
-f ${hostfile} \
-genv I_MPI_PIN_DOMAIN auto:compact \

For OpenMPI, you can run the command below:

mpirun -np 32 \
--hostfile ${hostfile} \
--bind-to socket \
--npersocket 1 \
-mca pml ob1 \
-mca btl ^openib \

Notices and Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit

Performance results are based on testing by AWS and Intel as of 4th Nov.2019 and may not reflect all publicly available security updates. No product or component can be absolutely secure.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No product or component can be absolutely secure. Check with your system manufacturer or retailer or learn more at

Intel does not control or audit third-party data. You should review this content, consult other sources, and confirm whether referenced data are accurate.

Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Other names and brands may be claimed as the property of others. © Intel Corporation