Original article was published by MONAI Medical Open Network for AI on Deep Learning on Medium

MONAI v0.3 brings GPU acceleration through Auto Mixed Precision (AMP), Distributed Data Parallelism (DDP), and new network architectures

MONAI is a freely available, community-backed, PyTorch-based open-source framework for deep learning in healthcare imaging. It provides domain-optimized foundational capabilities for developing healthcare imaging training workflows in a native PyTorch paradigm.

In collaboration with the MICCAI Educational Initiative, MONAI hosted its first Bootcamp from September 30th to October 2nd. The event included training modules, architectural deep dives, and an open challenge on the last day. Some of the new v0.3 features are used in the training material, which is available on the Project MONAI GitHub. Look for an upcoming blog post with more information and links to videos for each Bootcamp section.

The v0.3 release focuses on bringing GPU acceleration through Auto Mixed Precision (AMP) and Distributed Data Parallelism (DDP). You’ll also find: new network architectures like AHNet, additional loss functions and evaluation metrics, an IO Factory for various medical image formats, and an update to make AWS Open Data its primary repository for public datasets. We also showcase a Jupyter Notebook that uses AMP, along with other MONAI features, and show you how to get up to a 12x speed-up in certain training use cases.

What’s new in v0.3

The overall architecture and modules are shown in the following figure:

GPU acceleration

NVIDIA GPUs are widely used in deep learning training and inference, and CUDA has demonstrated its ability to accelerate these workloads. Several popular techniques have been created to take advantage of GPU features, such as Auto Mixed Precision (AMP) and Distributed Data Parallelism (DDP). MONAI has integrated these features to accelerate common training workflows and provides rich examples that demonstrate the API usage.

Auto mixed precision (AMP)

In 2017, NVIDIA researchers developed a methodology for mixed-precision training, which combines single-precision (FP32) and half-precision (FP16) formats when training a network. It achieves faster training with the same accuracy as FP32 training while using the same hyperparameters.
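One reason the recipe mixes formats instead of using FP16 everywhere is FP16's narrow range: very small gradient values round to zero. The sketch below illustrates this with the standard library only, along with loss scaling, the usual remedy (it is not spelled out above, and it is what `torch.cuda.amp.GradScaler` automates). The gradient value and the scale factor are made up for illustration.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8        # a tiny gradient value, chosen for illustration
scale = 65536.0    # hypothetical loss scale (a power of two)

# Stored directly in FP16, the gradient underflows to zero.
naive = to_fp16(grad)

# Scaling before the FP16 round-trip keeps the value representable;
# dividing by the same factor afterwards (in higher precision) recovers it.
recovered = to_fp16(grad * scale) / scale

print(naive, recovered)
```

In a real training loop, the scale is applied to the loss before the backward pass and removed from the gradients before the optimizer step.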

As of the PyTorch 1.6 release, developers at NVIDIA and Facebook integrated the mixed-precision functionality into PyTorch core as the AMP package, torch.cuda.amp. MONAI exposes this feature in its workflow implementations through the amp parameter. To enable or disable AMP in MONAI workflows, set amp=True or amp=False in SupervisedTrainer or SupervisedEvaluator during training or evaluation.

More details are available at the AMP Training Tutorial.

Fast Training

To demonstrate AMP’s power combined with other MONAI features, we applied AMP, CacheDataset, and the Novograd optimizer together to achieve faster training (convergence). We obtained an approximately 12x speedup for spleen CT segmentation compared with the native PyTorch implementation, while converging at a validation mean Dice score of 0.93. At the same time, AMP reduced the GPU memory footprint by 30%. Benchmark for reference:

You can find the notebook with the full experiment at the Fast Training Tutorial.

Distributed Data Parallel (DDP)

DistributedDataParallel implements Data Parallelism and allows PyTorch to connect multiple GPU devices on one or several nodes to train or evaluate models.

MONAI provides demos for training and evaluating with PyTorch DDP, Horovod, and PyTorch-Ignite DDP. These demos include dataset partitioning and the dataset caching mechanism to further improve training performance. You can also find a real-world training example based on Task01 of the Decathlon challenge, a brain tumor segmentation task, which uses distributed caching for training and validation.
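To make the idea of dataset partitioning concrete, here is a minimal sketch of how item indices could be split across ranks so that every process sees an equal-sized, mostly disjoint shard. It is similar in spirit to what PyTorch's `DistributedSampler` does by default, but the function name `partition_indices` is hypothetical and this is not the code used in the MONAI demos.

```python
def partition_indices(num_items, num_ranks, rank):
    """Return the item indices assigned to `rank` (hypothetical helper)."""
    per_rank = -(-num_items // num_ranks)  # ceiling division
    total = per_rank * num_ranks
    # Wrap around so that every rank receives the same number of items.
    padded = [i % num_items for i in range(total)]
    # Strided slice: rank r takes positions r, r + num_ranks, ...
    return padded[rank:total:num_ranks]

# Ten items shared by four processes: three items each, two items repeated.
shards = [partition_indices(10, 4, rank) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

Padding with repeated items keeps the number of iterations identical on every rank, which matters because DDP synchronizes gradients at every step.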

To show the benefit of using DistributedDataParallel, we ran benchmarks ranging from 1 node/1 GPU to 4 nodes/8 GPUs (Specs: PyTorch 1.6, CUDA 11, Tesla V100 GPUs). You can see the results below:

C++/CUDA optimized modules

C++ and CUDA implementations can dramatically accelerate computationally intensive workflows, sometimes running hundreds of times faster than their original counterparts. MONAI now includes C++ and CUDA optimized modules, including image resamplers. We now also support C++ and CUDA programs in our CI, CD, and packaging processes.

Registry of Open Data on AWS

MONAI now uses the Registry of Open Data on AWS as its primary remote repository for public datasets, and monai.apps.DecathlonDataset has been updated to pull from this repository. The MONAI development team also manages the data source on AWS. All data are made available under a permissive copyright license (CC-BY-SA 4.0), which allows the data to be shared, distributed, and improved upon.

You can find out more about it on the Decathlon Data Page.

Medical image data I/O, processing and augmentation

Medical images require highly specialized methods for I/O, preprocessing, and augmentation. Many of them are stored in specialized formats that carry meta-information and high-dimensional data volumes, and they require manipulation procedures explicitly designed with those attributes in mind.

MONAI facilitates this by focusing on user-friendly, reproducible, optimized medical data pre-processing pipelines. These tenets allow MONAI to enable robust and flexible image transformations.

IO factory for medical image formats

To efficiently handle different medical image formats in the same pipeline, MONAI provides the LoadImage transform. This transform uses ITKReader as the default image reader and supports registering other readers, like NibabelReader, NumpyReader, and PILReader. The ImageReader API is relatively straightforward, and users can easily extend it with their own customized image readers.

MONAI now supports loading images in the following formats: NIfTI, DICOM, PNG, JPG, BMP, and NPY/NPZ.
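The reader-registration idea can be sketched as a small registry in which each reader declares the file suffixes it handles. The classes below (`SuffixReader`, `ToyLoader`) are hypothetical and do not reproduce the MONAI `ImageReader` API, and trying the most recently registered reader first is a design choice of this sketch.

```python
from pathlib import Path

class SuffixReader:
    """Toy reader: claims files by suffix and returns a description, not pixels."""

    def __init__(self, name, suffixes):
        self.name = name
        self.suffixes = tuple(suffixes)

    def can_read(self, filename):
        return str(filename).lower().endswith(self.suffixes)

    def read(self, filename):
        return f"{self.name} loaded {Path(filename).name}"

class ToyLoader:
    """Toy loader: dispatches a filename to the first reader that accepts it."""

    def __init__(self, default_reader):
        self.readers = [default_reader]

    def register(self, reader):
        self.readers.append(reader)

    def __call__(self, filename):
        # Try the most recently registered reader first, the default last.
        for reader in reversed(self.readers):
            if reader.can_read(filename):
                return reader.read(filename)
        raise ValueError(f"no registered reader accepts {filename!r}")

loader = ToyLoader(SuffixReader("default-reader", [".dcm", ".nii", ".nii.gz"]))
loader.register(SuffixReader("nifti-reader", [".nii", ".nii.gz"]))
loader.register(SuffixReader("image-reader", [".png", ".jpg", ".bmp"]))

print(loader("scan.nii.gz"))
print(loader("series.dcm"))
print(loader("slice.png"))
```

Because dispatch happens inside the loader, the rest of a preprocessing pipeline can stay the same regardless of which format a given file uses.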


Network architectures

Specific deep neural network architectures have proven to be particularly useful for medical imaging analysis tasks. The v0.3 release brings you reference networks designed for both flexibility and code readability. There are now implementations of UNet, DynUNet, DenseNet, GAN, AHNet, VNet, SENet (and SEResNet, SEResNeXt), and SegResNet.


Unlike regular images, a single volumetric medical image can be hundreds of megabytes. An efficient way to train on a large volumetric dataset while reducing the I/O burden is to cache part or all of the dataset in RAM. Because RAM reads and writes are much faster than disk, this may result in a higher GPU utilization rate. If only part of the dataset is cached, only that portion of the data is used during an epoch, and the cached data is dynamically replaced with data from disk between epochs. This technique is called SmartCache in the NVIDIA Clara Train SDK, and we’re using a similar approach with our SmartCacheDataset implementation in MONAI.

For example, if we have a collection of five images: [image1, image2, image3, image4, image5], using the parameters cache_num=4 and replace_rate=0.25, we may see the following sampling sequence during training:

epoch 1: [image1, image2, image3, image4]
epoch 2: [image2, image3, image4, image5]
epoch 3: [image3, image4, image5, image1]
epoch 4: [image4, image5, image1, image2]
epoch N: [image[N % 5] …]
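The sequence above can be reproduced with a short simulation. This is a sketch of the replacement pattern only, not the SmartCacheDataset implementation: the function name is hypothetical, and rounding `cache_num * replace_rate` up to a whole number of items is an assumption of the sketch.

```python
import math

def smart_cache_epochs(items, cache_num, replace_rate, num_epochs):
    """Simulate which items sit in the cache during each epoch (hypothetical helper)."""
    # Number of cached items swapped out after every epoch (assumed rounding).
    n_replace = math.ceil(cache_num * replace_rate)
    n = len(items)
    epochs = []
    for epoch in range(num_epochs):
        # The cache is a window of `cache_num` items that slides forward by
        # `n_replace` positions per epoch over the dataset, treated as circular.
        start = epoch * n_replace
        epochs.append([items[(start + j) % n] for j in range(cache_num)])
    return epochs

images = ["image1", "image2", "image3", "image4", "image5"]
schedule = smart_cache_epochs(images, cache_num=4, replace_rate=0.25, num_epochs=4)
for number, cached in enumerate(schedule, start=1):
    print(f"epoch {number}: {cached}")
```

With cache_num=4 and replace_rate=0.25, exactly one image is replaced per epoch, so every image returns to the cache after it has been rotated out.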

You can find a full example of SmartCacheDataset at Distributed Training with SmartCache.


Many domain-specific loss functions in medical imaging research are not typically used in generic computer vision tasks. MONAI has now implemented many of these loss functions, including DiceLoss, GeneralizedDiceLoss, MaskedDiceLoss, TverskyLoss, and FocalLoss.
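As a point of reference for what these losses compute, here is a minimal soft Dice loss over flattened predictions and labels in plain Python. It sketches the underlying formula only; MONAI's DiceLoss operates on tensors and offers more options, and the `eps` smoothing term here is an assumption of the sketch.

```python
def soft_dice_loss(pred, target, eps=1e-5):
    """1 - Dice overlap between predicted probabilities and binary labels."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    # eps keeps the ratio defined when both inputs are empty (all zeros).
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

perfect = soft_dice_loss([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0])
disjoint = soft_dice_loss([1.0, 0.0], [0, 1])
partial = soft_dice_loss([0.5, 0.5], [1, 0])
print(perfect, disjoint, partial)
```

Dice-style losses depend on the overlap between prediction and label rather than on per-voxel accuracy, which is useful when the structure of interest occupies only a small fraction of the volume.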


To quickly set up training and evaluation experiments, MONAI provides a set of workflows to simplify the prototyping process. The idea behind these workflows is to decouple the domain-specific components and the generic machine learning processes.

These workflows provide a set of unifying APIs for higher-level applications (AutoML, Federated Learning). The trainers and evaluators of the workflows are compatible with the pytorch-ignite Engine and Event-Handler mechanism. MONAI provides rich event handlers to attach to the trainer or evaluator independently.
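The decoupling described here can be sketched with a toy engine: the loop knows nothing about logging or checkpointing, and handlers attached to named events supply that behavior. `ToyEngine` and its event names are hypothetical and only mimic the general shape of the pytorch-ignite mechanism, not its API.

```python
class ToyEngine:
    """Minimal run loop that fires events; all behavior beyond the step is in handlers."""

    def __init__(self, step_fn):
        self.step_fn = step_fn
        self.handlers = {"epoch_completed": [], "completed": []}
        self.state = {"epoch": 0, "output": None}

    def on(self, event, handler):
        self.handlers[event].append(handler)

    def _fire(self, event):
        for handler in self.handlers[event]:
            handler(self)

    def run(self, data, max_epochs):
        for epoch in range(1, max_epochs + 1):
            self.state["epoch"] = epoch
            for batch in data:
                # The domain-specific work lives entirely in step_fn.
                self.state["output"] = self.step_fn(batch)
            self._fire("epoch_completed")
        self._fire("completed")

log = []
engine = ToyEngine(step_fn=lambda batch: sum(batch))  # stand-in for a training step
engine.on("epoch_completed", lambda e: log.append((e.state["epoch"], e.state["output"])))
engine.on("completed", lambda e: log.append("done"))
engine.run(data=[[1, 2], [3, 4]], max_epochs=2)
print(log)
```

The same engine could serve as a trainer or an evaluator simply by swapping `step_fn` and the attached handlers.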

The end-to-end training and evaluation examples are available at Workflow Examples.

Ensemble Evaluator

Ensemble Modeling is a popular strategy in machine learning and deep learning that can help you achieve a more accurate and stable output. MONAI now provides an easy way for you to utilize this strategy using the EnsembleEvaluator. Ensemble Modeling typically works by doing the following:

1. Split the training dataset into K folds.

2. Train K models, each on a different combination of K-1 folds.

3. Run inference on the test data with all K models.

4. Combine the K predictions by weighted averaging or majority voting to produce the final result.
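The steps above can be sketched in plain Python. The helper names are hypothetical, the "predictions" are made-up numbers standing in for model outputs, and this is not the EnsembleEvaluator API.

```python
from collections import Counter

def k_fold_splits(items, k):
    """Yield (train, held_out) pairs: each fold is held out once (steps 1 and 2)."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

def weighted_average(predictions, weights):
    """Combine per-model probability lists into one list (step 4, averaging)."""
    total = sum(weights)
    return [
        sum(w * p[i] for p, w in zip(predictions, weights)) / total
        for i in range(len(predictions[0]))
    ]

def majority_vote(predictions):
    """Combine per-model label lists by taking the most common label (step 4, voting)."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

splits = list(k_fold_splits(list(range(6)), k=3))
averaged = weighted_average([[0.2, 0.8], [0.6, 0.4]], weights=[1, 3])
voted = majority_vote([[0, 1, 1], [0, 1, 0], [1, 1, 0]])
print(splits, averaged, voted)
```

Weighted averaging suits probabilistic outputs, while majority voting works on discrete labels such as segmentation classes.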

You can find a full example of Ensemble Modeling at the Model Ensemble Tutorial.

Get started with MONAI on GitHub or visit our website. We would love to hear your feedback!