Scaling Up and Down: Why we moved to PyTorch Lightning

Engineering Modular Neural Networks

co-authored with Sarah Jane Hong and Darryl Barnhart

Creating and training a model at scale has its challenges. Starting with a single model is easy enough. But as you scale your model up, you’ll need features like gradient accumulation, mixed precision, and hyperparameter scheduling. Then, as you scale out, you need to worry about all the details of how to set up a proper distributed training environment.
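
To give a sense of the bookkeeping involved, here is a minimal sketch of just two of those features, gradient accumulation and mixed precision, hand-rolled in plain PyTorch with the torch.cuda.amp API. The model, optimizer, and data below are toy stand-ins, and the accumulation factor is arbitrary.

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"              # mixed precision only pays off on GPU

# Toy stand-ins so the sketch runs end to end.
model = nn.Linear(32, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
dataloader = [(torch.randn(8, 32), torch.randn(8, 1)) for _ in range(16)]

accumulation_steps = 4                  # arbitrary: batches to accumulate over
scaler = GradScaler(enabled=use_amp)    # handles loss scaling for fp16

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader):
    inputs, targets = inputs.to(device), targets.to(device)
    with autocast(enabled=use_amp):     # forward pass in mixed precision
        loss = criterion(model(inputs), targets) / accumulation_steps
    scaler.scale(loss).backward()       # scaled backward pass
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)          # unscale gradients, apply the update
        scaler.update()
        optimizer.zero_grad()
```

Every one of these details is something you have to carry along, test, and keep consistent as the model and the training setup change.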

With all that, you still want to make sure you maintain a fast feedback loop as you try out new ideas, so you’re constantly having to scale up and down. Maintaining all those features takes a lot of work, and it’s easy for modern deep learning codebases to grow overly complex.

At Latent Space, we’re trying to push the state of the art in generative modelling while improving on metrics as wide-ranging as disentanglement, model distillation, and temporal consistency. To help us iterate quickly, we created an internal framework called Lab to dynamically build models and track experiments.

The purpose of Lab is to make it incredibly easy to test different model configurations and architectures quickly. It can:

  • dynamically compose models from different blocks by swapping them in and out using gin configurations (see the sketch after this list).
  • sweep over different model architectures, also using gin.
  • train models on the cloud by spinning up instances, provisioning, and running them on one or many GPUs/TPUs, including multi-node scenarios.
  • generate synthetic datasets to overcome the limitations of current academic datasets.
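
As a rough illustration of the gin-driven composition (these are hypothetical block names for the sketch, not Lab’s actual modules), swapping one part of a model for another can come down to a one-line binding in a gin config:

```python
import gin

# Hypothetical building blocks; @gin.configurable makes their constructor
# arguments bindable from a gin config.
@gin.configurable
class ConvEncoder:
    def __init__(self, width=128):
        self.width = width

@gin.configurable
class ResNetEncoder:
    def __init__(self, width=256, depth=18):
        self.width, self.depth = width, depth

@gin.configurable
class Autoencoder:
    def __init__(self, encoder_cls=ConvEncoder):
        self.encoder = encoder_cls()

# In practice these bindings would live in a .gin file loaded with
# gin.parse_config_file(); an inline string keeps the sketch self-contained.
gin.parse_config("""
    Autoencoder.encoder_cls = @ResNetEncoder
    ResNetEncoder.width = 512
""")

model = Autoencoder()   # now built with a ResNetEncoder of width 512
```

Sweeping over architectures then amounts to generating or selecting different sets of bindings, with no changes to the model code itself.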

Using Lab, we’ve been able to create and continually iterate on complex models and datasets, but we started running into some issues.