How to make your deep learning experiments reproducible and your code extendible

The original article was published on AI Magazine.

Improving your deep learning code quality (Part I)

Lessons learned from building an open-source deep learning for time series framework.

Photo by author (taken while hiking at the Cutler Coast Preserve in Machias ME)

Note this is roughly based on a presentation I made back in February at the Boston Data Science Meetup Group. You can find the full slide deck here. I have also included some more recent experiences and insights as well as answers to common questions that I have encountered.


When I first started my river forecasting research, I envisioned using just a notebook. However, it became clear to me that effectively tracking experiments and optimizing hyper-parameters would play a crucial role in the success of any river flow model, particularly as I wanted to forecast flows for more than 9,000 rivers around the United States. This led me to develop flow forecast, which is now a multi-purpose deep learning for time series framework.


One of the biggest challenges in machine learning (particularly deep learning) is being able to reproduce experiment results. Several others have touched on this issue, so I will not spend too much time discussing it. For a good overview of why reproducibility is important, see Joel Grus’s talk and slide deck. TLDR: in order to build upon prior research, we need to make sure it worked in the first place. Similarly, to deploy models we have to be able to easily find the artifacts of the best “one.”

One of my first recommendations to enable reproducible experiments is NOT to use Jupyter Notebooks or Colab (at least not in their entirety). Jupyter Notebooks and Colab are great for rapidly prototyping or leveraging an existing code-base (as I will discuss in a second) to run experiments, but not for building out your models or other features.

Write High Quality Code

  1. Write unit tests (preferably as you are writing the code)

I have found test-driven development very effective in the machine learning space. One of the first questions people often ask me is: how do I write tests for a model when I don’t know what its outputs will be? Fundamentally, your unit tests should fall into one of four categories:

(a) Test that your model’s returned representations are the proper size.

This is probably one of the easiest tests to write. You simply check whether the shape of the returned tensor is correct.
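A minimal sketch of such a shape test, using a hypothetical toy model (the class name and dimensions are illustrative only; substitute your own model):

```python
import torch
import torch.nn as nn


class TinyForecaster(nn.Module):
    """Hypothetical toy time series model, used only to illustrate the test."""

    def __init__(self, n_features: int, forecast_len: int):
        super().__init__()
        self.linear = nn.Linear(n_features, forecast_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_features) -> (batch, forecast_len)
        return self.linear(x.mean(dim=1))


def test_output_shape():
    model = TinyForecaster(n_features=3, forecast_len=5)
    x = torch.rand(8, 20, 3)  # batch of 8 sequences, 20 steps, 3 features
    out = model(x)
    # The only assertion we need: the returned representation has the right size.
    assert out.shape == (8, 5)
```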

(b) Test that your models initialize properly for the parameters you specify and that the right parameters are trainable

Another relatively simple unit test is to make sure the model initializes in the way you expect it to and that the proper parameters are trainable. While this may seem obvious, you would be surprised how many bugs emerge from this.
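For instance, if a layer is meant to stay frozen, a test can assert that only the intended parameters require gradients. This sketch uses a hypothetical model with a deliberately frozen embedding:

```python
import torch.nn as nn


class FrozenEmbeddingModel(nn.Module):
    """Hypothetical model with a frozen embedding layer (illustration only)."""

    def __init__(self, vocab_size: int = 100, dim: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.embed.weight.requires_grad = False  # intentionally frozen
        self.head = nn.Linear(dim, 1)


def test_trainable_parameters():
    model = FrozenEmbeddingModel()
    trainable = [n for n, p in model.named_parameters() if p.requires_grad]
    # Only the head should be trainable; the embedding must stay frozen.
    assert trainable == ["head.weight", "head.bias"]
```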

(c) Test the logic of custom loss functions and training loops:

People often ask me how to do this. I’ve found the best way is to create a dummy model with a known result to test the correctness of custom loss functions, metrics, and training loops. For instance, you could create a PyTorch model that only returns 0, then use it to write a unit test that checks whether the loss computation and training loop are correct.
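A sketch of that idea: with an all-zero model, the expected loss value can be computed by hand, so the test pins the loss function to an exact number.

```python
import torch
import torch.nn as nn


class ZeroModel(nn.Module):
    """Dummy model that always predicts zero, so loss values are known in advance."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.zeros(x.shape[0], 1)


def test_mse_against_known_value():
    model = ZeroModel()
    x = torch.rand(4, 10)
    targets = torch.full((4, 1), 2.0)
    preds = model(x)
    loss = nn.MSELoss()(preds, targets)
    # With all-zero predictions and targets of 2.0, MSE must be exactly 2.0**2 = 4.0.
    assert loss.item() == 4.0
```

The same trick works for a custom loss or a hand-rolled training loop: feed it the dummy model and assert the value you computed on paper.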

(d) Test the logic of data pre-processing/loaders

Another major thing to test is that your data loaders output data in the format you expect and handle problematic values. Problems with data quality are a huge issue in machine learning, so it is important to make sure your data loaders are properly tested. For instance, you should write tests to check that NaN/Null values are handled in the way you expect.
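A sketch of such a test, assuming a hypothetical pre-processing helper (`interpolate_missing` is an illustrative name, not part of any real library) that fills NaN values by linear interpolation:

```python
import numpy as np
import pandas as pd


def interpolate_missing(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Hypothetical pre-processing step: linearly interpolate NaN values in a column."""
    df = df.copy()
    df[col] = df[col].interpolate(limit_direction="both")
    return df


def test_nan_handling():
    df = pd.DataFrame({"flow": [1.0, np.nan, 3.0]})
    cleaned = interpolate_missing(df, "flow")
    # No NaNs should survive pre-processing.
    assert not cleaned["flow"].isna().any()
    # The gap between 1.0 and 3.0 should be filled with the linear midpoint.
    assert cleaned["flow"].iloc[1] == 2.0
```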

Finally, I also recommend using tools like CodeCov and CodeFactor. They are useful for automatically determining your code’s test coverage.

Recommended tools: Travis-CI, CodeCov

2. Utilize integration tests for end-to-end code coverage

Having unit tests is good, but it is also important to make sure your code runs properly in an end-to-end fashion. For instance, I’ve sometimes found that a model’s unit tests pass, only to discover that the way I was passing the configuration file to the model didn’t work. As a result, I now add integration tests for every new model I add to the repository. Integration tests can also demonstrate how to use your models; for instance, I often use the configuration files from my integration tests as the backbone of my full parameter sweeps.
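A minimal sketch of a config-driven integration test. The `run_experiment` function and config keys are hypothetical, standing in for whatever entry point your framework exposes; the point is that the test exercises the full path from configuration to trained model:

```python
import torch
import torch.nn as nn


def run_experiment(config: dict) -> float:
    """Hypothetical end-to-end runner: build a model from a config, train briefly
    on random data, and return the final loss. Names are illustrative only."""
    model = nn.Linear(config["n_features"], config["forecast_len"])
    opt = torch.optim.Adam(model.parameters(), lr=config["lr"])
    loss_fn = nn.MSELoss()
    x = torch.rand(32, config["n_features"])
    y = torch.rand(32, config["forecast_len"])
    loss = loss_fn(model(x), y)
    for _ in range(config["epochs"]):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()


def test_end_to_end():
    # The same config file/dict can later seed your real parameter sweeps.
    config = {"n_features": 3, "forecast_len": 1, "lr": 0.01, "epochs": 5}
    final_loss = run_experiment(config)
    # The assertion is loose on purpose: the test's job is to prove the
    # config-to-training pipeline runs without errors.
    assert isinstance(final_loss, float)
```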

3. Utilize both type hints and document strings:

Having both type hints and docstrings greatly increases readability, particularly when you are passing around tensors. When I’m coding, I frequently look back at the docstrings to remember what shape my tensors are. Without them, I have to manually print the shape, which wastes time and potentially adds garbage that you later forget to remove.
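A small example of the style this enables. The function is illustrative; the useful habit is recording tensor shapes directly in the docstring so no one has to print them:

```python
import torch


def attention_scores(query: torch.Tensor, key: torch.Tensor) -> torch.Tensor:
    """Compute scaled dot-product attention scores.

    Args:
        query: Tensor of shape (batch, n_queries, d_model).
        key: Tensor of shape (batch, n_keys, d_model).

    Returns:
        Tensor of shape (batch, n_queries, n_keys).
    """
    d_model = query.shape[-1]
    # (batch, n_queries, d_model) @ (batch, d_model, n_keys) -> (batch, n_queries, n_keys)
    return query @ key.transpose(1, 2) / d_model ** 0.5
```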

4. Create good documentation

I’ve found the best time to create documentation for machine learning projects is while I’m writing the code, or even before. Laying out the architectural design of ML models and how they will interface with existing classes often saves me considerable time during implementation and forces me to think critically about the ML decisions I make. Additionally, once the implementation is complete, you already have a good start on informing your peers/researchers on how to use your model.

Of course, you will need to add some things, like the specifics of what parameters you pass and their types. For documentation, I generally record broader architectural decisions in Confluence (or GitHub Wiki pages if Confluence is unavailable), whereas specifics about the code/parameters go in ReadTheDocs. As we will discuss in Part II, having good initial documentation also makes it easy to add model results and explain why your design works.

Tools: ReadTheDocs, Confluence

5. Leverage peer reviews

Peer review is another critical step in making sure your code is correct before you run your experiments. Oftentimes, a second pair of eyes can help you avoid all sorts of problems. This is another good reason not to use Jupyter Notebooks, as reviewing notebook diffs is almost impossible.

As a peer reviewer it is also important to take time to go through the code line by line and comment where you don’t understand something. I often see reviewers just quickly approve all changes.

A recent example: recently, while adding meta-data support for DA-RNN, I encountered a bug. This section of code did have an integration test, but unfortunately it lacked a comprehensive unit test. As a result, several experiments that I ran, which I thought used meta-data, turned out not to. The problem was very simple: on line 23, I forgot to include the meta-data when calling the encoder.

This problem could likely have been averted by writing a unit test to check that the parameters of the meta-data model are updated on the forward pass, or by a test checking that the results with and without the meta-data are not equal.
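The second of those checks can be sketched as follows, using a hypothetical encoder that optionally fuses a meta-data vector (the class and its structure are illustrative, not the actual DA-RNN code):

```python
import torch
import torch.nn as nn


class MetaModel(nn.Module):
    """Hypothetical encoder that can optionally fuse a meta-data vector."""

    def __init__(self, dim: int = 4):
        super().__init__()
        self.core = nn.Linear(dim, dim)
        self.meta_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, meta: torch.Tensor = None) -> torch.Tensor:
        out = self.core(x)
        if meta is not None:
            out = out + self.meta_proj(meta)
        return out


def test_meta_data_changes_output():
    torch.manual_seed(0)
    model = MetaModel()
    x = torch.rand(2, 4)
    meta = torch.rand(2, 4)
    # If the meta-data were silently dropped (as in the bug above),
    # these two outputs would be identical and the test would fail.
    assert not torch.allclose(model(x), model(x, meta))
```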

In Part II of this series, I will talk about how to actually make your experiments reproducible now that you have high-quality and (mostly) bug-free code. Specifically, I will look at things like data versioning, logging experiment results, and tracking parameters.

Relevant Articles and Resources:

Unit testing machine learning code

Joel Grus Why I hate Notebooks