The 5-Step Recipe To Make Your Deep Learning Models Bug-Free

Source: Deep Learning on Medium

The 5-Step Recipe To Make Your Deep Learning Models Bug-Free

Troubleshooting Deep Neural Networks

In traditional software engineering, a bug usually causes the program to crash. While crashes are annoying for the user, they are valuable for the developer, who can inspect the error to understand why the program failed. With deep learning, we sometimes encounter explicit errors, but all too often the program runs without crashing and the model simply produces poor predictions. What’s worse, when model performance is low, there is usually no signal about why or where the model fails.

In fact, a common sentiment among practitioners is that they spend 80–90% of their time debugging and tuning models, and only 10–20% deriving math equations and implementing things.

Suppose that you are trying to reproduce a result from a research paper for your work, but your results are worse. You might wonder why the performance of your model is significantly worse than that of the paper you’re trying to reproduce.

There are many different things that can cause this:

  • Implementation bugs. Most bugs in deep learning are actually invisible: the code runs, but the model quietly underperforms.
  • Hyper-parameter choices. Deep learning models are very sensitive to hyper-parameters; even subtle choices of learning rate and weight initialization can make a big difference.
  • Data/model fit. For example, you pre-train your model on ImageNet data and fine-tune it on self-driving car images, which are harder to learn.
  • Dataset construction. Common issues here include not having enough examples, noisy labels, imbalanced classes, and train and test sets with different distributions.

I recently attended the Full Stack Deep Learning Bootcamp on the UC Berkeley campus, a wonderful course that teaches full-stack production deep learning. Josh Tobin delivered a great lecture on troubleshooting deep neural networks. Drawing on Josh’s lecture, this article provides a mental recipe for improving a deep learning model’s performance, assuming that you already have an initial test dataset, a single metric to improve, and a target performance based on human-level performance, published results, previous baselines, etc.

Josh Tobin at Full Stack Deep Learning Bootcamp, November 2019

Note: You can also watch the version of Josh’s talk at Reinforceconf 2019 and go through the full guide on Josh’s website.

1 — Start Simple

The first step in the troubleshooting workflow is to start simple.

There are a few things to consider when you want to start simple. The first is how to choose a simple architecture: one that is easy to implement and likely to get you part of the way toward solving your problem without introducing many bugs.

Architecture selection is one of the more intimidating parts of getting into deep learning because there are tons of papers coming out all the time claiming state-of-the-art results on some problem, and they get complicated fast. If you are chasing maximal performance, architecture selection is genuinely challenging. But when starting on a new problem, you can follow a simple set of rules to pick an architecture that does a decent job on the problem you’re working on.

  • If your data looks like images, start with a LeNet-like architecture and consider using something like ResNet as your codebase gets more mature.
  • If your data looks like sequences, start with an LSTM with one hidden layer and/or temporal/classical convolutions. Then, when your problem gets more mature, you can move to an Attention-based model or a WaveNet-like model.
  • For all other tasks, start with a fully-connected neural network with one hidden layer and use more advanced networks later depending on the problem.
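The starter architectures above can be sketched in PyTorch. This is a hypothetical example, not code from the lecture: the layer sizes, input dimensions, and class counts are illustrative choices.

```python
import torch
import torch.nn as nn

# A one-hidden-layer fully connected network: the suggested starting point
# for tasks that are neither images nor sequences. Sizes are arbitrary.
mlp = nn.Sequential(
    nn.Linear(20, 64),   # 20 input features, 64 hidden units
    nn.ReLU(),
    nn.Linear(64, 3),    # 3 output classes
)

# A LeNet-like convolutional network: the suggested starting point for images.
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 4 * 4, 120), nn.ReLU(),  # 28x28 input shrinks to 4x4 here
    nn.Linear(120, 10),
)

print(mlp(torch.randn(8, 20)).shape)            # batch of 8 tabular rows
print(lenet(torch.randn(8, 1, 28, 28)).shape)   # batch of 8 MNIST-sized images
```

Either model is small enough to implement, debug, and overfit in minutes, which is exactly the point of starting simple.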

In reality, the input data often contains several of the modalities above. So how do you feed multiple input modalities into a neural network? Here is the 3-step strategy that Josh recommended:

  • First, map each modality into a lower-dimensional feature space. For example, images are passed through a ConvNet and words are passed through an LSTM.
  • Then, flatten the outputs of those networks to get a single vector for each input, and concatenate those vectors.
  • Finally, we pass them through some fully-connected layers to an output.
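The 3-step strategy above can be sketched in PyTorch as follows. This is a hypothetical example: the encoder choices and all layer sizes are illustrative, not values from the lecture.

```python
import torch
import torch.nn as nn

class MultiModalNet(nn.Module):
    """Sketch of the 3-step fusion strategy: encode each modality,
    flatten and concatenate, then apply fully-connected layers."""

    def __init__(self):
        super().__init__()
        # Step 1: map each modality into a lower-dimensional feature space.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),          # -> (B, 8, 4, 4)
        )
        self.text_encoder = nn.LSTM(input_size=50, hidden_size=32, batch_first=True)
        # Step 3: fully-connected layers from the fused vector to the output.
        self.head = nn.Sequential(
            nn.Linear(8 * 4 * 4 + 32, 64), nn.ReLU(), nn.Linear(64, 5),
        )

    def forward(self, image, text):
        img_feat = self.image_encoder(image).flatten(1)  # step 2: flatten -> (B, 128)
        _, (h_n, _) = self.text_encoder(text)
        txt_feat = h_n[-1]                               # last hidden state -> (B, 32)
        fused = torch.cat([img_feat, txt_feat], dim=1)   # step 2: concatenate
        return self.head(fused)

out = MultiModalNet()(torch.randn(2, 3, 32, 32), torch.randn(2, 10, 50))
print(out.shape)
```

Note that each encoder can be developed and debugged on its own before the fusion head is added.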

After choosing a simple architecture, the next thing to do is to select sensible hyper-parameter defaults to start out with. Here are the defaults that Josh recommended:

  • Optimizer: Adam with a learning rate of 3e-4.
  • Activations: ReLU for fully-connected and convolutional models, tanh for LSTMs.
  • Initialization: He initialization for ReLU, Glorot initialization for tanh.
  • Regularization and data normalization: none to start.

The next step is to normalize the input data, which means subtracting the mean and dividing by the standard deviation. Note that for images, it’s fine to simply scale values to [0, 1] or [-0.5, 0.5] (for example, by dividing by 255).
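Both forms of normalization are a couple of lines in NumPy; the data below is made up for illustration:

```python
import numpy as np

x = np.random.rand(1000, 20) * 50 + 10   # fake un-normalized tabular features

# Standardize: subtract the per-feature mean, divide by the standard deviation.
x_norm = (x - x.mean(axis=0)) / x.std(axis=0)

# For images it is often enough to rescale the raw pixel values instead:
img = np.random.randint(0, 256, size=(28, 28)).astype(np.float32)
img_01 = img / 255.0               # values in [0, 1]
img_centered = img / 255.0 - 0.5   # values in [-0.5, 0.5]
```

In practice, compute the mean and standard deviation on the training set only and reuse them for validation and test data.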

The final thing you should do is to consider simplifying the problem itself. If you have a complicated problem with massive data and tons of classes to deal with, then you should consider:

  • Working with a small training set around 10,000 examples.
  • Using a fixed number of objects, classes, input size…
  • Creating a simpler synthetic training set like in research labs.

This is important because (1) you will have reasonable confidence that your model should be able to solve the simplified problem, and (2) your iteration speed will increase.

The diagram below neatly summarizes how to start simple:


2 — Implement and Debug

To give you a preview, below are the 5 most common bugs in deep learning models that Josh recognized:

  • Incorrect shapes for the network tensors: This bug is common and can fail silently. A lot of the time, it happens because the automatic differentiation systems in deep learning frameworks do silent broadcasting: tensors take on unexpected shapes in the network and cause problems downstream.
  • Pre-processing inputs incorrectly: For example, you forget to normalize your inputs or apply too much input pre-processing (over-normalization and excessive data augmentation).
  • Incorrect input to the model’s loss function: For example, you pass softmax outputs to a loss that expects logits.
  • Forgetting to set up train mode for the network correctly: For example, not toggling train/evaluation mode or not controlling batch norm dependencies.
  • Numerical instability: For example, you get `inf` or `NaN` as outputs. This bug often stems from using an exponent, a log, or a division operation somewhere in the code.
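The first bug in the list, silent broadcasting, is easy to reproduce in a few lines. This is a made-up NumPy example of a shape mismatch that raises no error but silently corrupts a loss computation:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])          # shape (4,)
y_pred = np.array([[1.1], [2.1], [2.9], [4.2]])  # shape (4, 1) — note the extra axis

# Buggy: (4,) minus (4, 1) silently broadcasts to a (4, 4) matrix, so the
# "mean squared error" below averages over 16 numbers instead of 4.
buggy_mse = ((y_true - y_pred) ** 2).mean()

# Fixed: make the shapes match before subtracting.
correct_mse = ((y_true - y_pred.squeeze(-1)) ** 2).mean()

print(buggy_mse, correct_mse)
```

No exception is ever raised; the training loss is simply wrong, which is exactly why shape bugs are so hard to spot.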

Here are 3 pieces of general advice for implementing your model:

  • Start with a lightweight implementation. You want the minimum possible number of new lines of code for the first version of your model; the rule of thumb is fewer than 200 lines. (This doesn’t count tested infrastructure components or TensorFlow/PyTorch code.)
  • Use off-the-shelf components such as Keras if possible, since most of the stuff in Keras works well out-of-the-box. If you have to use TensorFlow, then use the built-in functions, don’t do the math yourself. This would help you avoid a lot of the numerical instability issues.
  • Build complicated data pipelines later. These are important for large-scale ML systems, but you should not start with them because data pipelines themselves can be a big source of bugs. Just start with a dataset that you can load into memory.

The first step of implementing bug-free deep learning models is getting your model to run at all. There are a few things that can prevent this from happening:

  • Shape mismatch / Casting issue: To address this type of problem, you should step through your model creation and inference step-by-step in a debugger, checking for correct shapes and data types of your tensors.
  • Out-of-memory issues: These can be very difficult to debug. Scale back your memory-intensive operations one by one; for example, if you create large matrices anywhere in your code, reduce the size of their dimensions or cut your batch size in half.
  • Other issues: You can simply Google them. Stack Overflow will have the answer most of the time.

Let’s zoom in on the process of stepping through model creation in a debugger and talk about debuggers for deep learning code:

  • In PyTorch, you can use ipdb — which exports functions to access the interactive IPython debugger.
  • In TensorFlow, it’s trickier. TensorFlow separates the process of creating the graph from executing operations in the graph. There are 3 options you can try: (1) step through the graph creation itself and inspect each tensor layer, (2) step into the training loop and evaluate the tensor layers, or (3) use the TensorFlow Debugger (tfdbg), which does options 1 and 2 automatically.
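In PyTorch, the inspection you would do from an ipdb prompt looks roughly like the following non-interactive sketch (the model here is a made-up example; in practice you would place `import ipdb; ipdb.set_trace()` before the forward pass and examine tensors interactively):

```python
import torch
import torch.nn as nn

# A made-up model and batch to step through.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
x = torch.randn(8, 20)

# Walk the forward pass layer by layer, checking the shape and dtype of
# every intermediate tensor — the same checks you would run from ipdb.
for layer in model:
    x = layer(x)
    print(type(layer).__name__, tuple(x.shape), x.dtype)
```

The goal is to catch the shape-mismatch and casting bugs described above at the exact layer where they first appear.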

After getting your model to run, the next thing you need to do is to overfit a single batch of data. This is a heuristic that can catch an absurd number of bugs. This really means that you want to drive your training error arbitrarily close to 0.
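A minimal overfit-one-batch loop might look like this sketch in PyTorch. The data and model here are made up for illustration; the point is only the pattern of training on one fixed batch until the loss approaches zero:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One fixed batch of made-up data. If the model cannot drive its training
# loss on this single batch close to zero, something is wrong with the
# model, the loss, or the optimization.
x = torch.randn(32, 10)
y = torch.randint(0, 3, (32,))

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # expects logits, not softmax outputs

for step in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

print(loss.item())  # expected to approach 0
```

If the loss refuses to drop, the failure modes listed below tell you where to look.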


There are a few things that can happen when you try to overfit a single batch and it fails:

  • Error goes up: Commonly this is due to a flipped sign somewhere in the loss function or the gradient.
  • Error explodes: This is usually a numerical issue, but can also be caused by a high learning rate.
  • Error oscillates: You can lower the learning rate and inspect the data for shuffled labels or incorrect data augmentation.
  • Error plateaus: You can increase the learning rate and get rid of regularization. Then you can inspect the loss function and the data pipeline for correctness.

Once your model overfits a single batch, there can still be other issues that cause bugs. The last step here is to compare your results to a known result. So what sort of known results are useful?

  • The most useful results come from an official model implementation evaluated on a similar dataset to yours. You can step through the code in both models line-by-line and ensure your model has the same output. You want to ensure that your model performance is up to par with expectations.
  • If you can’t find an official implementation on a similar dataset, you can compare your approach to results from an official model implementation evaluated on a benchmark dataset. You most definitely want to walk through the code line-by-line and ensure you have the same output.
  • If there is no official implementation of your approach, you can compare it to results from an unofficial model implementation. You can review the code the same as before, but with lower confidence because almost all the unofficial implementations on GitHub have bugs.
  • Then, you can compare to results from a paper with no code (to ensure that your performance is up to par with expectations), results from your model on a benchmark dataset (to make sure your model performs well in a simpler setting), and results from a similar model on a similar dataset (to help you get a general sense of what kind of performance can be expected).
  • An under-rated source of results comes from simple baselines (for example, predicting the average output, or linear regression), which can help make sure that your model is learning anything at all.
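The simple baselines in the last bullet cost only a few lines. This is a hypothetical NumPy example on made-up regression data: any deep model that cannot beat these two numbers is probably not learning anything.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))                                  # made-up features
y = x @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Baseline 1: always predict the mean of the training targets.
mean_mse = ((y - y.mean()) ** 2).mean()

# Baseline 2: ordinary least-squares linear regression.
w, *_ = np.linalg.lstsq(x, y, rcond=None)
linreg_mse = ((y - x @ w) ** 2).mean()

print(mean_mse, linreg_mse)
```

Comparing your model's error against both numbers gives you a sanity floor before you start chasing state-of-the-art results.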

The diagram below neatly summarizes how to implement and debug deep neural networks:


3 — Evaluate

The next step is to evaluate your model performance and use that evaluation to prioritize what you are going to do to improve it. You want to apply the bias-variance decomposition concept here.