Tips for Debugging Machine Learning & Deep Learning

Source: Deep Learning on Medium

This article covers tips and tricks for debugging issues in machine learning and deep learning. We will discuss common errors seen during training and hyperparameter tuning, along with some potential causes.

Best Practice

  • Always test the architecture by running it for 1–2 epochs before training it for real.
  • Use model.summary() in TensorFlow (Keras) or print(model) in PyTorch to view the model architecture. This is especially useful for CNNs.
  • Try the model architecture on a known dataset first.
  • Write unit tests and test functions along the way to minimize code logic errors.
  • These unit tests can be reused, which is really important.
  • Stanford’s CNN course repeatedly recommends testing networks by initializing weights and parameters from a known distribution. This makes the initial result predictable, so architecture issues can be spotted early.
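
The known-initialization idea in the last bullet can be sketched as a small unit test. This is a minimal illustration, not the course's own code: it assumes a hypothetical one-layer softmax classifier written in NumPy, and the names forward and cross_entropy are made up for this example. With all-zero weights every class gets the same probability, so the initial loss should equal ln(num_classes).

```python
import numpy as np

def forward(x, weights, bias):
    """Hypothetical one-layer classifier: returns class probabilities."""
    logits = x @ weights + bias
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true labels."""
    return float(-np.log(probs[np.arange(len(labels)), labels]).mean())

num_features, num_classes = 20, 10
rng = np.random.default_rng(0)
x = rng.normal(size=(32, num_features))
labels = rng.integers(0, num_classes, size=32)

# Known initialization (an assumption of this sketch): zero weights
# give uniform predictions, so the loss is predictable.
weights = np.zeros((num_features, num_classes))
bias = np.zeros(num_classes)

initial_loss = cross_entropy(forward(x, weights, bias), labels)
expected_loss = np.log(num_classes)
assert abs(initial_loss - expected_loss) < 1e-9, "architecture or loss bug"
```

If the assertion fails before any training has happened, the problem is in the forward pass or the loss, not in the data or the optimizer.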

Dimension Error

Ultimately, machine learning and deep learning use a lot of linear algebra: matrices and matrix multiplications. Dimension errors can occur, and most modern ML and DL libraries will point out the mismatched shapes. Take two matrices with shapes (num_1, num_2) and (num_3, num_4), for example. For the matrix multiplication to work, num_2 must equal num_3. It is also important to think about where the actual values of num_2 and num_3 come from.
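
A minimal NumPy sketch of that shape rule follows. The helper check_matmul_shapes and the example shapes are hypothetical, chosen only to show a matching pair and a mismatched pair.

```python
import numpy as np

def check_matmul_shapes(a, b):
    """Raise a readable error if a @ b cannot work for 2-D arrays."""
    (num_1, num_2), (num_3, num_4) = a.shape, b.shape
    if num_2 != num_3:
        raise ValueError(
            f"inner dimensions differ: {a.shape} @ {b.shape} "
            f"needs {num_2} == {num_3}"
        )
    return (num_1, num_4)  # shape of the product

batch = np.ones((32, 100))         # 32 samples, 100 features each
good_weights = np.ones((100, 10))  # 100 inputs -> 10 outputs
bad_weights = np.ones((64, 10))    # built for 64 inputs, not 100

result_shape = check_matmul_shapes(batch, good_weights)
assert (batch @ good_weights).shape == result_shape

try:
    check_matmul_shapes(batch, bad_weights)
    mismatch_message = None
except ValueError as err:
    mismatch_message = str(err)
```

The useful debugging question is the one the message hints at: why does one side have 100 features while the other was built for 64?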

For example, in transfer learning with VGG16, the flattened output of the last feature layer has 25088 values. It is important to figure out where that number comes from in the VGG16 architecture. If the in_features count of the new classifier that replaces the old one does not match it, there will be an error.
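
As a worked check of where 25088 comes from, the arithmetic can be written out. This sketch assumes the standard VGG16 configuration: a 224×224 input, five 2×2 max-pooling layers, and 512 channels in the last convolutional block. The variable names are illustrative only.

```python
input_size = 224         # assumed input height and width in pixels
num_pool_layers = 5      # each 2x2 max-pool halves height and width
last_block_channels = 512

spatial_size = input_size // (2 ** num_pool_layers)           # 224 / 32 = 7
flattened_features = last_block_channels * spatial_size ** 2  # 512 * 7 * 7

# The replacement classifier's first layer must accept this many inputs.
new_classifier_in_features = flattened_features
```

With a different input size or a different backbone the number changes, which is why it pays to derive it rather than copy it.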

CUDA: model and data not on the same device

PyTorch raises an error when tensor types do not match. Usually the message says it expected a CUDA tensor but got a CPU tensor. This happens when the model and data were not moved to CUDA (the GPU) together, or were not moved back to the CPU together. To debug, inspect variableName.device or variableName.type(); in recent PyTorch versions the built-in type(variableName) reports only torch.Tensor. The model and data must live on the same device for training to happen.
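
One common way to keep the model and data together is to pick a single device and move both to it. The sketch below assumes PyTorch is installed; the tiny nn.Linear model and the random batch are placeholders for illustration.

```python
import torch
from torch import nn

# Pick one device and use it for both the model and the data.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2).to(device)      # hypothetical tiny model
inputs = torch.randn(8, 4).to(device)   # hypothetical batch

# Debugging aid: compare where the parameters and the inputs live.
model_device = next(model.parameters()).device
assert model_device == inputs.device, (
    f"model on {model_device}, data on {inputs.device}"
)

outputs = model(inputs)                 # works: same device
predictions = outputs.detach().cpu()    # move back to the CPU together
```

The same script then runs unchanged on a machine without a GPU, because both the model and the data fall back to the CPU.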


Some developers prefer interactive debuggers. Jupyter Notebook allows the Python debugger (pdb) to be used in code cells, which is very useful for PyTorch. For TensorFlow, TensorBoard can log and visualize training to help with debugging.

Interactive Console IDE

Anaconda provides tools for developing data science and ML projects, including the Spyder IDE.

Some data scientists we interviewed mentioned PyCharm for Python.

JupyterLab is set to replace Jupyter Notebook. JupyterLab allows opening and working across several files in a folder, not just one notebook.

Virtual environment & anaconda environment


Multicollinearity happens when feature columns are linearly dependent on each other. High correlation between feature columns, and especially near-perfect correlation between a feature column and the target column, can cause issues with the model: the target becomes trivially predictable, and the model does not generalize to real-world inputs.
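
A quick way to look for linearly dependent columns is to compare the matrix rank with the number of columns and to inspect the pairwise correlations. The sketch below uses synthetic NumPy data in which the third column is deliberately built from the first two; the data and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = 2.0 * x1 + x2   # exactly a linear combination of x1 and x2
features = np.column_stack([x1, x2, x3])

# Fewer independent columns than total columns means linear dependence.
rank = np.linalg.matrix_rank(features)
has_linear_dependence = rank < features.shape[1]

# Pairwise correlations (columns as variables) show which features overlap.
correlations = np.corrcoef(features, rowvar=False)
```

On real data the dependence is rarely exact, so the correlation matrix is usually the more informative of the two checks.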

Data Leak

If a data leak makes the target column predictable from the features, the model will also be of little use: it cannot generalize to new, real-world data.
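
One rough screen for this kind of leak is to flag features that correlate almost perfectly with the target. The helper suspicious_features, the 0.95 threshold, and the synthetic data below are all assumptions for illustration; a correlation check only catches linear leaks, so treat it as a first pass.

```python
import numpy as np

def suspicious_features(features, target, threshold=0.95):
    """Return column indices whose |correlation| with target exceeds threshold."""
    flagged = []
    for col in range(features.shape[1]):
        corr = np.corrcoef(features[:, col], target)[0, 1]
        if abs(corr) > threshold:
            flagged.append(col)
    return flagged

rng = np.random.default_rng(1)
target = rng.normal(size=500)
honest_feature = rng.normal(size=500)                      # unrelated to target
leaky_feature = target + rng.normal(scale=0.01, size=500)  # target in disguise
features = np.column_stack([honest_feature, leaky_feature])

flagged = suspicious_features(features, target)
```

Any flagged column deserves a look at how it was produced, for instance whether it was computed after the outcome was already known.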

Hyperparameter tuning

Is your model taking forever to train? Your learning rate may be too small, such as 0.0001. It could also mean you are using GridSearchCV, which exhaustively tries every combination of values in the parameter grid. Randomized approaches such as RandomizedSearchCV are usually faster because they sample only a fixed number of combinations. Random starts with random initialization can also be faster. If the cross-validation requires changing the kernel, that can take a while. If the hyperparameter tuning changes the distance function, distances must be recomputed for all data points, which can also take a while.
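
To see why an exhaustive grid gets expensive, count the model fits. The search space below is hypothetical, and the sketch only counts combinations with the standard library; it does not call scikit-learn.

```python
import itertools
import random

# Hypothetical search space, for illustration only.
param_grid = {
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],
    "max_depth": [2, 4, 6, 8, 10],
    "n_estimators": [50, 100, 200],
}
cv_folds = 5

# Exhaustive grid search: every combination, each fitted once per fold.
all_combinations = list(itertools.product(*param_grid.values()))
grid_fits = len(all_combinations) * cv_folds

# Random search: only n_iter sampled combinations are fitted.
n_iter = 10
random.seed(0)
sampled = random.sample(all_combinations, n_iter)
random_fits = n_iter * cv_folds
```

Here the grid needs 4 × 5 × 3 = 60 combinations and 300 fits, while the random search needs 50 fits. Adding one more hyperparameter multiplies the grid cost but leaves the random-search budget unchanged.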

Is your result not improving? Try a smaller learning rate. The learning rate usually cannot be as large as 0.5; with steps that big, the optimizer can overshoot and fail to find the optimum.
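
The effect of the learning rate can be seen on a toy problem. This sketch runs plain gradient descent on f(x) = 5x², a function chosen purely for illustration; the learning rate at which divergence starts depends on the curvature of the loss, so the specific values do not transfer to real models.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimize f(x) = 5 * x**2 (gradient 10 * x) and return the final x."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * 10.0 * x
    return x

too_large = gradient_descent(0.5)     # each step multiplies x by -4: diverges
reasonable = gradient_descent(0.05)   # each step halves x: converges quickly
too_small = gradient_descent(0.0001)  # each step multiplies x by 0.999: slow
```

After 50 steps the large rate has blown up, the moderate rate is essentially at the minimum, and the tiny rate has barely moved from the starting point.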

Unreasonably high accuracy

If there is severe class imbalance, accuracy may seem unreasonably high. Use a more robust metric such as the F-beta score. You can also plot the class distribution to visualize the imbalance, and change how the classes are sampled.
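
A small worked example shows how misleading accuracy can be. The labels below are made up: 95 negatives and 5 positives, scored against a "model" that always predicts the negative class. The accuracy and f_beta helpers are written out by hand for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f_beta(y_true, y_pred, beta=1.0):
    """F-beta for the positive class (label 1); 0.0 when undefined."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denominator = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denominator if denominator else 0.0

y_true = [0] * 95 + [1] * 5    # severe imbalance: 5% positives
always_negative = [0] * 100    # predicts the majority class every time

acc = accuracy(y_true, always_negative)
score = f_beta(y_true, always_negative, beta=2.0)
```

Accuracy comes out at 0.95 even though not a single positive example was found, while the F-beta score is 0.0.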

Speed up training

Use a GPU for training. Use batches, which also let the compute be parallelized where possible. In data cleaning, avoid explicit loops and apply; prefer vectorized operations. Scale and normalize features. Some models are faster than others. Train on a subset first.
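
As a sketch of the loop-versus-vectorized point, the example below standardizes a synthetic NumPy array both ways and then splits it into batches. The data, the batch size of 256, and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))

# Loop version: standardizes one value at a time (slow in pure Python).
means, stds = data.mean(axis=0), data.std(axis=0)
scaled_loop = np.empty_like(data)
for row in range(data.shape[0]):
    for col in range(data.shape[1]):
        scaled_loop[row, col] = (data[row, col] - means[col]) / stds[col]

# Vectorized version: one broadcasted expression over the whole array.
scaled_vectorized = (data - means) / stds

# Batches: split the rows into chunks of 256 for batched processing.
batch_size = 256
batches = [scaled_vectorized[i:i + batch_size]
           for i in range(0, len(scaled_vectorized), batch_size)]
```

Both versions produce the same values; the vectorized one does the work in compiled code, and the gap grows with the size of the data.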