How nbdev helps us structure our data science workflow in Jupyter Notebooks


Why do we use it?

As data scientists, much of our work involves (as you might expect) data. This means loading data, transforming data, combining data, and at some point actually using that data. Especially in the transforming and combining stages, it is critical that no mistakes slip in. If you are training a neural network for semantic segmentation but your segmentation map is shifted by a few pixels, your data is essentially invalid. Worse yet, valuable time is often lost trying to get code or machine learning models to work when the real culprit is a minor typo: a small, stupid, but bothersome bug. Because notebooks are run iteratively, cell by cell, it is almost as if you are debugging while coding, and errors are caught much more quickly. Here at 20tree.ai, we mostly work with georeferenced (satellite) data. When different parts of your data are projected in different coordinate reference systems, it (again) becomes very easy for mistakes to slip in.

When developing new code, a pretty standard pattern for us consists of the following:

  1. Small functions are written in a Jupyter notebook. The notebook is used to visually inspect the output and to informally test that the code behaves as expected;
  2. The functions get copy-pasted into a proper codebase;
  3. The original notebooks are scattered to the winds;
  4. Code gets changed over time, and maybe a mistake slips in. When someone asks for more details on a bit of code, they are pointed to some file called Untitled_v3_better_labels2.ipynb with the comment “it’s probably very outdated though”.

With nbdev, we can nip this whole sequence in the bud. You write your code in a Jupyter notebook, and that’s it. You’re done, because the notebook is the proper codebase! The marked code cells are exported to the library, while the output of other cells serves both as visual explanation and as unit tests (which you would otherwise generally have to write separately) that run automatically when the notebook is pushed to GitHub. So while you code iteratively, documentation and testing come (almost) for free. For example, after implementing data augmentation, you will surely visualize the outputs to ensure that the data looks as you expect and, just as importantly, that the labels are transformed in the same way. With nbdev, this visualization simply lives in your codebase, right below the function definition:
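A minimal sketch of what such a notebook could look like, using nbdev’s #export marker to flag cells that belong in the library (the flip_pair function and the dummy arrays are hypothetical stand-ins for a real augmentation):

```python
# --- Cell 1: exported to the library by nbdev ---
#export
import numpy as np

def flip_pair(image, mask):
    "Horizontally flip an image together with its segmentation mask."
    return np.fliplr(image), np.fliplr(mask)

# --- Cell 2: stays in the notebook, doubling as documentation and unit test ---
import matplotlib.pyplot as plt

image = np.random.rand(64, 64, 3)   # dummy RGB image
mask = np.zeros((64, 64))
mask[:, :32] = 1                    # dummy segmentation map

flipped_image, flipped_mask = flip_pair(image, mask)

# The label must undergo exactly the same transformation as the image;
# nbdev runs this assert automatically as part of the test suite.
assert np.array_equal(flipped_mask, np.fliplr(mask))

# Visual inspection, rendered right below the function definition:
fig, axes = plt.subplots(1, 2)
axes[0].imshow(flipped_image)
axes[0].set_title("augmented image")
axes[1].imshow(flipped_mask)
axes[1].set_title("augmented mask")
plt.show()
```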

These tests can also very easily be hooked into your GitHub continuous integration pipeline, making it easy to run proper checks before merging new code into your existing codebase.
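As an illustration, a trimmed-down GitHub Actions workflow in the spirit of the nbdev project template might look like the sketch below (the file name and Python version are placeholders; nbdev_test_nbs is nbdev’s command for running all notebook tests):

```yaml
# .github/workflows/main.yml (hypothetical minimal example)
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.8'
      - name: Install the library
        run: |
          pip install nbdev jupyter
          pip install -e .
      - name: Run notebook tests
        run: nbdev_test_nbs
```

If an assert in any notebook fails, the check fails, flagging the pull request before it gets merged.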

The fact that your entire codebase lives in notebooks also means that when something is not working as expected, it is very easy and intuitive to debug: you can quickly change something and see how it affects the output.