Data Pipelines in Fastai (v1)

Original article was published on Deep Learning on Medium


One of the most useful features that fastai brings to the table is its data block API. I have been part of many machine learning projects, and one common theme across them has been the large amount of time I spend building pipelines that work consistently across training and validation sets.

I must admit that the first time I looked through the fastai data block API, I was a little confused. It took me some time to grasp the framework, going through the docs, forums and blogs. Once I was past the hump, I could see why it is built the way it is and the value it brings: ease of use and flexibility to customize. While many people find it intuitive to read and follow code, it has always been easier for me to visualize things as flow or block diagrams. Diagrams let me first look at things from a bird's-eye view, understand the overall objective, and then go deep in specific directions. I looked for blogs, articles and docs with a few diagrams to get started quickly, but could not find any. In this article, I hope to illustrate the API with some quick flow diagrams that paint the larger picture and help folks like me navigate the framework faster.

Note: While it is not necessary to understand object-oriented programming (OOP) in Python in depth, a basic understanding of OOP concepts goes a long way, especially when you need to dig through the PyTorch and fastai codebases. There are plenty of YouTube videos that walk you through the concepts. I found the video playlist by Corey Schafer quite helpful in brushing up on my OOP concepts: Link
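
As a quick refresher, here is a minimal sketch of the OOP ideas that tend to come up when reading the fastai source: inheritance, overriding a single method, and methods that return a new object so that calls can be chained. The class names `ToyItemList` and `ToyTextList` are made up for illustration and are not part of fastai; the real classes are considerably richer.

```python
class ToyItemList:
    """Hypothetical toy container of items (not a fastai class)."""

    def __init__(self, items):
        self.items = list(items)

    def get(self, i):
        # Base behaviour: return the raw item unchanged.
        return self.items[i]

    def filter_by(self, keep):
        # Return a new object of the *same* class, so calls can be chained
        # and subclasses keep their type after filtering.
        return self.__class__([x for x in self.items if keep(x)])

    def __len__(self):
        return len(self.items)


class ToyTextList(ToyItemList):
    """Subclass that overrides only `get`; everything else is inherited."""

    def get(self, i):
        # Reuse the parent's lookup via super(), then customize the result.
        return super().get(i).upper()


# Chained call: construct, then filter; the result is still a ToyTextList.
texts = ToyTextList(["cat", "dog", "bird"]).filter_by(lambda s: len(s) == 3)
print(type(texts).__name__, len(texts), texts.get(0))
```

If this pattern of a base class with a small overridable method and chainable calls feels comfortable, following the flow of a chained pipeline definition should be much easier.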

Most ML pipelines consist of the following steps: