Notes on Deep Learning — Data Loader

Source: Deep Learning on Medium


This is the eighth part of a 34-part series, ‘notes on deep learning’. Please find links to all parts in the first article.


Deep learning is a success because of big data.
When it comes to machine learning or deep learning, 80% or more of the effort is spent wrangling data.
The effort a machine learning engineer or data scientist spends just getting data into the right format, preparing it, and cleaning it of abnormalities is immense. This wrangling is neither trivial nor skippable, and it is one of the most important parts of the work. In fact, wrangling deserves special attention because it affects performance the most, and this stage is usually revisited and revised several times.

Deep learning tolerates noise and can even learn better with it, but there is no denying that one still has to prepare the data. Loading data is one of the inevitable tasks, and PyTorch helps us load and preprocess even non-trivial datasets.

It also makes the code look nicer^^ so why not do it better :)

Until this point in the series, we skillfully iterated over our data with a nice hand-written while loop. This was simple and efficient, but we could do a lot more than a plain iteration over our data. In particular, we could

  • Create batches
  • Shuffle the data
  • Load data in parallel using multiprocessing
  • Access easy-to-use iteration functions so we don’t have to worry about the details
  • Follow common, state-of-the-art conventions across all the programs we develop

All these features are easily accessible if we use torch.utils.data.DataLoader
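As a rough sketch of what that buys us (the tensor shapes, batch size, and worker count below are illustrative assumptions, not taken from the notebook), a DataLoader can replace the hand-written loop like this:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy data: 100 samples with 3 features each and a binary target
# (the shapes here are made up purely for illustration).
features = torch.randn(100, 3)
labels = torch.randint(0, 2, (100,))

# TensorDataset wraps the tensors so the loader can index them.
dataset = TensorDataset(features, labels)

if __name__ == "__main__":
    # DataLoader gives us batching, shuffling and parallel loading
    # (num_workers > 0) in place of the hand-written while loop.
    loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=2)

    for batch_features, batch_labels in loader:
        # Each iteration yields one shuffled mini-batch.
        print(batch_features.shape, batch_labels.shape)
```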

A dataset
A dataset is your data. In PyTorch it is represented abstractly by a class with two functions:

  • __len__, so that len(dataset) returns the size of the dataset.
  • __getitem__, to support indexing so that dataset[i] can be used to get the i-th sample.

If we need to write our own custom dataset, we need to implement the two methods above, as in the sketch below.
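Here is a minimal sketch of such a custom dataset; the CSVDataset name, the file path, and the assumption that the CSV has a header row and keeps the target in its last column are hypothetical choices for illustration, not the notebook's actual code.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class CSVDataset(Dataset):
    """Numeric CSV file where the last column is assumed to be the target."""

    def __init__(self, csv_path):
        # skiprows=1 assumes a single header row in the file.
        data = np.loadtxt(csv_path, delimiter=",", skiprows=1, dtype=np.float32)
        self.x = torch.from_numpy(data[:, :-1])  # features
        self.y = torch.from_numpy(data[:, -1])   # targets

    def __len__(self):
        # len(dataset) returns the number of samples.
        return len(self.x)

    def __getitem__(self, i):
        # dataset[i] returns the i-th (features, target) pair.
        return self.x[i], self.y[i]

# The custom dataset plugs straight into DataLoader, e.g.:
# loader = DataLoader(CSVDataset("data.csv"), batch_size=32, shuffle=True)
```

Because the class only defines __len__ and __getitem__, DataLoader can batch, shuffle, and load it in parallel exactly like any built-in dataset.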


Let’s jump into the notebook for building our custom data loader …


About the Author

I am Venali Sonone, a data scientist by profession and also a management graduate.



Motivation

This series is inspired by failures.
Whether you want to talk about the next 5 years or the next 50, the latter indeed requires something challenging enough to keep the spark in your eyes.