TF.data reborn from the ashes

Source: Deep Learning on Medium


In this article we are going to take a look at a brand new way of creating datasets using TensorFlow 2.0.


TensorFlow (TF) has opened up whole new ways of creating datasets that were previously unavailable, and on top of that TF 2.0 adds its own flavour to the mix.

Drawing inspiration from its competitors, the TF team has taken dataset creation by subclassing to a whole new level, where anyone can share their dataset with everything set up for training, just like a Lego brick (‘plug and play’).

TensorFlow data v.FINALLY YOU GOT IT RIGHT!!!

TensorFlow has been maturing over the years, and with the tf.data API the TF team brought two amazing new abstractions to TF:

  • tf.data.Dataset, which represents a sequence of elements (rows), in which each element contains one or more tensor objects (for example, an image stored as a 3D tensor together with its label).
  • tf.data.Iterator, which provides the main way to extract elements from a dataset. The operation returned by Iterator.get_next() yields the next element of a Dataset when executed, and typically acts as the interface between input pipeline code and your model. The simplest iterator is a “one-shot iterator”, which is associated with a particular Dataset and iterates through it once. For more sophisticated uses, the Iterator.initializer operation enables you to reinitialize and parameterize an iterator with different datasets, so that you can, for example, iterate over training and validation data multiple times in the same program.
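To make the two abstractions concrete, here is a minimal sketch, assuming TF 2.x with eager execution, where a Dataset can be consumed as a plain Python iterable instead of through explicit get_next() operations. The toy tensors and the names images and labels are placeholders for illustration, not part of any real dataset.

```python
import tensorflow as tf

# Toy stand-ins: 4 "images" of shape 2x2 and 4 integer labels.
images = tf.zeros([4, 2, 2])
labels = tf.constant([0, 1, 0, 1])

# Each dataset element is an (image, label) pair of tensors.
dataset = tf.data.Dataset.from_tensor_slices((images, labels))

# Pull a single element, similar in spirit to Iterator.get_next().
iterator = iter(dataset)
first_image, first_label = next(iterator)

# Every new loop over the dataset starts a fresh pass from the beginning.
seen_labels = [int(label) for _, label in dataset]
print(first_image.shape, int(first_label), seen_labels)
```

Because each loop creates its own iterator, going over training and validation data several times in the same program only takes one loop per pass.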

Inner workings

Some of the basic mechanics of the old tf.data API still apply, such as how to create Dataset and iterator objects, how to extract data from them, and how to feed it to your model.

To start an input pipeline, you must define a source. For example, to construct a Dataset from some tensors in memory, you can use tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). Alternatively, if your input data is on disk in the recommended TFRecord format, you can construct a tf.data.TFRecordDataset.
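The difference between the two in-memory constructors is easy to miss, so here is a small sketch of it, assuming TF 2.x. The tensor named data is just a toy value chosen for illustration.

```python
import tensorflow as tf

data = tf.constant([[1, 2], [3, 4], [5, 6]])  # shape (3, 2)

# from_tensors: the whole tensor becomes ONE element of shape (3, 2).
whole = tf.data.Dataset.from_tensors(data)

# from_tensor_slices: slices along the first axis, giving THREE elements of shape (2,).
sliced = tf.data.Dataset.from_tensor_slices(data)

whole_shapes = [tuple(x.shape) for x in whole]
sliced_values = [x.numpy().tolist() for x in sliced]
print(whole_shapes)
print(sliced_values)
```

For a set of images and labels you would normally want from_tensor_slices, so that each example becomes its own element.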

A TFRecord file stores your data as a sequence of binary strings. Working with large datasets can be hard, and using a binary file format to store your data can have a significant impact on the performance of your input pipeline and, as a consequence, on the training time of your model. Binary data takes up less space on disk, takes less time to copy, and can be read much more efficiently from disk. This is especially true if your data is stored on spinning disks, which have much lower read/write performance than SSDs.

However, pure performance isn’t the only advantage of the TFRecord file format. It is optimized for use with TensorFlow in multiple ways.
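A TFRecord round trip can be sketched as follows, assuming TF 2.x. The file name toy.tfrecord and the feature names value and label are made up for this example. Each record is a serialized tf.train.Example, and a feature spec tells TensorFlow how to decode it again.

```python
import os
import tempfile

import tensorflow as tf

path = os.path.join(tempfile.mkdtemp(), "toy.tfrecord")

# Write three records; each one is a serialized tf.train.Example protobuf.
with tf.io.TFRecordWriter(path) as writer:
    for value, label in [(0.5, 0), (1.5, 1), (2.5, 0)]:
        example = tf.train.Example(features=tf.train.Features(feature={
            "value": tf.train.Feature(float_list=tf.train.FloatList(value=[value])),
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        }))
        writer.write(example.SerializeToString())

# Describe how to decode each binary string back into tensors.
feature_spec = {
    "value": tf.io.FixedLenFeature([], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse(serialized):
    return tf.io.parse_single_example(serialized, feature_spec)

# Read the file back as a dataset of parsed feature dictionaries.
dataset = tf.data.TFRecordDataset(path).map(parse)
records = [(float(r["value"]), int(r["label"])) for r in dataset]
print(records)
```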

After creating the dataset you can use tf.data transformations such as:

  • .map(): to apply a function to each element.
  • .shuffle(): to shuffle the order of the dataset.
  • .batch(): to create batches of data. Say you have 1000 images in total: instead of loading and showing all of them to your model at once, you can show it, for example, 64 images at a time.
  • .repeat(): to repeat the dataset, for example once per epoch.
  • .prefetch(): to fetch the next batch in the background while the current batch is being processed.
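Chained together, the transformations above might look like the sketch below, assuming TF 2.x. The 1000 "images" are stand-in integers from Dataset.range, and the batch size of 64 mirrors the example in the list.

```python
import tensorflow as tf

# 1000 toy "images" (plain integers used as a stand-in).
dataset = tf.data.Dataset.range(1000)

pipeline = (
    dataset
    .map(lambda x: x * 2)               # apply a function to every element
    .shuffle(buffer_size=1000, seed=0)  # randomize the element order
    .batch(64)                          # group elements into batches of 64
    .repeat(2)                          # go over the data twice (2 epochs)
    .prefetch(1)                        # prepare the next batch while the current one is used
)

batch_sizes = [int(batch.shape[0]) for batch in pipeline]
print(len(batch_sizes), batch_sizes[0], batch_sizes[15])
```

Since 1000 is not a multiple of 64, each epoch ends with a smaller batch of 40 elements. Placing repeat after batch keeps the epochs separate; placing it before batch would let batches straddle epoch boundaries.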

Datasets v1.0.1

And now, the long-awaited subclassing module for tf.data.

Datasets are distributed in all kinds of formats and in all kinds of places, and they’re not always stored in a format that’s ready to feed into a machine learning pipeline.

With Datasets v1.0.1, TF 2.0 brings a big surprise to the developer community. It now has a growing collection of datasets ready to use with TensorFlow. It handles downloading and preparing the data and constructing a tf.data.Dataset.

You can list all available datasets in the collection with the following lines of code:

import tensorflow_datasets as tfds
print(tfds.list_builders())

Now you can load a pre-made dataset with a single line of code:

mnist_train = tfds.load(name="mnist", split=tfds.Split.TRAIN)

Passing download=True makes tfds.load download and prepare the data if it is not already on disk; in recent tensorflow_datasets releases this is the default.

All tfds datasets contain feature dictionaries mapping feature names to Tensor values. A typical dataset, like MNIST, will have 2 keys: "image" and "label".

From here, all the methods from the previous sections, such as batch and shuffle, are applicable to this dataset.
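To show what working with feature dictionaries looks like without downloading anything, here is a sketch that uses an in-memory stand-in instead of the real MNIST data. The name fake_mnist and its blank images are assumptions made for illustration; a dataset returned by tfds.load yields dictionaries with the same kind of keys.

```python
import tensorflow as tf

# Stand-in for an MNIST-like dataset: 10 blank 28x28x1 images with labels 0..9.
fake_mnist = tf.data.Dataset.from_tensor_slices({
    "image": tf.zeros([10, 28, 28, 1], dtype=tf.uint8),
    "label": tf.range(10, dtype=tf.int64),
})

# The usual tf.data transformations apply to dictionary elements too.
batched = fake_mnist.shuffle(10).batch(4)

# Each batch is a dictionary keyed by feature name.
for features in batched.take(1):
    image_batch, label_batch = features["image"], features["label"]
    print(image_batch.shape, label_batch.shape)
```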

Furthermore, TFDS (TensorFlow Datasets) provides a way to transform all those datasets into a standard format, does the preprocessing necessary to make them ready for a machine learning pipeline, and provides a standard input pipeline using tf.data.

To enable this, each dataset implements a subclass of DatasetBuilder, which specifies:

  • Where the data is coming from (i.e. its URL);
  • What the dataset looks like (i.e. its features);
  • How the data should be split (e.g. TRAIN and TEST);
  • and the individual records in the dataset.

The first time a dataset is used, the dataset is downloaded, prepared, and written to disk in a standard format. Subsequent access will read from those pre-processed files directly.

Adding a Dataset

This is the most exciting announcement, because it gives you the power to contribute directly to the TF deep learning community by adding your own dataset to the collection.

Each dataset is defined as a subclass of tfds.core.DatasetBuilder implementing the following methods:

  • _info: builds the DatasetInfo object describing the dataset
  • _download_and_prepare: to download and serialize the source data to disk
  • _as_dataset: to produce a tf.data.Dataset from the serialized data

Most datasets subclass tfds.core.GeneratorBasedBuilder, which is a subclass of tfds.core.DatasetBuilder that simplifies defining a dataset. It works well for datasets that can be generated on a single machine. Its subclasses implement:

  • _info: builds the DatasetInfo object describing the dataset
  • _split_generators: downloads the source data and defines the dataset splits
  • _generate_examples: yields examples in the dataset from the source data

In this article we will use DatasetBuilder in an ambitious attempt to make a demo.

Manual download and extraction

For source data that cannot be downloaded automatically (for example, because it requires a login), the user must manually download the source data and place it in manual_dir, which the builder can access with dl_manager.manual_dir (defaults to ~/tensorflow_datasets/manual/my_dataset).


Although there is no documentation for Subclassing DatasetBuilder, I think I managed to give you an idea of how it can be done.

I wish there was better documentation for the core modules. Still, you can now go out there, build your own dataset, and share it with the world. This is a key factor for development, and it is something that sets TF 2.0 apart from the competition. GREAT WORK TF TEAM!!!

Summary

Things to remember:

  • With Datasets v1.0.1, TensorFlow now has a built-in loader that downloads the data, saves it to disk, and prepares it in the right format for you.
  • By subclassing DatasetBuilder or GeneratorBasedBuilder you can create your own custom dataset and submit it to TensorFlow’s datasets collection, where it is available to everyone.

This article is part of a series in which I will share the highlights of the TF Dev Summit ’19, plus a surprise spin-off article about a new AI race.


Thank you for reading. If you have any thoughts, comments or criticism, please comment down below.

Follow me on twitter at Prince Canuma, so you can always be up to date with the AI field.

If you like it and relate to it, please give me a round of applause 👏👏 👏(+50) and share it with your friends.