TensorFlow 2: Build an image input pipeline with the new Dataset API

Many changes have been made in TF 2.0. TensorFlow 2.0 removes redundant APIs and integrates more tightly with the Python runtime through eager execution.

TensorFlow’s eager execution is a programming environment that evaluates operations immediately, without building graphs: operations return concrete values instead of constructing a computational graph to run later. This is a big departure from the tf.Session workflow of TF 1.x.
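For instance, a simple operation returns a concrete tensor right away (a minimal illustration, not from the original post):

import tensorflow as tf

x = tf.constant([[1., 2.], [3., 4.]])
y = tf.matmul(x, x)   # evaluated immediately, no session needed
print(y.numpy())      # [[ 7. 10.] [15. 22.]]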

When to use tf.data?

When iterating over training data that fits in memory, feel free to use regular Python iteration. Otherwise, tf.data.Dataset is the best way to stream training data from disk. Datasets are iterables (not iterators) and work just like other Python iterables in eager mode. You can fully utilize the dataset’s async prefetching/streaming, as we will see below.
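As a quick illustration (not from the original post), you can loop over a Dataset directly, just like any other Python iterable:

import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4]).batch(2)
for batch in ds:           # Datasets are iterable in eager mode
    print(batch.numpy())   # [1 2] then [3 4]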

Step 1: Preparation

We are going to create an input pipeline to feed our model.

First of all, we need to create two lists: one for image paths and one for the corresponding labels.

all_train_paths = LIST_OF_IMAGES_PATHS
all_train_labels = LIST_OF_CORRESPONDING_LABELS
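If, for example, your images sit in class-named subfolders, these lists could be built with pathlib (a hedged sketch; the “images_root” directory layout is an assumption, not part of the original post):

import pathlib

data_root = pathlib.Path("images_root")   # hypothetical layout: images_root/<label_name>/<image>.png
all_train_paths = sorted(str(p) for p in data_root.glob("*/*.png"))
label_names = sorted(d.name for d in data_root.iterdir() if d.is_dir())
label_to_index = {name: index for index, name in enumerate(label_names)}
all_train_labels = [label_to_index[pathlib.Path(p).parent.name] for p in all_train_paths]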

Step 2: Preprocessing

def preprocess_image(image):
    image = tf.image.decode_png(image, channels=3)   # decode raw bytes into an RGB tensor
    image = tf.image.resize(image, [HEIGHT, WIDTH])   # resize to the model's input size
    image /= 255.0                                    # normalize pixel values to [0, 1]
    return image

def load_and_preprocess_image(path):
    image = tf.io.read_file(path)                     # read the raw file contents from disk
    return preprocess_image(image)

These functions will be used to import images. The “preprocess_image” function applies the different preprocessing steps to the image. You can find more methods in the tf.image docs.
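For instance, tf.image also provides simple augmentation ops; random flips and brightness changes could be added to the training-side preprocessing (an optional sketch, not part of the original pipeline):

def augment_image(image):
    image = tf.image.random_flip_left_right(image)             # random horizontal flip
    image = tf.image.random_brightness(image, max_delta=0.1)   # small random brightness shift
    return tf.clip_by_value(image, 0.0, 1.0)                   # keep pixel values in [0, 1]

Such a function could then be mapped onto the training dataset only, after “load_and_preprocess_image”.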

Step 3: Pipeline building

In this part, we build an input pipeline that applies the preprocessing function to “all_train_paths” and creates an iterable (image, label) dataset.

path_ds = tf.data.Dataset.from_tensor_slices(all_train_paths)
image_ds = path_ds.map(load_and_preprocess_image)
label_ds = tf.data.Dataset.from_tensor_slices(all_train_labels)

#create (image, label) zip to iterate over
data_label_ds = tf.data.Dataset.zip((image_ds, label_ds))

#Generate a validation set
VAL_COUNT = SIZE_OF_VALIDATION_SET
val_label_ds = data_label_ds.take(VAL_COUNT)
train_label_ds = data_label_ds.skip(VAL_COUNT)
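As a small, optional performance tweak (not in the original code), the map call can decode several images in parallel:

image_ds = path_ds.map(load_and_preprocess_image,
                       num_parallel_calls=tf.data.experimental.AUTOTUNE)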

Finally, we create a training and a validation data producer:

AUTOTUNE = tf.data.experimental.AUTOTUNE

#training data producer
tds = train_label_ds.shuffle(VAL_COUNT)    # shuffle with a buffer of VAL_COUNT elements
tds = tds.repeat()                         # repeat indefinitely; steps_per_epoch bounds each epoch
tds = tds.batch(BATCH_SIZE)
tds = tds.prefetch(buffer_size=AUTOTUNE)   # overlap preprocessing and model execution
#tds = tds.cache(filename='./save/')

#validation data producer
vds = val_label_ds.shuffle(VAL_COUNT)
vds = vds.repeat()
vds = vds.batch(BATCH_SIZE)
vds = vds.prefetch(buffer_size=AUTOTUNE)
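A note on the commented-out cache line: cache() stores the preprocessed elements either in memory (no argument) or in the given file on disk, so decoding and resizing only run during the first pass over the data. One possible variant (an optional sketch, not part of the original code) is to cache right after the zip, before shuffling:

tds = train_label_ds.cache()   # keep preprocessed (image, label) pairs in memory
tds = tds.shuffle(VAL_COUNT).repeat().batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)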

And that’s it!

DEMO

Here is a basic example to test the pipeline:

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3)),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(NUMBER_OF_LABELS, activation='softmax')])

model.compile(optimizer=tf.optimizers.Adam(),
              loss=tf.keras.losses.sparse_categorical_crossentropy,
              metrics=["accuracy"])

TRAIN_COUNT = len(all_train_paths) - VAL_COUNT
BATCH_SIZE = 128   # must be defined before building the producers in Step 3

steps_per_epoch = TRAIN_COUNT//BATCH_SIZE
steps_per_validation = VAL_COUNT//BATCH_SIZE

model.fit(tds,
          epochs=EPOCHS,
          steps_per_epoch=steps_per_epoch,
          validation_data=vds,
          validation_steps=steps_per_validation)
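Before launching a full training run, it can help to pull a single batch from the pipeline to check the shapes (a quick sanity check, using the BATCH_SIZE, HEIGHT and WIDTH placeholders from above):

images, labels = next(iter(tds))
print(images.shape)   # (BATCH_SIZE, HEIGHT, WIDTH, 3)
print(labels.shape)   # (BATCH_SIZE,)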

Hope this helps!