Source: Deep Learning on Medium
TensorFlow 2: Build an image input pipeline with the new Dataset API
TensorFlow’s eager execution is a programming environment that evaluates operations immediately, without building graphs: operations return concrete values instead of constructing a computational graph to run later. This is a big change from the tf.Session-based workflow of TensorFlow 1.x.
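To see the difference, here is a minimal sketch (assuming TensorFlow 2.x, where eager execution is enabled by default): the result of an operation is available immediately, with no session or graph-building step.

```python
import tensorflow as tf  # TF 2.x: eager execution is on by default

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.matmul(x, x)  # evaluated immediately; y holds a concrete value
print(y.numpy())  # [[ 7. 10.] [15. 22.]]
```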
When to use tf.data?
When iterating over training data that fits in memory, feel free to use regular Python iteration. Otherwise, tf.data.Dataset is the best way to stream training data from disk. Datasets are iterables (not iterators), and in eager mode they work just like any other Python iterable. You can take full advantage of asynchronous prefetching and streaming, as we will see below.
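A quick sketch of what "Datasets are iterables" means in practice: in eager mode you can loop over a Dataset directly with a plain Python for loop.

```python
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([10, 20, 30])
for item in ds:  # a Dataset is a plain Python iterable in eager mode
    print(item.numpy())  # 10, then 20, then 30
```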
Step 1: Preparation
We are going to create an input pipeline to feed our model.
First of all, we need to create two lists: one for image paths and one for the corresponding labels.
all_train_paths = LIST_OF_IMAGES_PATHS
all_train_labels = LIST_OF_CORRESPONDING_LABELS
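One common way to build these two lists, assuming a hypothetical layout with one subdirectory per class (the directory names and paths below are illustrative, not from the original article):

```python
import pathlib

# Hypothetical layout: one subdirectory per class, e.g. data/train/cats/1.png
all_train_paths = [
    'data/train/cats/1.png',
    'data/train/dogs/2.png',
    'data/train/cats/3.png',
]
# Map each class name (the parent directory) to an integer label
label_names = sorted({pathlib.Path(p).parent.name for p in all_train_paths})
label_to_index = {name: i for i, name in enumerate(label_names)}
all_train_labels = [label_to_index[pathlib.Path(p).parent.name] for p in all_train_paths]
print(all_train_labels)  # [0, 1, 0] with cats=0, dogs=1
```

In a real project you would collect `all_train_paths` with something like `pathlib.Path(...).glob('*/*.png')` instead of a hand-written list.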
Step 2: Preprocessing
def preprocess_image(image):
    image = tf.image.decode_png(image, channels=3)
    image = tf.image.resize(image, [HEIGHT, WIDTH])
    image /= 255.0  # normalize to [0, 1]
    return image

def load_and_preprocess_image(path):
    image = tf.io.read_file(path)
    return preprocess_image(image)
These functions are used to import images: “load_and_preprocess_image” reads a file from disk, and “preprocess_image” applies the different preprocessing steps (decode, resize, normalize). You can find more preprocessing methods in the tf.image documentation.
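Beyond decoding and resizing, tf.image also offers data-augmentation operations. A minimal sketch of an extra augmentation step you could chain after preprocessing (the function name and parameter values are illustrative):

```python
import tensorflow as tf

def augment_image(image):
    # Illustrative augmentations from tf.image; pick what suits your data
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    # Brightness shifts can leave [0, 1], so clip back into range
    return tf.clip_by_value(image, 0.0, 1.0)

augmented = augment_image(tf.zeros([64, 64, 3]))
print(augmented.shape)  # (64, 64, 3)
```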
Step 3: Pipeline building
In this part we build an input pipeline that applies the preprocessing function to “all_train_paths” and creates an iterable (image, label) zip.
path_ds = tf.data.Dataset.from_tensor_slices(all_train_paths)
image_ds = path_ds.map(load_and_preprocess_image)
label_ds = tf.data.Dataset.from_tensor_slices(all_train_labels)
#create (image, label) zip to iterate over
data_label_ds = tf.data.Dataset.zip((image_ds, label_ds))
#Generate a validation set
VAL_COUNT = SIZE_OF_VALIDATION_SET
val_label_ds = data_label_ds.take(VAL_COUNT)
train_label_ds = data_label_ds.skip(VAL_COUNT)
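The zip/take/skip combination can be sketched on a toy dataset (integers standing in for images and labels): zip pairs the two datasets element-wise, take keeps the first VAL_COUNT pairs for validation, and skip yields the rest for training. Note that take/skip split in file order, so this only gives a representative validation set if the paths were shuffled beforehand.

```python
import tensorflow as tf

images = tf.data.Dataset.range(10)       # stand-ins for the image dataset
labels = tf.data.Dataset.range(10, 20)   # stand-ins for the label dataset
pairs = tf.data.Dataset.zip((images, labels))
val = pairs.take(3)    # first 3 (image, label) pairs
train = pairs.skip(3)  # the remaining 7 pairs
print(list(val.as_numpy_iterator()))  # [(0, 10), (1, 11), (2, 12)]
```

(`as_numpy_iterator` is available from TF 2.1 onward.)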
Finally, we create a training and a validation data producer:
#training data producer
tds = train_label_ds.shuffle(VAL_COUNT)
tds = tds.repeat()
tds = tds.batch(BATCH_SIZE)
tds = tds.prefetch(buffer_size=AUTOTUNE)
#tds = tds.cache(filename='./save/')
#validation data producer
vds = val_label_ds.shuffle(VAL_COUNT)
vds = vds.repeat()
vds = vds.batch(BATCH_SIZE)
vds = vds.prefetch(buffer_size=AUTOTUNE)
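One detail the snippets above leave implicit: AUTOTUNE is not predefined, it comes from tf.data.experimental.AUTOTUNE and lets tf.data pick the prefetch buffer size dynamically. A toy version of the same shuffle/repeat/batch/prefetch chain:

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE  # lets tf.data tune the prefetch buffer

ds = tf.data.Dataset.range(100)
ds = ds.shuffle(100).repeat().batch(32).prefetch(buffer_size=AUTOTUNE)
first_batch = next(iter(ds))
print(first_batch.shape)  # (32,)
```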
And that’s it!
Here is a basic example to test the pipeline:
BATCH_SIZE = 128
TRAIN_COUNT = len(all_train_paths) - VAL_COUNT
steps_per_epoch = TRAIN_COUNT // BATCH_SIZE
steps_per_validation = VAL_COUNT // BATCH_SIZE

# Illustrative model; replace the layers with your own architecture
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(HEIGHT, WIDTH, 3)),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(tds, epochs=EPOCHS, steps_per_epoch=steps_per_epoch,
          validation_data=vds, validation_steps=steps_per_validation)
Hope this helps!