How to Reduce Training Time for a Deep Learning Model using tf.data

Original article was published by Renu Khandelwal on Deep Learning on Medium



Learn to create an input pipeline for images that uses CPU and GPU resources efficiently to process an image dataset and reduce the training time of a deep learning model.

In this post, you will learn

  • How are the CPU and GPU resources used in a naive approach during model training?
  • How to use the CPU and GPU resources efficiently for data pre-processing and training?
  • Why use tf.data to build an efficient input pipeline?
  • How to build an efficient input data pipeline for images using tf.data?

How does a naive approach work for input data pipeline and model training?

When creating an input data pipeline, typically, we perform the ETL(Extract, Transform, and Load) process.

  • Extraction: read the data from a data source, which can be local, such as a hard disk, or remote, such as cloud storage.
  • Transformation: shuffle the data, create batches, and apply vectorization or image augmentation.
  • Loading: clean the data and shape it into a format that can be passed to the deep learning model for training.

The pre-processing of the data occurs on the CPU, and the model will be typically trained on GPU/TPU.

In a naive model training approach, CPU pre-processes the data to get it ready for the model to train, while the GPU/TPU is idle. When GPU/TPU starts training the model, the CPU is idle. This is not an efficient way to manage resources as shown below.

Naive data pre-processing and training approach

What are the options to expedite the training process?

To expedite the training, we need to optimize the data extraction, data transformation, and data loading process, all of which happens on the CPU.

Data Extraction: Optimize the data read from data sources

Data Transformation: Parallelize the data augmentation

Data Loading: Prefetch the data one step ahead of training

These techniques will efficiently utilize the CPU and GPU/TPU resources for data pre-processing and training.

How can we achieve the input pipeline optimization?

Optimizing Data Extraction

Data extraction is optimized by processing multiple files concurrently. tf.data.Dataset.interleave() optimizes the extraction process by interleaving the I/O operation that reads the files with the map() transformation that applies the data pre-processing.

Source:https://www.tensorflow.org/guide/data_performance#parallelizing_data_extraction

The number of datasets to overlap is specified by the cycle_length argument, while the level of parallelism is set by the num_parallel_calls argument. You can use AUTOTUNE to delegate the decision about the level of parallelism to tf.data.

num_parallel_calls spawns multiple threads, utilizing multiple cores on the machine to parallelize the data extraction across CPUs.
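As a minimal sketch of this pattern, here is interleave() on an in-memory dataset of shard indices standing in for real input files (the names and values are illustrative only):

```python
import tensorflow as tf

# Four "shards" (stand-ins for four input files); interleave cycles through
# them, pulling one element from each in turn rather than exhausting the
# first shard before touching the second.
shards = tf.data.Dataset.range(4)
dataset = shards.interleave(
    # With real files this function would open one, e.g.
    # lambda filename: tf.data.TFRecordDataset(filename)
    lambda i: tf.data.Dataset.from_tensors(i * 10).repeat(2),
    cycle_length=4,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)

print(list(dataset.as_numpy_iterator()))  # [0, 10, 20, 30, 0, 10, 20, 30]
```

Note the round-robin order of the output: one element from each of the four shards, then a second element from each, which is exactly the overlap that hides per-file I/O latency.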

How to know how many CPUs or cores to use?

You can find the number of cores on the machine and specify that, but a better option is to delegate the level of parallelism to tf.data using tf.data.experimental.AUTOTUNE.

  • AUTOTUNE will ask tf.data to dynamically tune the value at runtime.
  • tf.data will find the right CPU budget across all the tunable operations.
  • AUTOTUNE decides on the level of parallelism for buffer size, CPU Budget, and also for I/O operations.
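The two options above can be compared in a short sketch (the variable names are illustrative):

```python
import multiprocessing

import tensorflow as tf

# Option 1: you could look up the core count and hard-code it...
n_cores = multiprocessing.cpu_count()

# Option 2: ...but delegating the choice to tf.data at runtime is usually better.
AUTOTUNE = tf.data.experimental.AUTOTUNE
dataset = tf.data.Dataset.range(8).map(lambda x: x * 2,
                                       num_parallel_calls=AUTOTUNE)
print(sorted(dataset.as_numpy_iterator()))  # [0, 2, 4, 6, 8, 10, 12, 14]
```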

Parallelize Data Transformation

Image augmentation, part of pre-processing, happens on the CPU. Every augmentation, normalization, or rescaling of an image is a costly operation that slows down training.

What if you could run all these image operations in parallel, utilizing all the available cores?

tf.data.Dataset.map() takes a user-defined function containing all the image augmentations that you want to apply to the dataset.

map() has a num_parallel_calls parameter that spawns multiple threads to utilize multiple cores on the machine, parallelizing the pre-processing across CPUs.
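A small sketch of the idea, using a stand-in augmentation function (the function and data here are illustrative, not the article's exact pipeline):

```python
import tensorflow as tf

# A stand-in user-defined augmentation: flip and rescale each image.
def augment(image):
    image = tf.image.random_flip_left_right(image)
    return tf.cast(image, tf.float32) / 255.0

# Four dummy 8x8 RGB images in place of a real image dataset
images = tf.data.Dataset.from_tensor_slices(
    tf.zeros([4, 8, 8, 3], dtype=tf.uint8))

# num_parallel_calls lets several images be augmented concurrently
# on the available CPU cores.
dataset = images.map(augment,
                     num_parallel_calls=tf.data.experimental.AUTOTUNE)
```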

Caching the data

cache() allows data to be cached on a specified file or in memory.

  • When caching in memory, the data is cached the first time it is iterated over; all subsequent iterations read from the cache.
  • When caching to a file, the cache persists across runs, so even the first iteration of a later run reads from the cached file.
  • Caching produces the same elements in the same order on every iteration; apply shuffle() after cache() to randomize the order between iterations.
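The effect is easy to observe with a counter standing in for an expensive load step (the names here are illustrative):

```python
import tensorflow as tf

load_count = []

def expensive_load(x):
    # Stand-in for a costly read/pre-process step; records each invocation.
    load_count.append(1)
    return x

dataset = (tf.data.Dataset.range(3)
           .map(lambda x: tf.py_function(expensive_load, [x], tf.int64))
           .cache())  # in-memory; cache('some_path') would cache to a file

for _ in range(3):  # three "epochs"
    list(dataset.as_numpy_iterator())

print(len(load_count))  # 3: the expensive step ran once per element, not 9 times
```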

Prefetch the data by overlapping the data processing and training

The prefetching function in tf.data overlaps the data pre-processing and the model training. Data pre-processing runs one step ahead of the training, as shown below, which reduces the overall training time for the model.

Source:https://www.tensorflow.org/guide/data_performance#prefetching

The number of elements to prefetch should be equal to or greater than the number consumed by a single training step. We can use AUTOTUNE to let tf.data allocate the buffer size dynamically at runtime.
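In code, prefetching is a single call appended to the pipeline; a minimal sketch on an in-memory dataset:

```python
import tensorflow as tf

batch_size = 32
dataset = (tf.data.Dataset.range(128)
           .batch(batch_size)
           # While the model trains on batch N, tf.data prepares batch N+1;
           # AUTOTUNE sizes the prefetch buffer dynamically at runtime.
           .prefetch(tf.data.experimental.AUTOTUNE))

print(len(list(dataset)))  # 4 batches of 32
```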

All these operations: map, prefetch, interleave, batch, repeat, shuffle, and cache are part of tf.data, which lets you build

  • Fast, efficient data pipelines that use the computational resources, CPU, GPU, and TPU, effectively to fetch data from the data source.
  • Flexible pipelines that handle different data formats, such as text, images, and structured data with numeric and categorical fields.
  • Complex input data pipelines with ease, applying data augmentation, shuffling the dataset, and creating batches of data for training.
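Chained together, these operations form a complete pipeline. A toy sketch on an in-memory dataset (the numbers and buffer sizes are illustrative only):

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

# A toy end-to-end pipeline chaining the operations listed above
dataset = (tf.data.Dataset.range(64)
           .map(lambda x: x * 2, num_parallel_calls=AUTOTUNE)  # transform
           .cache()                                            # keep in memory
           .shuffle(buffer_size=64)                            # randomize order
           .batch(8)                                           # 8 examples/step
           .prefetch(AUTOTUNE))                                # overlap with training
```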

How to build a data pipeline for a custom image dataset using tf.data?

In this section, you will build a data input pipeline for the popular Cats and Dogs dataset from Kaggle.

Here we will use transfer learning with MobileNetV2 and TensorFlow 2.3.

Importing required libraries

import os
import pathlib
import multiprocessing
from glob import glob

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

# Let the GPU allocate memory as needed instead of reserving it all upfront
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)

Set the train, and val directories for the dataset

train_dir=r'\dogs-vs-cats\train_data'
val_dir=r'\dogs-vs-cats\validation_data'

Convert the files to the Dataset object

Use tf.data.Dataset.list_files() to return the filenames matching a glob pattern. Here we want all the files from the subfolders under train_dir and val_dir, so we specify the pattern '\\*\\*'.

train_files = tf.data.Dataset.list_files(str(train_dir + '\\*\\*'), shuffle=False)
val_files = tf.data.Dataset.list_files(str(val_dir + '\\*\\*'), shuffle=False)
#getting the number of files in train and val dataset
train_num_files=len([file for file in glob(str(train_dir + '\\*\\*'))])
val_num_files=len([file for file in glob(str(val_dir + '\\*\\*'))])
print("No. of files in Train folder: ",train_num_files)
print("No. of files in Val folder: ",val_num_files)

Pre-processing the training and validation dataset

Set the Parameters

epoch=10
batch_size = 32
img_height = 224
img_width = 224

Applying MobileNet V2’s pre-processing technique

#Get class names from the folders
class_names = np.array(sorted([dir1 for dir1 in os.listdir(train_dir)]))
print(class_names)

#To process the label
def get_label(file_path):
    # convert the path to a list of path components separated by sep
    parts = tf.strings.split(file_path, os.path.sep)
    # the second to last component is the class directory
    one_hot = parts[-2] == class_names
    # integer encode the label
    return tf.argmax(tf.cast(one_hot, tf.int32))

#To process the image
def decode_img(img):
    # convert the compressed string to a 3D uint8 tensor
    img = tf.image.decode_jpeg(img, channels=3)
    # resize the image to the desired size
    return tf.image.resize(img, [img_height, img_width])

def process_TL(file_path):
    label = get_label(file_path)
    # load the raw data from the file as a string
    img = tf.io.read_file(file_path)
    img = decode_img(img)
    img = preprocess_input(img)
    return img, label

Optimize the Data Extraction and Data Transformation process by Interleaving

interleave() parallelizes the data loading step by interleaving the I/O operation that reads the files with the map() that applies the data pre-processing to their contents.

#Interleaving the train dataset to read the files and apply the preprocessing
train_dataset = train_files.interleave(
    # each file path becomes a one-element dataset; with sharded record files
    # you would return a reader such as tf.data.TFRecordDataset(x) here
    lambda x: tf.data.Dataset.from_tensors(x),
    cycle_length=4
).map(process_TL, num_parallel_calls=tf.data.experimental.AUTOTUNE)

#Interleaving the val dataset to read the files and apply the preprocessing
val_dataset = val_files.interleave(
    lambda x: tf.data.Dataset.from_tensors(x),
    cycle_length=4
).map(process_TL, num_parallel_calls=tf.data.experimental.AUTOTUNE)

The number of datasets to overlap is set to 4 by the cycle_length argument. The level of parallelism is specified by num_parallel_calls, which is set to AUTOTUNE.

Load the dataset for training

Cache the dataset in memory

##Cache the dataset in-memory
train_dataset = train_dataset.cache()
val_dataset = val_dataset.cache()

train_dataset = train_dataset.repeat().shuffle(buffer_size=512).batch(batch_size)
val_dataset = val_dataset.batch(batch_size)

repeat() repeats the dataset so that training can run for multiple epochs without exhausting it.

shuffle() shuffles train_dataset using a buffer of size 512 from which random entries are picked.

batch() takes 32 consecutive entries, based on the batch size set, and makes a batch out of them.


Prefetch function in tf.data overlaps the data pre-processing and the model training

train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)
val_dataset = val_dataset.prefetch(tf.data.experimental.AUTOTUNE)

Create the data augmentation: flipping the image vertically and horizontally, rotating it, zooming, and adjusting the contrast.

data_augmentation = tf.keras.Sequential([
    tf.keras.layers.experimental.preprocessing.RandomFlip('horizontal'),
    tf.keras.layers.experimental.preprocessing.RandomFlip('vertical'),
    tf.keras.layers.experimental.preprocessing.RandomRotation(0.45),
    tf.keras.layers.experimental.preprocessing.RandomContrast(0.2),
    tf.keras.layers.experimental.preprocessing.RandomZoom(0.1),
])

Creating the Transfer Learned model by first applying data augmentation

def create_model():
    input_layer = tf.keras.layers.Input(shape=(224, 224, 3))
    x = data_augmentation(input_layer)
    base_model = tf.keras.applications.MobileNetV2(input_tensor=x,
                                                   weights='imagenet',
                                                   include_top=False)
    base_model.trainable = False
    x = tf.keras.layers.GlobalAveragePooling2D()(base_model.output)
    x = tf.keras.layers.Dense(2, activation='softmax')(x)

    model = tf.keras.models.Model(inputs=input_layer, outputs=x)
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = create_model()

Create a callback with an accuracy threshold: training continues until we reach a validation accuracy of 99.96% or the specified number of epochs completes.

class MyThresholdCallback(tf.keras.callbacks.Callback):
    def __init__(self, threshold):
        super(MyThresholdCallback, self).__init__()
        self.threshold = threshold

    def on_epoch_end(self, epoch, logs=None):
        val_acc = logs["val_accuracy"]
        if val_acc >= self.threshold:
            self.model.stop_training = True

my_callback = MyThresholdCallback(threshold=0.9996)

Fit the training dataset to the model

import time

start_time = time.perf_counter()
history_tfdata = model.fit(train_dataset,
                           steps_per_epoch=int(train_num_files / batch_size),
                           validation_data=val_dataset,
                           validation_steps=int(val_num_files / batch_size),
                           callbacks=[my_callback],
                           epochs=epoch)
print(time.perf_counter() - start_time)

If we train the dataset using ImageDataGenerator as shown below, we can compare the difference in the training time.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

#Creating the train ImageDataGenerator
image_gen_train = ImageDataGenerator(rescale=1./255,
                                     zoom_range=0.1,
                                     rotation_range=45,
                                     shear_range=0.1,
                                     horizontal_flip=True,
                                     vertical_flip=True)
train_data_gen = image_gen_train.flow_from_directory(batch_size=batch_size,
                                                     directory=train_dir,
                                                     shuffle=True,
                                                     target_size=(224, 224),
                                                     class_mode='sparse')
#Val data generator
image_gen_val = ImageDataGenerator(rescale=1./255)
val_data_gen = image_gen_val.flow_from_directory(batch_size=batch_size,
                                                 directory=val_dir,
                                                 target_size=(224, 224),
                                                 class_mode='sparse')
def create_model():
    input_layer = tf.keras.layers.Input(shape=(224, 224, 3))
    x = preprocess_input(input_layer)
    base_model = tf.keras.applications.MobileNetV2(input_tensor=x,
                                                   weights='imagenet',
                                                   include_top=False)
    base_model.trainable = False
    x = tf.keras.layers.GlobalAveragePooling2D()(base_model.output)
    x = tf.keras.layers.Dense(2, activation='softmax')(x)

    model = tf.keras.models.Model(inputs=input_layer, outputs=x)
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model_idg = create_model()
start_time2 = time.perf_counter()
history = model_idg.fit(train_data_gen,
                        steps_per_epoch=len(train_data_gen),
                        epochs=10,
                        validation_data=val_data_gen,
                        validation_steps=len(val_data_gen))
print(time.perf_counter() - start_time2)

Comparing the time to complete the training using tf.data input pipeline with the training time using ImageDataGenerator

You can see that the training using the tf.data input pipeline took 290.53 seconds, while training on the same data using ImageDataGenerator took 2594.89 seconds, a substantial gain in training time.

Code available here

Conclusion:

tf.data allows you to build efficient input data pipelines for different data formats by efficiently using computational resources like GPU, CPU, and TPU, thus reducing the training time.

References:

https://github.com/tensorflow/docs/blob/master/site/en/guide/data.ipynb