Speeding up PyTorch training by 10%

Training deep learning models can be time-consuming.

Training a common ResNet-50 model on ImageNet with a single GPU can take more than a week to complete. To save money and time, it is important to use the right configuration and parameters.

The num_workers and pin_memory arguments of the DataLoader can greatly affect data loading time (https://zhuanlan.zhihu.com/p/39752167). The question is: how do you easily find the optimal num_workers and pin_memory? I modified a script to help you test out the optimal parameters automatically.
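For reference, this is where the two arguments live on a DataLoader. The snippet below is my own minimal, standalone sketch using a dummy TensorDataset in place of ImageNet, just to show where the parameters go:

import torch
from torch.utils.data import DataLoader, TensorDataset

# A toy stand-in for ImageNet: 1,000 random 3x224x224 "images" with class labels.
dataset = TensorDataset(torch.randn(1000, 3, 224, 224), torch.randint(0, 10, (1000,)))

# num_workers sets how many subprocesses load batches in parallel;
# pin_memory=True stages batches in page-locked memory for faster host-to-GPU copies.
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=4, pin_memory=True)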

Taking the official ImageNet training in PyTorch as an example:

This is the main_worker function in main.py:

def main_worker(gpu, ngpus_per_node, args):
    global best_acc1
    args.gpu = gpu
    if args.gpu is not None:
        print("Use GPU: {} for training".format(args.gpu))
    ...
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
        num_workers=args.workers, pin_memory=True, sampler=train_sampler)
    val_loader = torch.utils.data.DataLoader(
        datasets.ImageFolder(valdir, transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ])),
        batch_size=args.batch_size, shuffle=False,
        num_workers=args.workers, pin_memory=True)
    if args.evaluate:
        validate(val_loader, model, criterion, args)
        return
    ...

Simply add the following script right after train_loader is created, and comment out the code that comes after it.

...
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
    num_workers=args.workers, pin_memory=True, sampler=train_sampler)

import time
import multiprocessing

use_cuda = torch.cuda.is_available()
core_number = multiprocessing.cpu_count()
best_num_worker = [0, 0]            # best num_workers for pin_memory = False / True
best_time = [99999999, 99999999]    # best loading time for pin_memory = False / True
print('cpu_count =', core_number)

def loading_time(num_workers, pin_memory):
    # Time how long it takes to load a fixed number of batches with this setting.
    kwargs = {'num_workers': num_workers, 'pin_memory': pin_memory} if use_cuda else {}
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
        sampler=train_sampler, **kwargs)
    start = time.time()
    for epoch in range(4):
        for batch_idx, (data, target) in enumerate(train_loader):
            if batch_idx == 15:
                break
    end = time.time()
    print("  Used {} second with num_workers = {}".format(end - start, num_workers))
    return end - start

for pin_memory in [False, True]:
    print("While pin_memory =", pin_memory)
    # Coarse scan: 0, 4, 8, ... up to 2x the number of CPU cores.
    for num_workers in range(0, core_number * 2 + 1, 4):
        current_time = loading_time(num_workers, pin_memory)
        if current_time < best_time[pin_memory]:
            best_time[pin_memory] = current_time
            best_num_worker[pin_memory] = num_workers
        else:  # assuming the loading time is a convex function of num_workers
            # Fine scan: probe the neighbours of the best value found so far.
            if best_num_worker[pin_memory] == 0:
                the_range = []
            else:
                the_range = list(range(best_num_worker[pin_memory] - 3, best_num_worker[pin_memory]))
            for num_workers in (the_range + list(range(best_num_worker[pin_memory] + 1, best_num_worker[pin_memory] + 4))):
                current_time = loading_time(num_workers, pin_memory)
                if current_time < best_time[pin_memory]:
                    best_time[pin_memory] = current_time
                    best_num_worker[pin_memory] = num_workers
            break

if best_time[0] < best_time[1]:
    print("Best num_workers =", best_num_worker[0], "with pin_memory = False")
else:
    print("Best num_workers =", best_num_worker[1], "with pin_memory = True")
return  # return from main_worker so the actual training below is skipped
...

Re-run the code and the script will print the timing for each setting along with the best parameters.

In this case, using num_workers = 2 with pin_memory = False is ~11% faster than num_workers = 4 with pin_memory = True (the original setting on GitHub) and ~24% faster than num_workers = 0 with pin_memory = False (the default setting).

Let’s verify the result.

Originally, training takes ~0.490 s per batch with num_workers = 4 and pin_memory = True.

With the new settings, training takes only ~0.448 s per batch.

Training is ~9% faster!
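As a quick sanity check on that percentage, using the two measured per-batch times above:

old_time, new_time = 0.490, 0.448   # seconds per batch, before and after
print("{:.1%} faster".format((old_time - new_time) / old_time))   # ~8.6%, i.e. roughly 9%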

This can save you a lot of money and time if you are training on an AWS GPU server. The result will vary depending on your setup, but you can easily modify the script to speed up your own PyTorch training.

Since ImageNet is too large to loop through entirely, I made two assumptions in the script to speed up the search:

#1 The data loading time is a convex function of num_workers.

#2 The maximum num_workers tested is twice the number of CPU cores.
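Putting the two assumptions together, the script scans num_workers in coarse steps of 4 up to twice the core count, and once the loading time stops improving it only probes the immediate neighbourhood of the best value found so far. Here is a condensed sketch of that search logic (loading_time stands for the timing helper above; this is an illustration, not a drop-in replacement):

import multiprocessing

def find_best_num_workers(loading_time, pin_memory):
    core_number = multiprocessing.cpu_count()
    best_workers, best_time = 0, float('inf')
    # Assumption #2: never test more than 2x the number of CPU cores.
    for num_workers in range(0, core_number * 2 + 1, 4):
        t = loading_time(num_workers, pin_memory)
        if t < best_time:
            best_workers, best_time = num_workers, t
        else:
            # Assumption #1 (convexity): once the time gets worse, only the
            # neighbourhood of the current best can still improve on it.
            for candidate in range(max(best_workers - 3, 0), best_workers + 4):
                t = loading_time(candidate, pin_memory)
                if t < best_time:
                    best_workers, best_time = candidate, t
            break
    return best_workers, best_time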

Please give me some claps if you find the article useful!