Accelerating I/O bound deep learning

When training a neural network, one typically strives to make the GPU the bottleneck. All data should be read from disk, pre-processed, and transferred to the GPU fast enough so that the GPU is busy 100% of the time computing the next improved version of the model.

A growing trend we see at RiseML is that pre-processing, and especially reading the training data from disk, becomes the bottleneck. This is caused by multiple factors, including faster GPUs, more efficient model architectures, and larger datasets, especially for video and image processing.

As a result, the GPUs sit idle much of the time, waiting for the next batch of data to work on. After optimising the pre-processing pipeline, I/O often becomes the next bottleneck. One solution many teams consider is high-performance storage, such as solid state drives (SSDs) installed in the same server as the GPUs.

However, most teams keep their datasets on shared network storage and access them via NFS from individual cluster nodes. This makes a lot of sense: training data is not scattered across different places, and everybody always works on the same, up-to-date version. In this setup especially, reading and streaming training data from the shared storage can quickly become the bottleneck. To cope with this issue, the teams we talked to often manually copy their training data to local SSDs in order to get the storage performance they need. This doesn’t feel right!

While building RiseML, we stumbled upon an automatic solution for this problem, without side effects and at no extra cost. It’s so simple that we are puzzled why it’s not better known and more widely used.

We would like to bring attention to cachefilesd, a Linux user-space daemon that manages caching for network filesystems. Setting it up takes less than a minute, and in our experiments (see below) training ran 3.77 times as fast as when accessing data directly from shared storage.

How does it work?

Whenever a file is requested via NFS from shared storage, it is cached on a local disk’s filesystem by cachefilesd. Subsequent requests are then served locally and save network latency and bandwidth. Once the cache is full, the least-recently used file gets evicted from it.
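The eviction behaviour described above can be illustrated with a toy model. This is only a sketch of least-recently-used eviction in plain shell, not how cachefilesd is implemented: the `access` function, the `capacity` variable, and the file names are hypothetical, and the real cache is limited by disk space rather than by a file count.

```shell
# Toy model of a file cache with least-recently-used (LRU) eviction.
# "capacity" counts files here; the real cache is limited by disk space.
capacity=2
cache=""            # space-separated file names, least recently used first

access() {
  file=$1; hit=0; kept=""; count=1
  for entry in $cache; do
    if [ "$entry" = "$file" ]; then
      hit=1                            # already cached: served locally
    else
      kept="$kept $entry"; count=$((count + 1))
    fi
  done
  cache="$kept $file"                  # requested file becomes most recently used
  if [ "$count" -gt "$capacity" ]; then
    set -- $cache; shift; cache="$*"   # evict the least recently used file
  fi
  if [ "$hit" -eq 1 ]; then echo "hit  $file"; else echo "miss $file"; fi
}

# Request files a, b, a, c, b with room for only two files.
for name in a b a c b; do access "$name"; done
```

With room for two files, the second request for `a` is a hit, while the second request for `b` is a miss because `b` was evicted when `c` arrived.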

With deep learning, training consists of multiple epochs. Typically, every epoch reads the entire dataset once, but in a different order. This access pattern turns out to be a good match for cachefilesd. If the local cache is big enough to hold your dataset, all data is served from the local cache from the second epoch onwards.
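Whether the cache is "big enough" is a quick back-of-the-envelope check. The helper below is a hypothetical sketch: `fits_in_cache` is not part of cachefilesd, and the numbers are the ones used elsewhere in this post (a 150 GB dataset, a 375 GB SSD, and a cache allowed to use up to 90% of its filesystem).

```shell
# Hypothetical helper: does a dataset fit into the usable part of the cache disk?
# Arguments: dataset size in GB, disk size in GB, usable share of the disk in percent.
fits_in_cache() {
  usable_gb=$(( $2 * $3 / 100 ))     # integer arithmetic, rounds down
  if [ "$1" -le "$usable_gb" ]; then
    echo "fits: $1 GB dataset, $usable_gb GB usable cache"
  else
    echo "too small: $1 GB dataset, $usable_gb GB usable cache"
    return 1
  fi
}

fits_in_cache 150 375 90
```

If the dataset does not fit, files are evicted before they are requested again in the next epoch, so fewer reads are served from the local disk.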

Setting up cachefilesd

Installing the cachefilesd daemon on a node requires only two simple steps. To enable caching for a remote filesystem, you need to mount it with the fsc option. If you use RiseML and configured it with caching (see below), it will automatically use this mount option.

1) Install cachefilesd package

$ sudo apt-get install cachefilesd

The default configuration uses up to 90% of the filesystem at /var/cache/fscache. You can change the configuration in the file /etc/cachefilesd.conf to point to a different directory. This blog post gives a nice overview of the configuration with a separate disk.
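As a rough sketch of such a change, the snippet below drafts a configuration in a scratch file so it can be reviewed before it replaces /etc/cachefilesd.conf. The directory /mnt/ssd/fscache is a hypothetical mount point for a dedicated SSD, and the threshold values are the ones we believe ship in the packaged default file; check `man cachefilesd.conf` on your system before relying on them.

```shell
# Draft a cachefilesd configuration in a scratch file for review.
# /mnt/ssd/fscache is a hypothetical mount point for a dedicated SSD.
conf=$(mktemp)
cat > "$conf" <<'EOF'
dir /mnt/ssd/fscache
tag mycache
brun 10%
bcull 7%
bstop 3%
frun 10%
fcull 7%
fstop 3%
EOF
cat "$conf"

# After review, copy it into place and restart the daemon:
#   sudo cp "$conf" /etc/cachefilesd.conf
#   sudo service cachefilesd restart
```

According to the man page, `bcull` is the free-space level at which culling of old cache entries starts, `brun` the level at which it stops again, and `bstop` the level below which nothing new is cached. The `f*` settings apply the same limits to the number of available files.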

2) Start cachefilesd

To enable the daemon, you need to edit /etc/default/cachefilesd and uncomment the line that says RUN=YES. Then, restart the service:

$ sudo service cachefilesd restart

Note: Make sure you run cachefilesd on every cluster node.

3) Mount the filesystem with fsc option

When mounting an NFS filesystem on a node, use the fsc option, e.g.:

$ sudo mount -t nfs -o fsc nfs-server:/data /data
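To check that the option took effect, you can look for fsc in the node's mount table. The function below is a small sketch: `fsc_mounts` is a hypothetical helper name, and it assumes the kernel lists the fsc option for NFS mounts in /proc/mounts, which is the behaviour we have seen on Linux.

```shell
# List NFS mount points that carry the fsc option.
# $1: mount table to read (defaults to /proc/mounts on Linux).
fsc_mounts() {
  awk '$3 ~ /^nfs/ && ("," $4 ",") ~ /,fsc,/ { print $2 }' "${1:-/proc/mounts}"
}

# Example (run on a cluster node after mounting):
#   fsc_mounts        # should list /data if the mount above uses the cache
```

On many kernels, /proc/fs/fscache/stats additionally shows cache counters that change while training reads data, which is a quick way to confirm the cache is actually in use.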

If you use RiseML, you need to add the mountOptions section to the specification of the training data volume during installation. This will automatically use the fsc option.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-db
spec:
  # other volume settings omitted
  mountOptions:
    - fsc

A real-world benchmark

To see the effect in a real-world scenario, we benchmarked a model that was generously provided to us by TerraLoupe. The model performs image segmentation: given an aerial image, it identifies the roads and buildings in it. The training data consists of ~160k images (150 GB) and corresponding segmentations (see an example below). Aerial imagery was collected from the Open.NRW portal and segmentations were derived from OpenStreetMap.

Example input on the left with desired output segmentation on the right. The output identifies roads and buildings (white)

The model is proprietary. However, it illustrates the problem nicely, since the required pre-processing is very efficient and training the neural network itself is very fast and can be parallelized across several GPUs.

We ran our experiments on the Google Cloud Platform on nodes equipped with 4 NVIDIA P100 GPUs each. For cachefilesd, we equipped the nodes with a local SSD of 375 GB and set up caching on the disk with RiseML as described above. Training data was exported from a shared storage via NFS with a 10G Ethernet connection.

The diagram above shows the training speed in images per second, with and without cachefilesd. Without cachefilesd, the model consistently processes around 9.6 images per second. With cachefilesd enabled, the local cache is still being filled during the first epoch, so training speed remains the same. For later epochs, data is read locally from the cache, and speed increases to 36.2 images per second. This is almost 4 times as fast! Of course, YMMV, depending on how fast your network and NFS server are, how high the load is, and what kind of model and pre-processing you use. However, note that this speedup is essentially free: it comes just from using local storage as a cache!
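The arithmetic behind these numbers is easy to reproduce. The snippet below is a sketch using only figures from this post. The helper names `speedup` and `epoch_hours` are ours, 160000 stands in for the "~160k images", and the epoch times assume the throughput stays constant over the whole epoch.

```shell
# Throughput figures reported in this benchmark (images per second).
baseline=9.6
cached=36.2
images=160000        # approximate dataset size ("~160k images")

# Ratio of cached to uncached throughput.
speedup() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", b / a }'; }
# Hours needed to read all images once at a given rate.
epoch_hours() { awk -v n="$1" -v r="$2" 'BEGIN { printf "%.1f\n", n / r / 3600 }'; }

echo "speed-up:        $(speedup "$baseline" "$cached")x"
echo "epoch, uncached: $(epoch_hours "$images" "$baseline") h"
echo "epoch, cached:   $(epoch_hours "$images" "$cached") h"
```

Under these assumptions, an epoch drops from roughly 4.6 hours to roughly 1.2 hours once the cache is warm.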


Using fast local storage as a cache can dramatically increase training performance. Cachefilesd provides an easy and automatic way to manage this cache: it imposes no overhead, needs no extra maintenance, and is very easy to set up. Giving cachefilesd a try is a no-brainer.

Local storage is also cheap. For example, a fast SSD with 375 GB of storage currently costs $30/month on the Google Cloud Platform — negligible compared to the price for a single P100 GPU (~$1000/month). If you buy your own servers, they usually come with fast SSDs that have a lot of unused space.

Thanks to Sebastian Gerke and Nick Harvey for reading drafts of this blog.

About RiseML

Providing fast data access via caching is just one of the ways RiseML can scale and automate your machine learning workflow. It also lets your team share your clusters’ resources through an interface tailored for machine learning engineers, allowing them to automatically prepare, run, monitor, and scale experiments in parallel using the machine learning framework of their choice. Advanced techniques such as hyperparameter optimization and distributed training can also be easily enabled. We offer a free community edition for individual users and professional and enterprise editions for teams.

Try RiseML now!

Accelerating I/O bound deep learning was originally published in RiseML Blog on Medium.

Source: Deep Learning on Medium