Source: Deep Learning on Medium
Historically, we have saved data so that we can read it ourselves. Now, with rapid advances in data science, machines are reading our data: we’re shifting from humans occasionally accessing small amounts of data to deep learning scripts repeatedly accessing large volumes of it.
It’s the difference between loading a patient’s x-ray on a monitor to visually assess it versus running the x-ray 1,000 times through a neural network that converts the raw image to vectors, crops it, rotates it, blurs it, and then tries to identify a pneumothorax.
That’s why we’ve seen increasing use of GPUs for deep learning: we need to perform more and more math on our data. The compute elements in a GPU are designed for the massive, data-parallel math operations that underpin modern graphics pipelines and, as it turns out, deep learning workflows as well.
But it’s not as easy as [data → GPU]
While GPUs perform the complex math of deep learning, each training job is a pipeline with steps before that GPU work.
In training, before data enters the neural network, it is preprocessed through a series of manipulations: indexing (listing) the items in the training dataset, shuffling them, actually loading them from storage, and then applying distortions. Preprocessing usually runs on CPUs (frequently, on the CPUs within a GPU server).
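The steps above can be sketched in plain Python. This is an illustrative stand-in, not the actual training pipeline: the helper names (`load_record`, `distort`) and the toy payloads are hypothetical, and in a real job the loading and distortion would be done by a data-loading framework.

```python
import random

def index_dataset(keys):
    """Step 1: enumerate (index) the items in the training set."""
    return list(keys)

def make_epoch(keys, seed=0):
    """Step 2: shuffle the index so each epoch sees a new order."""
    order = index_dataset(keys)
    random.Random(seed).shuffle(order)
    return order

def load_record(key):
    """Step 3: load one record from storage (stubbed with fake data)."""
    return [float(ord(c)) for c in key]   # stand-in payload

def distort(sample):
    """Step 4: apply a distortion (augmentation) before training."""
    return [x * 0.5 for x in sample]      # stand-in transform

def preprocess_epoch(keys, seed=0):
    """Run the full index -> shuffle -> load -> distort pipeline."""
    for key in make_epoch(keys, seed):
        yield key, distort(load_record(key))

batches = list(preprocess_epoch(["p001", "p002", "p003"], seed=42))
```

Every step here runs on the CPU; only after this does a batch move to the GPU.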
So data has to move from where it’s stored, through a CPU, to the GPUs for computation.
The data load itself is often the cause of the largest throughput bottleneck for training pipelines.
If you were loading a patient’s x-ray to look at it, you might find a 0.04 sec data load time acceptable. You’d hardly notice the latency because your “review” time might be several minutes long.
On the other hand, a GPU server that takes only 0.001 sec to “review” the image would be slowed down by a 0.04 sec data load time. And the delay compounds: massive training datasets contain not one item but thousands, or millions, and training is iterative (e.g. 20 epochs in a single job), so data-loading delays are incurred many times over.
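A quick back-of-the-envelope calculation, using the figures from the paragraphs above (0.04 sec load, 0.001 sec compute, 1,000,000 records, 20 epochs), shows how badly this compounds when loads are serialized with compute:

```python
# Figures taken from the text above; treat this as a rough upper bound,
# since real pipelines overlap some loading with compute.
load_s = 0.04      # data load time per item
train_s = 0.001    # GPU "review" time per item

# If each load must finish before the GPU can work, the GPU is idle
# for load_s out of every (load_s + train_s) seconds.
idle_fraction = load_s / (load_s + train_s)   # ~0.976

# Delays compound across a large dataset and many epochs.
items, epochs = 1_000_000, 20
total_load_hours = items * epochs * load_s / 3600   # ~222 hours of loading
```

Even a few hundredths of a second per item turns into hundreds of hours of cumulative load time at this scale.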
Regardless of how fast your GPUs are, your training job can only be as fast as your data load time.
So, there’s a massive performance benefit in ensuring that data loading is efficient.
This post describes our project with Geisinger’s Department of Imaging Science and Innovation to optimize training throughput for a dataset of electrocardiogram results. We produced a simulated dataset matched in data size and type for optimization testing.
First, identify the performance baseline
A good place to start a DL performance investigation is the ratio between the time spent loading data and the time spent on training computation.
After we ran this basic task breakdown, we saw that data load time was significantly larger than training time. This is highly inefficient: with a train:load ratio of 1:4, the GPU finishes its compute on each batch in a quarter of the time it takes to load the next one.
Every batch, roughly 75% of the GPU time was spent idle while we waited for the next batch to load.
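One minimal way to get this kind of task breakdown is to wrap the batch loop with timers. The sketch below is a generic pattern, not the instrumentation we actually used; the stub loader and train step simulate a load-bound job.

```python
import time

def profile_ratio(loader, train_step):
    """Accumulate per-batch load time vs. train time; return (load_s, train_s)."""
    load_s = train_s = 0.0
    it = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)          # data load (and CPU preprocessing)
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)             # GPU compute in a real job
        t2 = time.perf_counter()
        load_s += t1 - t0
        train_s += t2 - t1
    return load_s, train_s

# Stubs simulating a 4:1 load:train imbalance.
def slow_loader():
    for i in range(5):
        time.sleep(0.004)             # pretend each batch takes 4 ms to load
        yield i

load_s, train_s = profile_ratio(slow_loader(), lambda b: time.sleep(0.001))
```

If `load_s` dominates `train_s`, as it did for us, the GPUs are being starved and data loading is the place to optimize.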
We need to be loading data a lot faster.
Why are files slow to load?
Our synthetic training data represented logs from 15-lead ECG tests: 15 HDF5 files of 5 GB each, with 1,000,000 records (patients) per file. Each file represented a single ECG channel; within each file, ‘patient_id’ was the key and that channel’s results were the record.
We were treating HDF5 like a key-value store, but the file format is actually designed to support more complex data structures. HDF (Hierarchical Data Format) is great for storing large numerical datasets that have extensive metadata. It enables grouping within the file, effectively making an HDF file a portable file system. HDF is primarily used for scientific datasets in HPC use cases.
A downside of that functionality, however, is that there’s significant overhead per read of an HDF5 file.
In a system-call trace of a single record lookup, ‘lseek’ moves the file pointer to specific locations within the file, and ‘read’ then actually reads data. The final call is a read of 4992 bytes that retrieves one value from the key-value pairs that comprise our HDF5 file. The prior 5 reads are of metadata (pointers) inside the file that are required to locate the desired data.
Why are there so many metadata reads? Because HDF uses a B-Tree structure to help increase performance of its file-system-like metadata queries. Just like on a filesystem, knowing where to look for data can be a harder problem than moving the data into a CPU.
After seeing how many system calls each HDF5 read incurs, we were convinced that there was a more read-performant way to save our simple dataset. We weren’t really using the hierarchical functionality that gives HDF its name.
What would optimal look like?
As a thought experiment, we started with the question, “What’s the best possible layout for key-value data inside a file for maximum read throughput?”
What are we solving for?
1. We don’t need to save complex metadata
The only descriptive information we’re using for the records is the key.
Let’s move away from the functionality in HDF that we’re not making use of.
2. Fast reads for both individual records and for collecting the list of keys
As is common during training jobs, our scripts start by shuffling the dataset. The dataset’s contents must be enumerated before those items can be shuffled. Since this enumeration step occurs at the beginning of every job, we don’t want to build in front-end delays by having slow key collection (which we had seen with HDF5).
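One way to make key collection fast, regardless of the container format, is to persist the key list once in a small sidecar index that every job reads at startup. This is an illustrative sketch, not the scheme the project settled on; the file name and JSON layout are hypothetical.

```python
import json
import os
import random
import tempfile

def write_key_index(keys, path):
    """Write the dataset's key list once, at dataset-creation time."""
    with open(path, "w") as f:
        json.dump(sorted(keys), f)

def shuffled_keys(path, seed):
    """At job start: one small read enumerates the dataset, then shuffle."""
    with open(path) as f:
        keys = json.load(f)
    random.Random(seed).shuffle(keys)
    return keys

index_path = os.path.join(tempfile.mkdtemp(), "keys.json")
write_key_index({"p003", "p001", "p002"}, index_path)
epoch_order = shuffled_keys(index_path, seed=7)
```

The enumeration cost is paid once when the dataset is written, instead of at the front of every training job.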
3. The larger the IO, the better
We’d prefer to train from datasets saved on remote storage. Why? First, data management is simpler, and we don’t have to spend time copying datasets to local devices ahead of time. Second, our dataset was too large to fit in local memory in its entirety, so we’d have to manage some data sharding process if we trained from a local storage location.
So, since our reads will have to go out over the network to storage, we might as well amortize the cost of each network round trip by making the reads larger.
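To make the thought experiment concrete, here is one possible layout that satisfies all three requirements: fixed-size records packed back-to-back, with a small in-memory key table mapping each key to its offset. This is a sketch of the idea, not the project's actual format; the 4992-byte record size simply echoes the read size observed in the HDF5 trace.

```python
import io

RECORD_SIZE = 4992  # bytes per record (assumed fixed-size records)

def pack(records):
    """Lay records out contiguously; return (blob, key -> byte offset)."""
    buf, offsets = io.BytesIO(), {}
    for key, payload in records.items():
        assert len(payload) == RECORD_SIZE
        offsets[key] = buf.tell()
        buf.write(payload)
    return buf.getvalue(), offsets

def read_record(blob, offsets, key):
    """One record == one large contiguous read (a single pread on a file)."""
    off = offsets[key]
    return blob[off:off + RECORD_SIZE]

records = {f"p{i:03d}": bytes([i]) * RECORD_SIZE for i in range(3)}
blob, offsets = pack(records)
```

With this layout there is no metadata to traverse per lookup: listing the keys is a single scan of the offset table (requirement 2), and fetching a record is exactly one large read (requirement 3), which is what we want when every read crosses the network.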