A File System for Supercomputing and Lay-Programming

Source: Deep Learning on Medium


A BEGINNER’S GUIDE TO

Storing Images in HDF5 Files in Python

Reading about some of the solutions to multiple deep learning and image processing Kaggle competitions, I came across the need to understand how to use Hierarchical Data Format files (HDF5). These are some of my learnings on the topic; I hope you enjoy them!

In this article, you will learn the following:

  • Why HDF5 files are cool
  • What are HDF5 files, and how they work
  • How to store and retrieve images in an HDF5 file using the h5py library in Python

Why HDF5 files are cool: HDF5 was originally developed by The National Center for Supercomputing Applications (pretty cool, right?!) and is designed with the intent of storing HUGE amounts of numerical data. The HDF file format was selected by the National Aeronautics and Space Administration (NASA) as the default data and information system for the Earth Observing System (EOS) project.

[Figure: a map of the world showing precipitation levels from satellite imagery swath coverage, a visualization of Earth Observation data extracted from an HDF5 file.]

HDF5 is a preferred format for scientific projects because of its ease of use and large storage capacity.

What are HDF5 files, and how they work:

Dealing with large amounts of data in a structured fashion is no easy task. Fortunately, HDF5 files help you manage complexity in your data storage and retrieval process under the convenience of a single file.

HDF (HDF4, HDF5, etc.) stands for Hierarchical Data Format. You can store and easily manipulate multiple datasets within a single HDF5 file. In essence:

“An HDF5 file is a container for two kinds of objects: datasets, which are array-like collections of data, and groups, which are folder-like containers that hold datasets and other groups.”

How they work: As mentioned above, an HDF5 file can be thought of as a folder containing other folders that contain array-like collections of data.

  • An HDF5 file is one large group, and contains multiple groups
  • Each group can contain other groups or datasets
  • Datasets contain array-like collections of data
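As a quick illustration of this hierarchy (using the h5py library, which we cover in detail below), groups nest like folders and every object gets a POSIX-style path inside the file. This is a minimal sketch with made-up group names, kept entirely in memory:

```python
import h5py
import numpy as np

# driver='core' with backing_store=False keeps the file in RAM,
# so nothing is actually written to disk for this demo.
with h5py.File("demo.hdf5", "w", driver="core", backing_store=False) as f:
    group = f.create_group("satellite")       # a folder-like container
    sub = group.create_group("2019")          # groups can contain groups
    sub.create_dataset("data", data=np.zeros((2, 3)))  # array-like data

    # Every object has a path inside the file, much like a file system
    print(f["satellite/2019/data"].name)   # /satellite/2019/data
    print(f["satellite/2019/data"].shape)  # (2, 3)
```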

In this example, we will be converting the dataset from this Kaggle competition into an HDF5 file. The data comes in a ships-in-satellite-imagery.zip file, which contains the shipsnet.zip sub-file with a total of 4,000 satellite images labelled as either “ship” or “no-ship”. The idea is to be able to go from a cumbersome .zip file to a convenient .hdf5 file where all the images are stored in vectorized form.

For now, all you need to know is that I built data.hdf5 by unzipping the shipsnet.zip file and processing the images into two different groups (folders) in the data.hdf5 file: ships and no_ships. There are a total of 4,000 observations: 1,000 images with ships and 3,000 without.

Retrieving Data from an HDF5 File

We will take a backward approach to understanding how to use an HDF5 file. Here, we are loading in the data.hdf5 and extracting the information from it.

Steps to read from an HDF5 file:

  1. Open the HDF5 file with ‘r’ (read) privileges
  2. Access the groups within the file just as you would access values in a dictionary, and get the corresponding datasets
  3. Index the datasets within every group in the same key-value fashion
Reading in the datasets stored in the HDF5 file. Each group in the file can be accessed the same way that one would access values in a dictionary.
The hdf5 file acts as a big dictionary where the groups are the keys: <KeysViewHDF5 ['no_ships', 'ships']>
Each group has a dataset associated with it: 
<KeysViewHDF5 ['data']> <KeysViewHDF5 ['data']>
Images are stored as numpy arrays: <class 'numpy.ndarray'>
The shape of the positive cases is: (1000, 80, 80, 3)
The shape of the negative cases is: (3000, 80, 80, 3)
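The snippet that produced the output above follows the three steps directly. This is a sketch of the same pattern; to keep it self-contained it first builds a small stand-in data.hdf5, with random arrays taking the place of the real satellite images:

```python
import h5py
import numpy as np

# Stand-in for the real file: random 80x80x3 "images" in two groups.
with h5py.File("data.hdf5", "w") as f:
    f.create_group("ships").create_dataset(
        "data", data=np.random.randint(0, 255, (1000, 80, 80, 3), dtype=np.uint8))
    f.create_group("no_ships").create_dataset(
        "data", data=np.random.randint(0, 255, (3000, 80, 80, 3), dtype=np.uint8))

# 1. Open the file with 'r' (read) privileges
with h5py.File("data.hdf5", "r") as f:
    print(f.keys())  # the groups act like dictionary keys

    # 2. Access each group like a value in a dictionary
    ships = f["ships"]
    no_ships = f["no_ships"]
    print(ships.keys(), no_ships.keys())

    # 3. Index the dataset inside each group the same way, and slice it
    #    to pull the images out as NumPy arrays
    ship_images = ships["data"][:]
    no_ship_images = no_ships["data"][:]
    print(type(ship_images))     # <class 'numpy.ndarray'>
    print(ship_images.shape)     # (1000, 80, 80, 3)
    print(no_ship_images.shape)  # (3000, 80, 80, 3)
```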

We use the h5py library to interact with HDF5 files in Python. As seen in the printed results, the keys in data.hdf5 are the groups ships and no_ships. Getting the keys of each group in this case returns its dataset, which is the NumPy array containing the vectorized images.

  • The dataset corresponding to the ships group has a shape of (1000, 80, 80, 3), meaning that it holds 1,000 images of shape (80, 80, 3). This will be important to remember in the upcoming step, as it will help us understand how datasets are created in HDF5 files.

Now that we have a solid understanding of the structure of an HDF5 file, and how to access its data, we can go into how to write data in it.

Storing Data in an HDF5 File

After downloading the dataset from Kaggle, the next step is to do a little pre-processing to read the images from a directory and extract the labels from the image names. In this code snippet, I have omitted the get_list_paths() and extract_labels() functions for brevity’s sake. If you want to see the entire process, I recommend you check the repository here.

When building an HDF5 file, the most important thing to understand is that you must establish the shape of a dataset before storing values in it.

In this scenario, we want to create an array of arrays that can store multiple images. Since each image is an array of shape (80, 80, 3), we need to create a matrix of shape (n, 80, 80, 3), where n corresponds to the number of images for a given class.
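The shape requirement can be seen directly in how a dataset is created: h5py lets you pre-allocate a fixed-shape dataset and fill in the values afterwards. A tiny sketch (with a hypothetical file name and n = 1000):

```python
import h5py
import numpy as np

n = 1000  # number of images in a given class
with h5py.File("example.hdf5", "w") as f:
    # Pre-allocate a dataset of shape (n, 80, 80, 3); the shape is fixed
    # up front, and individual images are written in later.
    dset = f.create_dataset("images", shape=(n, 80, 80, 3), dtype=np.uint8)
    dset[0] = np.zeros((80, 80, 3), dtype=np.uint8)  # write the first image
    print(dset.shape)  # (1000, 80, 80, 3)
```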

Steps to write to an HDF5 file:

  1. Open the HDF5 file with ‘w’ (write) privileges
  2. Create new groups within the file
  3. Specify the dimensions of the datasets and create a new dataset within each group
  4. Write the ith image to the ith index of the dataset
The total number of images is: 4000
The dimensions of the images are: (80, 80, 3)
The shape of the ship dataset is: (1000, 80, 80, 3)
The shape of the no_ship dataset is: (3000, 80, 80, 3)
The name of the ships dataset is /ships/data
The name of the no_ships dataset is /no_ships/data
Checking one element of each dataset (80, 80, 3) (80, 80, 3)
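The snippet that produced this output follows the four steps above. Here is a self-contained sketch of the same pattern, where random arrays stand in for the real pre-processed images (in the repository they come from the directory of unzipped files):

```python
import h5py
import numpy as np

# Stand-ins for the pre-processed images: random arrays in place of
# the real 80x80x3 satellite images read from disk.
ship_images = [np.random.randint(0, 255, (80, 80, 3), dtype=np.uint8)
               for _ in range(1000)]
no_ship_images = [np.random.randint(0, 255, (80, 80, 3), dtype=np.uint8)
                  for _ in range(3000)]

# The dataset shapes must be established up front: (n, 80, 80, 3)
ship_shape = (len(ship_images), 80, 80, 3)
no_ship_shape = (len(no_ship_images), 80, 80, 3)

# 1. Open the file with 'w' (write) privileges
with h5py.File("data.hdf5", "w") as f:
    # 2. Create a group, and 3. create a fixed-shape dataset inside it
    ships = f.create_group("ships")
    ship_data = ships.create_dataset("data", shape=ship_shape, dtype=np.uint8)

    no_ships = f.create_group("no_ships")
    no_ship_data = no_ships.create_dataset("data", shape=no_ship_shape,
                                           dtype=np.uint8)

    # 4. Write the ith image to the ith index of each dataset
    for i, img in enumerate(ship_images):
        ship_data[i] = img
    for i, img in enumerate(no_ship_images):
        no_ship_data[i] = img

    print(ship_data.name)     # /ships/data
    print(no_ship_data.name)  # /no_ships/data
```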

The print statements help us verify that the data was correctly written into the HDF5 file. The trickiest part of working with these files is understanding that there is a trade-off between storage efficiency and flexibility. Though specifying the dimensions of the data being stored might be cumbersome at first, the ease of information retrieval when working with HDF5 compensates for it in the long run.

And that’s it! Now we have both successfully retrieved and stored information in an HDF5 file. Perhaps this is even your first step toward working at a supercomputing agency like NASA!

If you would like to read more about HDF5 files, their advantages, and when and how to use them, I recommend you look over the following resources:

Here is also a link to the Kaggle competition and my GitHub account containing the full code for this example: