Getting Started with the DIUx xView Dataset for Overhead Object Detection

The new DIUx xView dataset, released March 2018, is one of the largest and most diverse public overhead object detection datasets available. It contains more than 1 million labeled objects covering over 1,400 km² of the Earth’s surface. Objects belong to 60 classes, including fine-grained classes. xView targets four computer vision frontiers: improving low-resolution and multi-scale recognition, improving learning efficiency, pushing the limit of discoverable object categories, and improving detection of fine-grained classes.

This post describes how to get started with the DIUx xView dataset, from downloading it to being ready for computer vision model training. You can find more dataset specifics in the official paper.

An example image from the DIUx xView dataset

1. Getting the Data

The xView data will be available through the competition website. We have split the overall dataset into train, validation, and testing segments. The full training set (images and labels) and the validation set images are available for download. The test set is withheld for final competition scoring. The training, testing, and validation sets contain 60%, 21%, and 19% of the dataset, respectively.

2. Displaying Images

Once you’ve downloaded the dataset, open up a Jupyter notebook (or another interactive development tool). You will need a few packages to manipulate the data: Python Imaging Library (PIL), NumPy, and matplotlib (for display). With these installed, simply use:

from PIL import Image
import numpy as np
fname = 'chips/104.tif'
img = np.array(Image.open(fname))
Loading an image from the DIUx xView dataset

That’s it! All of the images have been post-processed to contain maximum RGB color information, so they can be modified just like natural imagery. With the image array you can perform whatever image modifications you need, such as rotations or data type conversion.
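As a small illustration of the kinds of modifications mentioned above, the sketch below rotates an image array by 90 degrees and converts it to floating point. It uses a synthetic array as a stand-in for a loaded image (an assumption, so the snippet runs without the dataset); with real data, img would come from the loading code.

```python
import numpy as np

# Synthetic stand-in for a loaded image: height 200, width 300, 3 RGB channels.
img = np.random.randint(0, 256, size=(200, 300, 3), dtype=np.uint8)

# Rotate 90 degrees counter-clockwise; height and width swap.
rotated = np.rot90(img)

# Convert to float32 scaled to [0, 1], a common network input format.
img_float = img.astype(np.float32) / 255.0

print(rotated.shape)    # (300, 200, 3)
print(img_float.dtype)  # float32
```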

3. Ground Truth Labels

The bounding box labels come in GeoJSON format, which can be parsed out of the box using Python’s json package. You can parse it like this:

import json
import numpy as np

# Processes an xView GeoJSON file
# INPUT: filepath to the GeoJSON file
# OUTPUT: Bounding box coordinate array, Chip-name array, and Class-id array
def get_labels(fname="xView_train.geojson"):
    with open(fname) as f:
        data = json.load(f)

    # One row per feature: box coordinates, image file name, and class id
    coords = np.zeros((len(data['features']), 4))
    chips = np.zeros((len(data['features'])), dtype="object")
    classes = np.zeros((len(data['features'])))

    for i in range(len(data['features'])):
        if data['features'][i]['properties']['bounds_imcoords'] != []:
            b_id = data['features'][i]['properties']['image_id']
            # bounds_imcoords is a comma-separated string "xmin,ymin,xmax,ymax"
            val = np.array([int(num) for num in data['features'][i]['properties']['bounds_imcoords'].split(",")])
            chips[i] = b_id
            classes[i] = data['features'][i]['properties']['type_id']
            coords[i] = val
        else:
            # No bounding box for this feature; mark its chip name as missing
            chips[i] = 'None'

    return coords, chips, classes

The features list contains one entry per bounding box label, so its length is the number of labels. Under features -> properties we are interested in image_id, type_id, and bounds_imcoords. The field image_id gives the file name of the image the label belongs to, for example ‘83204.tif’. The field type_id gives the label class as an integer (between 1 and 100). And bounds_imcoords gives the pixel coordinates, relative to the image specified by image_id, of an axis-aligned bounding box. The coordinates are in the format (xmin, ymin, xmax, ymax).
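To make the field layout concrete, here is a minimal sketch that parses a few hand-written features and then selects all boxes belonging to one image. The feature values are made up for illustration, and the three arrays stand in for what a parsing function would return on the real file.

```python
import json
import numpy as np

# Hand-written, hypothetical features in the layout described above.
raw = '''{"features": [
  {"properties": {"image_id": "83204.tif", "type_id": 18,
                  "bounds_imcoords": "10,20,50,80"}},
  {"properties": {"image_id": "83204.tif", "type_id": 73,
                  "bounds_imcoords": "100,120,160,200"}},
  {"properties": {"image_id": "104.tif", "type_id": 18,
                  "bounds_imcoords": "5,5,25,30"}}
]}'''
data = json.loads(raw)

props = [f["properties"] for f in data["features"]]
chips = np.array([p["image_id"] for p in props])
classes = np.array([p["type_id"] for p in props])
coords = np.array([[int(v) for v in p["bounds_imcoords"].split(",")]
                   for p in props])

# A boolean mask picks out the labels for one image.
mask = chips == "83204.tif"
boxes = coords[mask]                 # rows are (xmin, ymin, xmax, ymax)
widths = boxes[:, 2] - boxes[:, 0]
heights = boxes[:, 3] - boxes[:, 1]
print(boxes.shape)                   # (2, 4)
print(widths.tolist(), heights.tolist())  # [40, 60] [60, 80]
```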

Image with bounding boxes overlaid

4. Chipping Images and Relative Bounding Boxes

Now you have both the images and bounding boxes ready. Next comes data processing. Most object detection algorithms expect fixed-size inputs in the range of 200–500 pixels per side, but each TIFF image in xView can be upwards of 3000 pixels per side. There are a variety of ways to bring the images down to size. I chose to chip them into non-overlapping regions of constant width and height. In my experiments I used square chips.

Example 300×300 chip with bounding boxes. Different bounding box fills indicate different classes

This method is computationally inexpensive (especially compared to chipping with overlapping regions) and covers the whole dataset. It is important to note that cropping at small chip sizes can truncate ground truth boxes that cross a chip boundary.
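The chipping step can be sketched as follows. This is a simplified illustration rather than the code used for the baseline: the function name chip_image_simple and its behavior (dropping partial chips at the right and bottom edges, clipping boxes to the chip, discarding boxes that do not overlap it) are assumptions made for this example.

```python
import numpy as np

def chip_image_simple(img, boxes, classes, chip_size=300):
    """Split img into non-overlapping square chips and re-express boxes
    relative to each chip. Hypothetical helper, for illustration only."""
    h, w = img.shape[:2]
    boxes = np.asarray(boxes, dtype=np.int64).reshape(-1, 4)
    classes = np.asarray(classes)
    out = []
    # Partial chips at the right and bottom edges are dropped.
    for y0 in range(0, h - chip_size + 1, chip_size):
        for x0 in range(0, w - chip_size + 1, chip_size):
            x1, y1 = x0 + chip_size, y0 + chip_size
            # Clip every box to the chip, in full-image coordinates.
            xmin = np.clip(boxes[:, 0], x0, x1)
            ymin = np.clip(boxes[:, 1], y0, y1)
            xmax = np.clip(boxes[:, 2], x0, x1)
            ymax = np.clip(boxes[:, 3], y0, y1)
            # Keep boxes that still have positive area inside the chip.
            keep = (xmax > xmin) & (ymax > ymin)
            rel = np.stack([xmin - x0, ymin - y0, xmax - x0, ymax - y0], axis=1)
            out.append((img[y0:y1, x0:x1], rel[keep], classes[keep]))
    return out

# Usage with synthetic data: a 600x900 image yields 2 x 3 = 6 chips.
img = np.zeros((600, 900, 3), dtype=np.uint8)
boxes = [[250, 100, 350, 200],   # straddles the first two chips
         [700, 400, 750, 450]]   # entirely inside the last chip
chips = chip_image_simple(img, boxes, classes=[18, 73])
print(len(chips))               # 6
print(chips[0][1].tolist())     # [[250, 100, 300, 200]]
print(chips[1][1].tolist())     # [[0, 100, 50, 200]]
```

The first box is split across two chips, so each chip sees only a truncated piece of it. This is the ground truth truncation effect, and it becomes more frequent as the chip size shrinks.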

Note: I also added functionality to save chips at multiple cropping resolutions. The multi-resolution dataset used in the xView baseline experiments was created this way. However, smaller crop resolutions lead to significantly more chips than higher crop resolutions. I limit the number of lower-resolution chips in proportion to the number of high-resolution chips. This way, we can avoid an imbalance in chip counts.
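One way to cap the lower-resolution chips is to sample them at random up to a fixed multiple of the high-resolution chip count. The sketch below is only an assumption about how such a limit could work; the helper name limit_chips and the ratio parameter are hypothetical and not taken from the baseline code.

```python
import numpy as np

def limit_chips(low_res_chips, n_high_res, ratio=1.0, seed=0):
    """Randomly keep at most ratio * n_high_res low-resolution chips.
    Hypothetical helper, for illustration only."""
    max_keep = int(ratio * n_high_res)
    if len(low_res_chips) <= max_keep:
        return list(low_res_chips)
    rng = np.random.default_rng(seed)
    # Sample without replacement, then sort to preserve the original order.
    idx = np.sort(rng.choice(len(low_res_chips), size=max_keep, replace=False))
    return [low_res_chips[i] for i in idx]

# A 3000x3000 image gives 100 chips at 300 px but 225 chips at 200 px.
high = list(range(100))
low = list(range(225))
kept = limit_chips(low, n_high_res=len(high), ratio=1.5)
print(len(kept))  # 150
```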

5. Saving to TFRecord Format

Now that we have a set of chips and relative bounding boxes, it’s time to save them in a format that can be used by an object detection algorithm. This is highly dependent on your development plan; I am using TensorFlow and will be saving the data in the recommended TFRecord format with the PASCAL VOC label attributes. However, you can also save the chips as JPEGs and the bounding box coordinates in a CSV file to be read later. I save height, width, image, bounding box, and class label data from each chip into a TensorFlow Example object and then write each Example to the TFRecord. Code for doing so can be found on our GitHub page.
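For the CSV alternative mentioned above, a minimal sketch might look like the following. The column names and the one-row-per-box layout are assumptions chosen for this example, not a format defined by xView. It writes to an in-memory buffer so that it runs as is; in practice you would open a file instead.

```python
import csv
import io

# Hypothetical per-chip labels: chip file name -> list of (box, class id).
labels = {
    "chip_0.jpg": [((250, 100, 300, 200), 18)],
    "chip_1.jpg": [((0, 100, 50, 200), 18), ((120, 40, 180, 90), 73)],
}

buf = io.StringIO()  # swap for open("labels.csv", "w", newline="") on disk
writer = csv.writer(buf)
writer.writerow(["filename", "xmin", "ymin", "xmax", "ymax", "class_id"])
for filename, entries in labels.items():
    for (xmin, ymin, xmax, ymax), class_id in entries:
        writer.writerow([filename, xmin, ymin, xmax, ymax, class_id])

# Reading the data back gives one dictionary per bounding box.
buf.seek(0)
rows = list(csv.DictReader(buf))
print(len(rows))            # 3
print(rows[0]["filename"])  # chip_0.jpg
```

Note that values read back from a CSV are strings, so coordinates and class ids need an int() conversion before use.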

At this point, you have processed the xView dataset into a usable format! From here, the algorithmic design and training is up to you. If you would like to see baseline results for xView using the SSD algorithm, see our other blog post.

6. Data Augmentation

For the baseline experiments, I created several different datasets, one of which had augmented chips. There are many different ways to augment images, from changing contrast and HSV to rotations and crops. For my augmentations, I wrote several utility functions to perform rotation, shifting, salt-and-pepper noise, padding, and Gaussian blurring.
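As an example of one of these augmentations, salt-and-pepper noise sets a random fraction of pixels to pure white or pure black. The function below is a minimal sketch, not the utility used for the baseline; its name and the prob parameter are assumptions.

```python
import numpy as np

def salt_and_pepper(img, prob=0.05, seed=None):
    """Return a copy of a uint8 image with roughly a fraction prob of its
    pixels set to white (salt) or black (pepper). Hypothetical helper."""
    rng = np.random.default_rng(seed)
    out = img.copy()
    # One random draw per pixel location, shared across the color channels.
    r = rng.random(img.shape[:2])
    out[r < prob / 2] = 0            # pepper
    out[r > 1 - prob / 2] = 255      # salt
    return out

chip = np.full((300, 300, 3), 128, dtype=np.uint8)
noisy = salt_and_pepper(chip, prob=0.1, seed=0)
# Fraction of pixel locations that were changed; approximately prob.
changed = np.any(noisy != chip, axis=2).mean()
print(changed)
```

Because this noise does not move any pixels, the bounding boxes can be reused unchanged.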

Adding per-chip augmentation can increase dataset pre-processing time drastically. Specifically, the rotation function is slow when trying to process hundreds of thousands of chips.

Rotations by angles that are not multiples of 90 degrees require special consideration for bounding boxes. My augmentation code rotates the labels along with the image and keeps the boxes axis-aligned, meaning that bounding boxes rotated at non-multiples of 90 degrees will have a different area than the original. The bounding box center, however, will remain the same relative to the rotated image content.

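import numpy as np
from PIL import Image

# Rotates an image and its bounding boxes by deg degrees about pivot.
# Boxes stay axis-aligned, so each output box encloses the rotated corners.
def rotate_image_and_boxes(img, deg, pivot, boxes):
    # Map a negative angle to its positive equivalent (e.g. -30 -> 330)
    if deg < 0:
        deg = 360 + deg

    # Zero-pad the image based on the pivot, then rotate with PIL
    angle = 360 - deg
    padX = [img.shape[0] - pivot[0], pivot[0]]
    padY = [img.shape[1] - pivot[1], pivot[1]]
    imgP = np.pad(img, [padY, padX, [0, 0]], 'constant').astype(np.uint8)
    imgR = Image.fromarray(imgP).rotate(angle)
    imgR = np.array(imgR)

    # 2D rotation matrix applied to the box corners
    theta = deg * (np.pi / 180)
    R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
    newboxes = []
    for box in boxes:
        xmin, ymin, xmax, ymax = box

        # Shift the corners so the pivot is the origin (pivot is given as (y, x))
        xmin -= pivot[1]
        xmax -= pivot[1]
        ymin -= pivot[0]
        ymax -= pivot[0]
        bfull = np.array([[xmin, xmin, xmax, xmax], [ymin, ymax, ymin, ymax]])
        c = np.dot(R, bfull)
        # Shift back and clip the rotated corners to the image bounds
        c[0] += pivot[1]
        c[0] = np.clip(c[0], 0, img.shape[1])
        c[1] += pivot[0]
        c[1] = np.clip(c[1], 0, img.shape[0])

        # A box whose corners all clip to the same edge lies outside the image; zero it
        if np.all(c[1] == img.shape[0]) or np.all(c[1] == 0):
            c[0] = [0, 0, 0, 0]
        if np.all(c[0] == img.shape[1]) or np.all(c[0] == 0):
            c[1] = [0, 0, 0, 0]
        # Axis-aligned box enclosing the rotated corners
        newbox = np.array([np.min(c[0]), np.min(c[1]), np.max(c[0]), np.max(c[1])]).astype(np.int64)
        if not (np.all(c[1] == 0) and np.all(c[0] == 0)):
            newboxes.append(newbox)

    # Crop the padding off so the output has the input's shape
    return imgR[padY[0] : -padY[1], padX[0] : -padX[1]], newboxes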
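
To see why the area changes, consider the simplest case: rotating the four corners of a box about its own center and then taking the axis-aligned box that encloses them. The sketch below is independent of the function above and uses made-up numbers. A 10×10 box rotated by 45 degrees is enclosed by a box with side 10√2, so the area doubles from 100 to 200 while the center stays put.

```python
import numpy as np

def enclosing_box_after_rotation(box, deg):
    """Rotate a box (xmin, ymin, xmax, ymax) about its center and return
    the axis-aligned box enclosing the rotated corners."""
    xmin, ymin, xmax, ymax = box
    cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
    theta = np.deg2rad(deg)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    # Corners relative to the center, one column per corner.
    corners = np.array([[xmin, xmin, xmax, xmax],
                        [ymin, ymax, ymin, ymax]], dtype=float)
    corners -= np.array([[cx], [cy]])
    rotated = R @ corners + np.array([[cx], [cy]])
    return np.array([rotated[0].min(), rotated[1].min(),
                     rotated[0].max(), rotated[1].max()])

box = (100, 100, 110, 110)
new = enclosing_box_after_rotation(box, 45)
area = (new[2] - new[0]) * (new[3] - new[1])
center = ((new[0] + new[2]) / 2, (new[1] + new[3]) / 2)
print(area, center)  # area is about 200, center is about (105, 105)
```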

7. Conclusion and Getting the Code

We hope you found this blog post helpful for getting started! Feel free to reach out here with any questions. All code described in this post will be accessible here. We have also included a Jupyter notebook that walks through several preprocessing steps.

Source: Deep Learning on Medium