XView 2 Challenge | Part 1: Getting Started

Source: Deep Learning on Medium

Dataset Statistics

One of the key steps before training a model is EDA (Exploratory Data Analysis): exploring how many data points are available per class or group in the dataset. Luckily for us, the paper includes a section detailing the statistics of the imagery in the dataset. Here are the key details I’ve noted down:
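As a starting point for this kind of EDA, here is a minimal sketch of tallying damage classes from xBD labels. It assumes the label layout used by xBD: each post-disaster label is a JSON file whose `"features" -> "xy"` list holds polygons with a `"properties" -> "subtype"` damage class (verify the exact keys against your copy of the dataset).

```python
import json
from collections import Counter
from pathlib import Path

def count_damage_subtypes(labels):
    """Tally damage classes across parsed xBD post-disaster label dicts.

    Assumes each label dict has "features" -> "xy", a list of polygon
    features carrying "properties" -> "subtype" (the damage class).
    """
    counts = Counter()
    for label in labels:
        for feature in label.get("features", {}).get("xy", []):
            # Fall back to "un-classified" when no subtype is recorded.
            counts[feature["properties"].get("subtype", "un-classified")] += 1
    return counts

def count_label_dir(label_dir):
    """Convenience wrapper: tally every *_post_disaster.json under a directory."""
    paths = Path(label_dir).glob("*_post_disaster.json")
    return count_damage_subtypes(json.loads(p.read_text()) for p in paths)
```

Pointing `count_label_dir` at a disaster's `labels/` folder should reproduce per-class counts like the ones tabulated in the paper.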

Area of imagery (in km2 ) per disaster event. [Source: xBD Paper]

The imagery is highly imbalanced across disaster events. While the Portugal Wildfire and Pinery Bushfire cover around 8,000 km², the Mexico Earthquake and Palu Tsunami cover less than 1,000 km². The Mexico Earthquake and Palu Tsunami, however, make up for it in the number of polygon annotations: both contain around 100,000 labeled polygons.

Polygons in xBD per disaster event. [Source: xBD Paper]
Positive and negative imagery per disaster. [Source: xBD Paper]

We should also note that the pre- and post-disaster imagery is unbalanced. For example, as the diagram above shows, most of the dataset contains positive imagery (post-disaster imagery). Only a couple of disaster events have a balanced set of positive and negative imagery, e.g., the SoCal Fire, Portugal Wildfire, and Woolsey Fire.

Damage classification count. [Source: xBD Paper]

The final diagram shows us the damage classification counts. The distribution is heavily skewed towards the No Damage class, with 313,033 polygons, roughly eight times more than any other class. A handful of annotations are also marked as unclassified.
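One common way to compensate for this skew at training time is to weight the loss by inverse class frequency. The sketch below shows the idea; only the No Damage count (313,033) comes from the paper, and the other counts are hypothetical placeholders you would replace with your own tallies.

```python
def inverse_frequency_weights(counts):
    """Return per-class weights proportional to inverse class frequency.

    Normalised so a perfectly balanced dataset yields weight 1.0 per class.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {cls: total / (n_classes * c) for cls, c in counts.items()}

# No Damage count from the paper; the other figures are placeholders.
example_counts = {
    "no-damage": 313033,
    "minor-damage": 40000,   # placeholder
    "major-damage": 30000,   # placeholder
    "destroyed": 30000,      # placeholder
}
weights = inverse_frequency_weights(example_counts)
```

With numbers like these, the rare damage classes end up weighted several times higher than No Damage, pushing the model to pay attention to them.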

As stated in the paper, given the unbalanced data we are presented with a very challenging task: segmenting and classifying the damaged structures within the imagery. I presume this task will require heavy use of data augmentation and resampling techniques to create a relatively balanced set for training our model. We shall begin on this task in the next post, wherein we will set up our system for training our model. Until then, I encourage you to play around with the dataset on Kaggle Kernels and explore its features. You can even try the baseline repo provided on GitHub.
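To give a rough flavour of the rebalancing this might involve, here is a naive oversampling sketch: minority classes are randomly resampled (with replacement) up to the majority-class count. This is an illustrative technique, not the challenge's prescribed pipeline, and in practice you would combine it with augmentation so the duplicates are not pixel-identical.

```python
import random

def oversample_to_majority(samples_by_class, seed=0):
    """Oversample each minority class (with replacement) to the majority count.

    samples_by_class: dict mapping class name -> list of sample identifiers.
    Returns a shuffled, class-balanced list of samples.
    """
    rng = random.Random(seed)
    target = max(len(samples) for samples in samples_by_class.values())
    balanced = []
    for cls, samples in samples_by_class.items():
        balanced.extend(samples)
        # Draw extra copies at random until this class reaches the target size.
        balanced.extend(rng.choices(samples, k=target - len(samples)))
    rng.shuffle(balanced)
    return balanced
```

Each class then contributes an equal share of the training epoch, at the cost of repeating minority-class samples.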

I would like to extend a special thanks to Ritwik Gupta for reviewing the post and providing me with valuable inputs before publishing.